Re: [openib-general] IB routing discussion summary

2007-02-26 Thread Michael Krause
At 11:49 AM 2/21/2007, Sean Hefty wrote:
I sent a message on this topic to the IBTA several days ago, but I am still
awaiting details (likely early next week).

Unclear if that will occur.  I just responded to some e-mail in the IBTA on 
the router subject as well.  Given that discussion, I suspect it will be 
some time before the router dilemma is fully answered.


 It should not be carried in the CM REQ.  The SLID / DLID of the router
 ports should be derived through local subnet SA / SM query.  When a CM REQ
 traverses one or more subnets there will be potentially many SLID / DLID
 involved in the communication.   Each router should be populating its
 routing tables in order to build the new LRH attached to the GRH / CM REQ
 that it is forwarding to the next hop.

I'm referring to configuration of the QP, not the operation of the routers.

To establish a connection, the passive side QP needs to transition from 
Init to
RTR.  As part of that transition, the modify QP verb needs as input the
Destination LID of its local router.  It sounds like you expect the 
passive side
to perform an SA query to obtain its own local routing information, which 
would
essentially invalidate the data carried in the primary and alternate path 
fields
in the CM REQ.
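
For concreteness, here is a minimal libibverbs sketch of the Init-to-RTR
transition being discussed; it is only a sketch, and the router LID, remote
GID, QPN, and PSN arguments are placeholders for exactly the information
whose source is in question.

#include <infiniband/verbs.h>
#include <stdint.h>

/* Minimal sketch of the passive-side Init -> RTR transition.  The open
 * question above is where router_lid and remote_gid come from: the CM REQ
 * path fields, an SA query, or the incoming LRH. */
static int passive_init_to_rtr(struct ibv_qp *qp, uint8_t port,
                               uint16_t router_lid, union ibv_gid remote_gid,
                               uint32_t remote_qpn, uint32_t remote_psn)
{
    struct ibv_qp_attr attr = {
        .qp_state           = IBV_QPS_RTR,
        .path_mtu           = IBV_MTU_2048,
        .dest_qp_num        = remote_qpn,
        .rq_psn             = remote_psn,
        .max_dest_rd_atomic = 1,
        .min_rnr_timer      = 12,
        .ah_attr = {
            .is_global     = 1,           /* a GRH is required across subnets */
            .dlid          = router_lid,  /* LID of the local subnet router port */
            .sl            = 0,
            .src_path_bits = 0,
            .port_num      = port,
            .grh = {
                .dgid       = remote_gid, /* remote end node's port GID */
                .sgid_index = 0,
                .hop_limit  = 64,
            },
        },
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}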

The source always queries to obtain a subnet-local router Port.   A sink 
can simply reflect back the LRH with source / destination LID reversed, 
assuming it has that information, or it can query to find the optimal / 
preferred subnet-local router Port.


 From reading 12.7.11, 13.5.1, and 17.4, I do not believe that such a 
 requirement
was expected to be placed on the passive side of a connection.  The initial
response I received agreed with this.

 I'd need to go back, but the architecture is predicated on the SM and SA
 being strictly local, and for security purposes their communication should
 remain local.  Higher level management entities built to communicate with
 the SM and SA are responsible for cross-subnet communication without exposing
 the SA or SM to direct interaction.  P_Key and Q_Key management across
 subnets is an example of such communication across subnets that would not
 be exposed to the SA and SM.

My initial thoughts are that this sounds like a good idea.  It's not 
eliminating
the need for interacting with a remote SA, so much as it abstracts it to 
another
entity.

My hope is that we can reach an agreement on the CM REQ.  Depending on 
that, we still need to determine whether the existing SA attributes are 
sufficient to allow forming inter-subnet connections, and if they are, 
whether such attributes can be obtained.

A lot of discussion will be required within the IBTA to nail anything 
down.   As I noted above, I just provided answers to a number of questions 
posed, as well as opened up perhaps a few more.   I am not aware of a TTM to 
complete this work, but clearly some amount of standardization is required, 
and it will take a bit to define the scope so that the specification does 
not become so large that it takes a significant amount of time to develop 
and, more importantly, significant resources and time to validate that the 
routing protocol is solid.   Routing protocols are not as simple as some 
may think - they vary as a function of the functional robustness and 
scalability provided.

For now, I'll assume this discussion is on hold until the IBTA gets its act 
together.

Mike







Re: [openib-general] IB routing discussion summary

2007-02-20 Thread Michael Krause
At 02:05 PM 2/15/2007, Sean Hefty wrote:
Is this first an IBTA problem to solve if you believe there is a problem?

Based on my interpretation, I do not believe that there's an error in the 
architecture.  It seems consistent.  Additional clarification of what 
PathRecord fields mean when the GIDs are on different subnets may be 
needed, and a change to the architecture may make things easier to 
implement, but that's a separate matter.

I contend CM does not require anything that is subnet local other than to
target a given router port which should be derived from local SM/SA only

Then please state how the passive side obtains the information (e.g. 
SLID/DLID) it needs in order to configure its QP.  I claim that 
information is carried in the CM REQ.

It should not be carried in the CM REQ.  The SLID / DLID of the router 
ports should be derived through local subnet SA / SM query.  When a CM REQ 
traverses one or more subnets there will be potentially many SLID / DLID 
involved in the communication.   Each router should be populating its 
routing tables in order to build the new LRH attached to the GRH / CM REQ 
that it is forwarding to the next hop.


The alternatives that I see are:

1. The passive side extracts the data from the LRH that carries the CM REQ.
2. The passive side issues its own local path record query.

Will you please clarify where this information comes from?

The router protocol determines the path to the next hop.   As noted in prior 
e-mails, the router works in conjunction with the SM/SA to populate its 
database so that any CM or other query for a path record to get to / from 
the router can be derived and optimized based on local policy, e.g. QoS, 
within each subnet.
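
As a rough illustration of the forwarding state implied here (a sketch only -
the table layout and helper are not from any IBTA document), the router
protocol learns which GID prefixes are reachable through which ports, and the
new LRH for each forwarded packet is built from that table:

#include <stdint.h>
#include <string.h>

struct route_entry {
    uint64_t gid_prefix;    /* subnet prefix learned via the router protocol */
    uint16_t next_hop_lid;  /* DLID to place in the new LRH */
    uint8_t  out_port;      /* router port on which to emit the packet */
    uint8_t  sl;            /* SL chosen per the local subnet's policy (QoS) */
};

static struct route_entry route_table[64];
static int route_count;

/* Map a packet's GRH.DGID to the outgoing LRH fields; NULL means no route. */
static const struct route_entry *route_lookup(const uint8_t dgid[16])
{
    uint64_t prefix;

    memcpy(&prefix, dgid, sizeof(prefix));   /* upper 64 bits = subnet prefix */
    for (int i = 0; i < route_count; i++)
        if (route_table[i].gid_prefix == prefix)
            return &route_table[i];
    return NULL;
}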


I will further state that SA-SA communication sans perhaps a
P_Key / Q_Key service lookup should be avoided wherever possible.

I agree - which is why my proposal avoided SA-SA communication.  I see 
nothing in the architecture that prohibits a node from querying an SA that 
is not on its local subnet.

I'd need to go back, but the architecture is predicated on the SM and SA 
being strictly local, and for security purposes their communication should 
remain local.  Higher level management entities built to communicate with 
the SM and SA are responsible for cross-subnet communication without exposing 
the SA or SM to direct interaction.  P_Key and Q_Key management across 
subnets is an example of such communication across subnets that would not 
be exposed to the SA and SM.

Mike







Re: [openib-general] Immediate data question

2007-02-15 Thread Michael Krause
At 09:37 PM 2/14/2007, Devesh Sharma wrote:
On 2/14/07, Michael Krause [EMAIL PROTECTED] wrote:
At 05:37 AM 2/13/2007, Devesh Sharma wrote:
 On 2/12/07, Devesh Sharma [EMAIL PROTECTED] wrote:
 On 2/10/07, Tang, Changqing [EMAIL PROTECTED] wrote:

Not for the receiver, but the sender will be severely slowed down by
having to wait for the RNR timeouts.
   
RNR = Receiver Not Ready so by definition, the data flow
isn't going to
progress until the receiver is ready to receive data.   If a
receive QP
enters RNR for a RC, then it is likely not progressing as
desired.   RNR
was initially put in place to enable a receiver to create
back pressure to the sender without causing a fatal error
condition.  It should rarely be entered and therefore should
have negligible impact on overall performance however when a
RNR occurs, no forward progress will occur so performance is
essentially zero.
  
   Mike:
   I still do not quite understand this issue. I have two
   situations that have RNR triggered.
  
   1. Process A and process B are connected with a QP. A first posts a send to
   B; B does not post a receive. Then A and B are doing long-running
   RDMA_WRITEs to each other; A and B just check memory for the RDMA_WRITE
   message. Finally B will post a receive. Does the first pending send in A
   block all the later RDMA_WRITEs?
 According to the IBTA spec, the HCA will process WR entries in the strict
 order in which they are posted, so the send will block all WRs posted after
 this send - unless the HCA has multiple processing elements, though I think
 even then the processing order will be maintained by the HCA.
   If not, since RNR is triggered
   periodically until B posts a receive, does it affect the RDMA_WRITE
   performance between A and B ?
  
   2. Extend the above to three processes: A connects to B, B connects to C, so B
   has two QPs but one CQ.  A posts a send to B, B does not post a receive,
 Posting order across QPs is not guaranteed, hence the presence of the same CQ
 or a different CQ will not affect anything.
   rather B and C are doing long-running RDMA_WRITEs or send/recv. But B
 If the RDMA WRITE is _on_ B, there is no effect on performance. If the RDMA WRITE is _on_ C,
I am sorry - I missed that; in both cases the same DMA channel is in use.
 it _may_ affect the performance, since the load is on the same HCA. In the case of
 Send/Recv it again _may_ affect the performance, for the same reason.

Seems orthogonal.  Any time h/w is shared, multiple flows will have an
impact on one another.  That is why we have the different arbitration
mechanisms to enable one to control that impact.
Please, can you explain it more clearly?

Most I/O devices are shared by multiple applications / kernel 
subsystems.   Hence, the device acts as a serialization point for what goes 
on the wire / link.   Sharing = resource contention and in order to add any 
structure to that contention, a number of technologies provide arbitration 
options.   In the case of IB, the arbitration is confined to VL arbitration 
where a given data flow is assigned to a VL and that VL is serviced at some 
particular rate.   A number of years ago I wrote up how one might also 
provide QP arbitration (not part of the IBTA specifications) and I 
understand some implementations have incorporated that or a variation of 
the mechanisms into their products.

In addition to IB link contention, there is also PCI link / bus 
contention.   For PCIe, given most designs did not want to waste resources 
on multiple VC, there really isn't any standard arbitration 
mechanism.   However, many devices, especially a device like a HCA or a 
RNIC, already have the concept of separate resource domains, e.g. QP, and 
they provide a mechanism to associate how the QP's DMA requests or 
interrupt requests are scheduled to the PCIe link.
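
As a loose illustration of the kind of weighted arbitration being described
(a conceptual sketch only, not the VLArbitrationTable wire format or any
vendor's QP scheduler):

#include <stdint.h>

/* Conceptual weighted round-robin arbiter: each lane (a VL, or a QP group in
 * the non-standard QP-arbitration variant) gets a weight, and lanes are
 * serviced in proportion to those weights. */
struct arb_entry {
    uint8_t lane;    /* VL or QP scheduling group */
    uint8_t weight;  /* share of the link for this lane */
};

static int pick_next_lane(const struct arb_entry *tbl, int n,
                          int *cur, int *credits)
{
    if (n <= 0)
        return -1;
    for (int tries = 0; tries <= n; tries++) {
        if (*credits > 0) {
            (*credits)--;               /* spend one unit on the current lane */
            return tbl[*cur].lane;
        }
        *cur = (*cur + 1) % n;          /* lane exhausted its allotment: advance */
        *credits = tbl[*cur].weight;    /* reload the next lane's allotment */
    }
    return -1;                          /* every weight is zero */
}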


   must send RNR NAKs periodically to A, right? So does the pending message
   from A affect B's overall performance between B and C?
 But the RNR NAK does not last a very long time; possibly you will not even be
 able to observe this performance hit. The moment the rnr_counter
 expires, the connection will be broken!

Keep in mind the timeout can be infinite.  RNR NAKs are not expected to be
frequent, so their performance impact was considered reasonable.
Thanks I missed that.

It is a subtlety within the specification that is easy to miss.

Mike 






Re: [openib-general] IB routing discussion summary

2007-02-15 Thread Michael Krause
At 11:39 AM 2/15/2007, Sean Hefty wrote:
Ideas were presented around trying to construct an 'inter-subnet path record'
that contained the following:
- Side A GRH.SGID = active side's Port GID
- Side A GRH.DGID = passive side's Port GID
- Side A LRH.SLID = any active side's port LID
- Side A LRH.DLID = A subnet router
- Side A LRH.SL   = SL to A subnet router
- Side B GRH.SGID = Side A GRH.DGID
- Side B GRH.DGID = Side A GRH.SGID
- Side B LRH.SLID = any passive side's port LID
- Side B LRH.DLID = B subnet router
- Side B LRH.SL   = SL to B subnet router

Until I can become convinced that the above isn't needed, I've been trying 
to brainstorm ways to obtain this information.
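
(For readability, the proposed record could be pictured roughly as the struct
below; this is an illustrative sketch, not an IBTA-defined attribute.)

#include <infiniband/verbs.h>   /* union ibv_gid */
#include <stdint.h>

struct inter_subnet_path {
    union ibv_gid sgid;     /* active side's port GID; unchanged end to end */
    union ibv_gid dgid;     /* passive side's port GID; unchanged end to end */
    struct {
        uint16_t slid;      /* a local port LID on that subnet */
        uint16_t dlid;      /* that subnet's router port LID */
        uint8_t  sl;        /* SL to reach the router on that subnet */
    } side_a, side_b;       /* side B's GIDs are side A's, reversed */
};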

Is this first an IBTA problem to solve, if you believe there is a 
problem?   I believe the track you are on is incorrect; any attempt to 
surface subnet-local information across subnets will create unnecessary 
complexity and therefore make such solutions less practical to execute 
within the industry.   I've tried to illustrate the role of the router, how 
the flows work, etc.  I believe these to be correct and reflected not 
only in the existing specifications but also in the prior router specification 
work and thinking.   They also parallel the IP world quite nicely, which 
should lend credence to the view that subnet-local information does not need 
to be exchanged between subnets.   I contend CM does not require anything that 
is subnet local other than to target a given router port, which should be 
derived from local SM/SA information only.  I will further state that SA-SA 
communication, sans perhaps a P_Key / Q_Key service lookup, should be avoided 
wherever possible.

I strongly urge you to take this problem to the IBTA where any issues 
regarding specification interpretation can be sorted out and an official 
position taken.   This will yield a faster and more successful 
investigation into whether there is a problem and if so, how best to solve it.

Mike


0. Have the SA return pairs of PathRecords for inter-subnet queries.

But, since this simply punts the problem to the SA, my other thought is to 
define the following:

1. Inter-subnet PathRecord/MultiPathRecord Get/GetTable requests require 
both an SGID and DGID, one of which must be subnet local to the processing SA.
2. PathRecord/MultiPathRecord Get/GetTable request fields are relative to
the subnet specified by the SGID.
3. PathRecord GetResp/GetTableResp response fields are relative to the
subnet local to the processing SA.
4. SAs are addressable by a well-known GID suffix.

I think this may allow establishing inter-subnet connections.  As an 
example of
its usage:

a. Active side issues a PathRecord query to the local SA with SGID=local,
DGID=remote.
b. SA responds with PathRecord(s).
c. Active side selects local PathRecord P1.
d. Active side issues a PathRecord query to the remote SA using PathRecord 
P1 to
format the request: SGID, DGID, SLID, DLID, TC, FL, SL, etc.
e. The remote SA responds with PathRecord(s).  The SA must ensure that 
packets injected into the internetwork using P1 will route to the returned 
records.
f. Active side selects remote PathRecord P2.
g. Active side validates that remote packets injected using P2 route to P1.

At this point, the active side should have path information that can be 
used to
configure the QPs for a connection.

Assuming that this will work, what I don't like about it is the validation 
at step g.  This adds a third query that I don't see a way to 
eliminate.  If the check fails, the client restarts at step c.
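
A minimal sketch of steps a-g, with hypothetical helpers (sa_path_query(),
paths_reverse_route(), struct path_rec) standing in for the real SA
PathRecord machinery:

#include <infiniband/verbs.h>
#include <stdint.h>

struct path_rec {                      /* just the fields used in the example */
    uint16_t slid, dlid;
    uint8_t  sl, tclass;
    uint32_t flow_label;
};

/* Hypothetical helpers: query an SA (local or remote) for PathRecords, and
 * check that packets injected with 'back' will route to 'fwd' (step g). */
int sa_path_query(int which_sa, const union ibv_gid *sgid,
                  const union ibv_gid *dgid, const struct path_rec *hint,
                  struct path_rec *out, int max);
int paths_reverse_route(const struct path_rec *back, const struct path_rec *fwd);

enum { LOCAL_SA, REMOTE_SA };

static int build_inter_subnet_path(const union ibv_gid *sgid,
                                   const union ibv_gid *dgid,
                                   struct path_rec *p1, struct path_rec *p2)
{
    struct path_rec local[8], remote[8];
    int nl, nr;

    nl = sa_path_query(LOCAL_SA, sgid, dgid, NULL, local, 8);        /* a, b */
    for (int i = 0; i < nl; i++) {
        *p1 = local[i];                                               /* c */
        nr = sa_path_query(REMOTE_SA, sgid, dgid, p1, remote, 8);     /* d, e */
        for (int j = 0; j < nr; j++) {
            *p2 = remote[j];                                          /* f */
            if (paths_reverse_route(p2, p1))                          /* g */
                return 0;               /* usable P1/P2 pair found */
        }
    }
    return -1;                          /* check failed for all pairs */
}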

- Sean






Re: [openib-general] IB routing discussion summary

2007-02-14 Thread Michael Krause


I do not see the need for any of this.   The router protocol should be 
designed to work with each subnet's SM / SA to provide information on what 
GID prefix is on each router Port.  This is used to look up the subnet 
local LRH fields.

The only cross-subnet challenges are global in nature, e.g. what P_Key 
to use and how to manage those across subnets, or how TClass should be 
interpreted to achieve consistent behavior independent of how the TClass 
is mapped locally to an SL within each subnet.    These were the types of 
challenges remaining when we stopped development of the router 
specification.   If the IBTA decides to develop a router specification then 
it might be best to join that effort and work it out in detail before 
attempting to develop the management infrastructure.  Development here may 
be able to lag slightly in order to validate the technical directions that 
the spec will take without having to wait until 1.0 to say, yep, this looks 
good, or here is where you need to change the spec.   It is not clear what 
can be developed until there is a router specification to execute to in the 
industry.

Mike


At 01:17 PM 2/13/2007, Sean Hefty wrote:
Here's a first take at summarizing the IB routing discussion.

The following spec references are noted:

9.6.1.5 C9-54. The SLID shall be validated (for connected QPs).
12.7.11. CM REQ Local Port LID - is LID of remote router.
13.5.4: Defines reversible paths.

The main discussion point centered on trying to meet 9.6.1.5 C9-54.  This
requires that the forward and reverse data flows between two QPs traverse the
same router LID on both subnets.  The idea was raised to try to eliminate this
compliance statement for packets carrying a GRH, but this is viewed as going
against the spirit of IBA.

Ideas were presented around trying to construct an 'inter-subnet path record'
that contained the following:

- Side A GRH.SGID = active side's Port GID
- Side A GRH.DGID = passive side's Port GID
- Side A LRH.SLID = any active side's port LID
- Side A LRH.DLID = A subnet router
- Side A LRH.SL   = SL to A subnet router

- Side B GRH.SGID = Side A GRH.DGID
- Side B GRH.DGID = Side A GRH.SGID
- Side B LRH.SLID = any passive side's port LID
- Side B LRH.DLID = B subnet router
- Side B LRH.SL   = SL to B subnet router

It is still unclear how such a record can be constructed.  But communication
with remote SAs might be achieved by using a well-known GID suffix.  It's also
unclear whether the fields in a path record are relative to the SA's subnet or
the SGID.

It's anticipated that SAs will need to interact with routers, but in an
unspecified manner.






Re: [openib-general] Problem is routing CM REQ

2007-02-14 Thread Michael Krause
At 01:36 PM 2/14/2007, Sean Hefty wrote:
Assume that the active and passive sides of a connection request are on 
different subnets and:

Active side - LID 1
Active side router - LID 2
Passive side - LID 93
Passive side router - LID 94

What values are you suggesting are used for:

Active side QP - DLID
Passive side QP - DLID
CM REQ Primary Local Port LID

Subnet A is:
QP Port LID 1
Router A Port LID 2

Subnet B is:
QP Port LID 93
Router B Port LID 94


Process steps:

- Router A populates SM / SA A with the GID prefix it can route.   SM / SA 
A will have configured the router Port with the appropriate local route 
information and hence have assigned it LID 2.

- CM associated with Port LID 1 queries the SM / SA to identify a path to a 
GID Prefix.   SM / SA returns a path record indicating a global route, i.e. 
one that requires a GRH, is available and provides the CM with the 
information targeting router Port LID 2.

- CM creates a REQ and populates the global information to identify the 
remote endnode.  The LRH generated targets Port LID 2.  The GRH is 
generated to target the remote subnet so the router will comprehend how to 
process the packet.

- Router A receives the packet and examines the GRH.   Via its router 
protocol, it has previously identified what router Port will lead to the 
next hop on the path to the destination endnode.

- If the endnode is subnet local, say in subnet B, then the router generates an 
LRH targeting QP Port LID 93 and emits it on router Port LID 94.

- QP in subnet B receives the CM REQ and validates the LRH.  Given these 
messages are via UD service and not RC / UC, the validation rules for the 
LRH are different.   The CM agent processes the request and returns an 
appropriate response by filling in a GRH that swaps the SGID and the 
DGID, and so forth, so the addresses are basically reflected back.   The 
response uses QP port LID 93 and targets router Port 94.

- Router B Port 94 receives the response.  It parses the GRH and determines 
the next hop port.   In this example, the response goes out router A Port 2 
and targets QP Port LID 1.  The LRH is generated using these 
fields.  Again, since CM is targeting a UD QP, the LRH validation rules are 
different.

- Once the connection is established, the QP on subnet A will send packets 
to QP on subnet B using a GRH that is processed by the router with each QP 
using a LRH that targets the router port locally attached to its 
subnet.   The router is responsible for generating a LRH to forward to the 
next hop.   These packets are now in a RC / UC data flow so the LRH 
validation is per the sections cited in this e-mail string.

In all cases, the router protocol is responsible for generation of a LRH 
that will work within each subnet.  There is no exchange of subnet local 
information between the subnets.  Each subnet's SM/SA only tracks what is 
local to it as well as what GID prefix can be routed via a given LID.   If 
multiple LIDs can route to a given GID prefix, multiple path records are 
returned.   Which to choose is not specified by the specifications so it 
can be any policy one desires. If the router protocol communicates a cost 
to a given path in order to give an indication of appropriateness for a 
given workload, then this should be communicated to the CM agent.

Mike 






Re: [openib-general] IB routing discussion summary

2007-02-14 Thread Michael Krause
At 02:02 PM 2/14/2007, Sean Hefty wrote:
Mike, are you expecting that routers will modify CM messages as they flow 
between subnets?

The router parses the GRH, strips the LRH, and attaches a new LRH for the next 
hop with its contents filled in per the router's internal 
policies.   Nothing more for the main packet processing.   The router 
interacts with each subnet's SM/SA to ensure the path records can be 
provided to the CM to fill in the right information.

Mike 






Re: [openib-general] Problem is routing CM REQ

2007-02-13 Thread Michael Krause
At 03:48 PM 2/12/2007, Sean Hefty wrote:
An endnode look up should be to find the address vector to the 
remote.   A look up may return multiple vectors.   The SLID would 
correspond to each local subnet router port that acts as a first-hop 
destination to the remote subnet.I don't see why the router protocol 
would not simply enable all paths on the local subnet to a given remote 
subnet be acquired.  All of the work is kept local to the SA / SM in the 
source subnet when determining a remote path to take.
Why is there any need to define more than just this?

For an RC QP, we need at least two sets of LIDs.  In the simplest case, we 
need the SLID/router DLID for the local subnet, and the router SLID/DLID 
for the remote subnet.  The problem is in obtaining the SLID/DLID for the 
remote subnet.

Not quite.   The router protocol should determine the next hop LID to be 
used either to reach the destination endnode, if it is in the local subnet, 
or to reach the next router on the path to the remote.   CM only needs to be 
concerned with what is in the local subnet to find the router or the 
endnode.  It does not need to comprehend the remote subnet(s) LID.   That is 
for the router protocol to determine.  CM also must understand the GIDs 
involved, which the router will process to figure out its LID mapping to the 
next hop.

Mike  






Re: [openib-general] Immediate data question

2007-02-13 Thread Michael Krause
At 05:37 AM 2/13/2007, Devesh Sharma wrote:
On 2/12/07, Devesh Sharma [EMAIL PROTECTED] wrote:
On 2/10/07, Tang, Changqing [EMAIL PROTECTED] wrote:
   
   Not for the receiver, but the sender will be severely slowed down by
   having to wait for the RNR timeouts.
  
   RNR = Receiver Not Ready so by definition, the data flow
   isn't going to
   progress until the receiver is ready to receive data.   If a
   receive QP
   enters RNR for a RC, then it is likely not progressing as
   desired.   RNR
   was initially put in place to enable a receiver to create
   back pressure to the sender without causing a fatal error
   condition.  It should rarely be entered and therefore should
   have negligible impact on overall performance however when a
   RNR occurs, no forward progress will occur so performance is
   essentially zero.
 
  Mike:
  I still do not quite understand this issue. I have two
  situations that have RNR triggered.
 
  1. Process A and process B are connected with a QP. A first posts a send to
  B; B does not post a receive. Then A and B are doing long-running
  RDMA_WRITEs to each other; A and B just check memory for the RDMA_WRITE
  message. Finally B will post a receive. Does the first pending send in A
  block all the later RDMA_WRITEs?
According to the IBTA spec, the HCA will process WR entries in the strict
order in which they are posted, so the send will block all WRs posted after
this send - unless the HCA has multiple processing elements, though I think
even then the processing order will be maintained by the HCA.
  If not, since RNR is triggered
  periodically until B posts a receive, does it affect the RDMA_WRITE
  performance between A and B ?
 
  2. Extend the above to three processes: A connects to B, B connects to C, so B
  has two QPs but one CQ.  A posts a send to B, B does not post a receive,
Posting order across QPs is not guaranteed, hence the presence of the same CQ
or a different CQ will not affect anything.
  rather B and C are doing long-running RDMA_WRITEs or send/recv. But B
If the RDMA WRITE is _on_ B, there is no effect on performance. If the RDMA WRITE is _on_ C,
it _may_ affect the performance, since the load is on the same HCA. In the case of
Send/Recv it again _may_ affect the performance, for the same reason.

Seems orthogonal.  Any time h/w is shared, multiple flows will have an 
impact on one another.  That is why we have the different arbitration 
mechanisms to enable one to control that impact.

  must send RNR NAKs periodically to A, right? So does the pending message
  from A affect B's overall performance between B and C?
But the RNR NAK does not last a very long time; possibly you will not even be
able to observe this performance hit. The moment the rnr_counter expires,
the connection will be broken!

Keep in mind the timeout can be infinite.  RNR NAKs are not expected to be 
frequent, so their performance impact was considered reasonable.
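
For reference, a minimal libibverbs sketch of where these knobs live: the
requester's rnr_retry is set in the RTR-to-RTS transition, and a value of 7
means retry forever (the infinite case noted above), while the responder
advertises its min_rnr_timer during its own RTR transition. Error handling
and tuning of the other attributes are omitted.

#include <infiniband/verbs.h>
#include <stdint.h>

static int rtr_to_rts(struct ibv_qp *qp, uint32_t sq_psn)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14,   /* ACK timeout exponent */
        .retry_cnt     = 7,    /* transport retries */
        .rnr_retry     = 7,    /* 7 == infinite RNR retries */
        .sq_psn        = sq_psn,
        .max_rd_atomic = 1,
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}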

Mike

 
  Thank you.
 
  --CQ
 
 
  
   Mike
  
  
  
 
 
 






Re: [openib-general] Problem is routing CM REQ

2007-02-13 Thread Michael Krause
At 04:10 PM 2/12/2007, Jason Gunthorpe wrote:
On Mon, Feb 12, 2007 at 03:31:15PM -0800, Michael Krause wrote:

  TClass is intended to communicate the end-to-end QoS desired.   TClass is
  then mapped to a SL that is local to each subnet.   A flow label is
  intended to be much the same as in the IP world and is left, in essence, to
  routers to manage.   An endnode look up should be to find the address
  vector to the remote.   A look up may return multiple vectors.   The SLID
  would correspond to each local subnet router port that acts as a first-hop
  destination to the remote subnet.   I don't see why the router protocol
  would not simply enable all paths on the local subnet to a given remote
  subnet to be acquired.  All of the work is kept local to the SA / SM in the
  source subnet when determining a remote path to take.   Why is there any
  need to define more than just this?  Define a router protocol to
  communicate each subnet's prefix, TClass, etc. and apply KISS.   A
  management entity that wants to manage how each subnet provides router
  management in terms of route selection, etc. can be constructed by using
  the existing protocols / tools combined with a new router protocol which
  only does DGID to next hop SLID mapping.

All of this complexity is due to the RC QP requirement that the SLID
of an incoming LRH match the DLID programmed into the QP.

Translated into a network with routers this means that for a RC flow
to successfully work both the *forward* and *reverse* direction must
traverse the same router *LID* not just *port* on both subnets.

That is a given, since the LID = path and the same path must be used to ensure 
strong ordering is maintained.

Please see the little ascii diagram I drew in a prior email to
understand my concern.

There is no such restriction in a real IP network. It would be akin to
having a host match the source MAC address in the ethernet frame to
double check that it came from the router port it is sending outgoing
packets to. Which means simple one-sided solutions from IP land don't
work here.

Things work exactly the way you outline today for UD. They don't work
at all for the general case of RC. Get rid of the QP requirement and
things work the way you outline for RC too. Keep it in and you must
use the FlowLabel to force the flows onto the right router LID.

The same path must always be used to maintain strong ordering.  This is an 
immutable part of IB technology.

That is why I said previously that the QP matching rules are a
mistake. The best way to solve this is to change C9-54 to only be in
effect if the GRH is not present.

I disagree.  We were very explicit in how and why we constructed those rules.

CM also introduces the much smaller problem of getting the LIDs to the
passive side - but that cannot be solved without a broad solution to
the RC QP SLID matching problem.

Mike 






Re: [openib-general] Problem is routing CM REQ

2007-02-13 Thread Michael Krause
At 01:14 PM 2/13/2007, Sean Hefty wrote:
It does not need to comprehend the remote subnet(s) LID.
That is the router protocol to determine.  CM also must understand the 
GIDs involved which the router will process to figure out its LID mapping 
to the next hop.

The CM REQ carries the remote router LID (primary local port lid - 
12.7.11) and remote endpoint LID (primary remote port lid - 12.7.21).

Let me clarify what the specification is saying, which is also what I'm saying.

A LID is subnet local - on that we can all agree.   The CM REQ contains 
either the LID of a local subnet CA or the LID of a local router which will 
move the packet to the next hop toward the destination.   12.7.11 is basically 
saying that the remote LID is the LID of the local subnet's router 
Port.   12.7.21 also refers to the remote LID, but in each subnet that is 
either the router Port's LID or the destination CA's LID.

 From an operational flow perspective, CM would:

Query to see if the destination CA is on the local subnet
If yes, then obtain the associated records to find the local LID
If no, then obtain the set of records that contain the local addressing to 
a router Port that will progress connection establishment to the next hop 
on the way to the destination.
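
A small sketch of that decision, with hypothetical helpers standing in for
the PathRecord queries to the local SM/SA:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct path_info {
    uint16_t slid, dlid;    /* always LIDs of the local subnet */
    uint8_t  sl;
    int      via_router;    /* dlid is a local router port, not the CA */
};

/* Hypothetical helpers wrapping SA PathRecord queries to the local SM/SA. */
int sa_query_local_ca(const union ibv_gid *dgid, struct path_info *out);
int sa_query_router_port(const union ibv_gid *dgid, struct path_info *out);

static uint64_t gid_prefix(const union ibv_gid *gid)
{
    uint64_t p;
    memcpy(&p, gid->raw, sizeof(p));        /* upper 64 bits of the GID */
    return p;
}

/* Resolve the local-subnet addressing for dgid: the destination CA's own LID
 * if it is on this subnet, otherwise the LID of a local router port. */
static int resolve_path(const union ibv_gid *local_gid,
                        const union ibv_gid *dgid, struct path_info *out)
{
    if (gid_prefix(dgid) == gid_prefix(local_gid))
        return sa_query_local_ca(dgid, out);
    return sa_query_router_port(dgid, out);
}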

While there isn't a router specification any longer, the basic operation is 
very much like that of an IP subnet.   The router protocol establishes a 
set of routes for a given subnet prefix and then communicates that to each 
SM/SA so that queries will resolve the optimal router Port.   Chapter 8 
provides clear guidance in this regard.  Chapter 12 is basically stating 
what to plug into various fields, with all LIDs being local only to the 
subnet where they are managed.   The primary global knowledge that one must 
have across subnets in order to establish a connection or communication flow is:

- SGID
- DGID
- P_Key
- Q_Key

There really isn't much more than this to comprehend.  The TClass and Flow 
Labels were expected to be provided via the router protocol, so the 
management requirements are really just query lookups.

Mike 






Re: [openib-general] Problem is routing CM REQ

2007-02-13 Thread Michael Krause
At 02:02 PM 2/13/2007, Jason Gunthorpe wrote:
On Tue, Feb 13, 2007 at 12:49:57PM -0800, Michael Krause wrote:

  Translated into a network with routers this means that for a RC flow
  to successfully work both the *forward* and *reverse* direction must
  traverse the same router *LID* not just *port* on both subnets.
 
  That is a given since the LID = path and same path must be used to insure
  strong ordering is maintained.

I think you are missing what I'm saying. IB within a subnet has the
path selected by the DLID only.

The actual path selection is a policy decision outside the scope of the 
specification - it appears this is your main concern, in that the 
specification does not state "take these N parameters and apply the 
following algorithm to identify a path."   The address vector can be 
comprised of many fields, including a LID range.  The actual DLID selection 
is done above, as there can be a variety of policies or constraints imposed 
for a given data flow.   I agree that packet switching within a subnet is via the DLID.

So the construction process for a QP is to choose two endport LIDs, reverse 
them on one side and then query the SA for the forward and reverse SL. 
That gives you a pair of workable QPs.

SL, LID, etc. are all uploaded into the management database for the SM / SA 
to access, and much more robust information can be loaded as well - going 
well beyond what the IBTA specified - in order to provide additional 
interpretations / information to guide path selection.   A query can return 
multiple records if multi-path has been configured.  Policy above is used 
to construct the CM messages, which communicate the preferred path.    The 
CM messages for establishment across subnets should be sufficient in their 
existing content to work independent of how the actual routing is 
accomplished.

This same procedure doesn't work for routers.
Consider a case where a router port has LID 1 and an end port has
LIDs 3,4.
The end port establishes two RC QPs:
  #1: SLID=3, DLID=1
  #2: SLID=4, DLID=1
Both have the same DGID - how is the router expected to know that QP
#1 requires one set of LIDs and QP #2 requires a different set?

For all intents and purposes, within a local subnet, a router Port is 
treated the same as a CA.  If there are multiple paths between a router Port 
and a given CA Port, i.e. multiple LIDs are configured, then the router is 
supposed to query the SM / SA database and obtain the appropriate records 
and make a decision that remains valid for the lifetime of the data 
flow.   The purpose of the TClass is to enable a local mapping to SL which 
can also be used as input into LID selection.   The flow label is left open 
in its value and was expected to be used much like it is in IP.   People 
considered encoding it or at a minimum, using it as an input parameter to 
identify the associated LID for the flow but that was not agreed to since 
the router vendors at the time wanted it left largely opaque.


Section 19.2.4.1 seems to make it explicit to me that this is a valid 
situation.

Yes, 19.2.4.1 supports multi-path within a given subnet.

To have this work the router must use the flow label to identify the
correct DLID. SA/CM must be enhanced in some way to let the two sides
exchange flow labels.

That is a policy decision or something for a TBD router protocol 
specification.   It is not required to use the Flow Label.

This problem is worse if you have multiple independent redundent
routers on your subnet, or LMC != 0. Then you now have the problem of
SLID matching as well as DLID matching.

It is no worse due to the existence of multi-path.   There are many 
variables involved in creating a viable router protocol specification, which 
is, in part, why the IBTA chose not to complete that work.

Strong ordering is maintained in all cases because the routers always
make consistent choices for the LRH.DLID on a session by session
basis.

Agreed.  The router is responsible for ensuring a consistent path is used 
for a given flow.  That does not preclude multi-path, nor does it make 
multi-path more complicated as a result.


  That is why I said previously that the QP matching rules are a
  mistake. The best way to solve this is to change C9-54 to only be in
  effect if the GRH is not present.
 
  I disagree.  We were very explicit in how and why we constructed those
  rules.

Do you know of a solution then?

If C9-54 is a very deliberate design then it must be that the CM
specification in Chapter 12 is not designed to handle the
ramifications of C9-54.

I just can't see how to fit both CM and C9-54 together into a workable
solution.

You are arguing about a router protocol problem that does not exist - or 
perhaps I just don't get it.   We did progress the router specification, or 
at least the operating models behind it, sufficiently to validate that both 
Chapter 9 and Chapter 12 worked as specified (as well as chapters 8 and 
19).   Yes, there are implementation issues within

Re: [openib-general] dapl broken for iWARP

2007-02-12 Thread Michael Krause
At 07:29 AM 2/9/2007, Kanevsky, Arkady wrote:
Mike,
this is not a DAPL issue.
There are 2 ways to deal with it.
One is for all ULPs to use private data to exchange CM info.
Yes, some ULPs, like SDP, do that in their hello message.

Another is to let CM handle it.
This way ULP does not have to deal with it.
This is analogous to the IBTA CM IP addressing Annex.
It ensures backwards compatibility and does not break any existing apps
which use MPA as specified by IETF.

No need to bother IETF until we have it working.

Given what it took to get MPA specified, I don't see changing the 
specification for this as likely to be welcomed by many.   The ULPs used 
within the IETF are largely able to solve this problem at their login 
exchange, so unless there is some groundswell of IETF ULPs that can't solve 
it as these do, I think it may be a challenge for this to gain any traction.

Mike

Thanks,

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300


  -Original Message-
  From: Michael Krause [mailto:[EMAIL PROTECTED]
  Sent: Thursday, February 08, 2007 4:27 PM
  To: Kanevsky, Arkady; Steve Wise; Arlin Davis
  Cc: openib-general
  Subject: Re: [openib-general] dapl broken for iWARP
 
  At 07:43 AM 2/8/2007, Kanevsky, Arkady wrote:
  That is correct.
  I am working with Krishna on it.
  Expect patches soon.
  
  By the way the problem is not DAPL specific and so is a proposed
  solution.
  
  There are 3 aspects of the solution.
  One is APIs. We suggest that we do not augment these.
  That is a connection requestor sets its QP RDMA ORD and IRD.
  When connection is established user can check the QP RDMA
  ORD and IRD
  to see what he has now to use over the connection.
  We may consider to extend QP attributes to support transport
  specific
  parameters passing in the future.
  For example, iWARP MPA CRC request.
  
  Second is the semantic that CM provides.
  The proposal is to match IBCM semantic.
   That is, CM guarantees that the local IRD is >= the remote ORD.
  This guarantees that incoming RDMA Read requests will not
  overwhelm the
  QP RDMA Read capabilities.
   Again, there are no changes to IBCM, only to IWCM.
  Notice that as part of this IWCM will pass down to driver
  and extract
  from driver needed info.
  
  The final part is iWARP CM extension to exchange RDMA ORD, IRD.
  This is similar to IBTA Annex for IP Addressing.
   The harder part is that this will eventually require an IETF MPA spec
  extension, and the fact that MPA protocol is implemented in
  RNIC HW by
  many vendors, and hence can not be done by IWCM itself.
 
  We looked at this quite a bit during the creation of the
  specification.   All of the targeted usage models exchange
  this information
   as part of their hello or login exchanges.  As such, the
  hum was to
  not change MPA to communicate such information and leave it
  to software to
  exchange these values through existing mechanisms.   I
  seriously doubt
  there will be much support for modifying the MPA
  specification at this stage since the implementations are
  largely complete and a modification would have to deal with
  the legacy interoperability issue which likely would be
  solved in software any way.  It would be simpler to simply
  modify the underlying DAPL implementation to exchange the
  information and keep this hidden from both the application
  and the RNIC providers.
 
  Mike
 
 
  Thanks,
  
  Arkady Kanevsky   email: [EMAIL PROTECTED]
  Network Appliance Inc.   phone: 781-768-5395
  1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
  Waltham, MA 02451   central phone: 781-768-5300
  
  
-Original Message-
From: Steve Wise [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 07, 2007 6:12 PM
To: Arlin Davis
Cc: openib-general
Subject: Re: [openib-general] dapl broken for iWARP
   
On Wed, 2007-02-07 at 15:05 -0800, Arlin Davis wrote:
 Steve Wise wrote:

 On Wed, 2007-02-07 at 14:02 -0600, Steve Wise wrote:
 
 
 Arlin,
 
 The OFED dapl code is assuming the responder_resources and
 initiator_depth passed up on a connection request event
are from the
 remote peer.  This doesn't happen for iWARP.  In the
current iWARP
 specifications, its up to the application to exchange this
 information somehow. So these are defaulting to 0 on the
server side
 of any dapl connection over iWARP.
 
 This is a fairly recent change, I think.  We need to
  come up with
 some way to deal with this for OFED 1.2 IMO.
 
 
 Yes, this was changed recently to sync up with the
  rdma_cm changes
 that exposed the values.

 
 
 
 The IWCM could set these to the device max values for instance.
 
 
 That would work fine

Re: [openib-general] Immediate data question

2007-02-12 Thread Michael Krause
At 09:10 PM 2/11/2007, Devesh Sharma wrote:
On 2/10/07, Tang, Changqing [EMAIL PROTECTED] wrote:
  
  Not for the receiver, but the sender will be severely slowed down by
  having to wait for the RNR timeouts.
 
  RNR = Receiver Not Ready so by definition, the data flow
  isn't going to
  progress until the receiver is ready to receive data.   If a
  receive QP
  enters RNR for a RC, then it is likely not progressing as
  desired.   RNR
  was initially put in place to enable a receiver to create
  back pressure to the sender without causing a fatal error
  condition.  It should rarely be entered and therefore should
  have negligible impact on overall performance however when a
  RNR occurs, no forward progress will occur so performance is
  essentially zero.

Mike:
 I still do not quite understand this issue. I have two
situations that have RNR triggered.

1. Process A and process B are connected with a QP. A first posts a send to
B; B does not post a receive. Then A and B are doing long-running
RDMA_WRITEs to each other; A and B just check memory for the RDMA_WRITE
message. Finally B will post a receive. Does the first pending send in A
block all the later RDMA_WRITEs?
According to the IBTA spec, the HCA will process WR entries in the strict
order in which they are posted, so the send will block all WRs posted after
this send - unless the HCA has multiple processing elements, though I think
even then the processing order will be maintained by the HCA.
If not, since RNR is triggered

The source HCA is responsible for processing work requests in the order 
they are posted.   If the SEND cannot proceed and receives a RNR, then the 
subsequent RDMA Write should not proceed, i.e. the sequence numbers that 
define the valid window will not progress and given IB requires strong 
ordering within the fabric, nothing sent subsequently should be made 
visible at the sink HCA.   In your example, if A is sending a SEND followed 
by a RDMA Write, the first check should have been that B had provided an 
ACK with a credit indicating that a SEND is allowed.  If B subsequently 
removed access to the buffer that had to be posted to provide that credit, 
then it should trigger an RNR NAK, and the subsequent RDMA Writes should not 
be visible at B since there is now an effective hole in the transmission stream.
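
To make the ordering concrete, here is a minimal sketch of posting the SEND
followed by the RDMA Write on the same RC QP (the buffers, MR keys, and
remote address are placeholders); if the SEND draws an RNR NAK, the write
chained behind it is not made visible at the responder until the SEND
eventually completes:

#include <infiniband/verbs.h>
#include <stdint.h>

static int post_send_then_write(struct ibv_qp *qp,
                                struct ibv_sge *send_sge, struct ibv_sge *write_sge,
                                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr *bad;
    struct ibv_send_wr write_wr = {
        .wr_id               = 2,
        .opcode              = IBV_WR_RDMA_WRITE,
        .sg_list             = write_sge,
        .num_sge             = 1,
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr send_wr = {
        .wr_id      = 1,
        .opcode     = IBV_WR_SEND,       /* needs a receive posted at B */
        .sg_list    = send_sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED,
        .next       = &write_wr,         /* queued strictly behind the SEND */
    };

    /* Both WRs are processed in posting order on this QP. */
    return ibv_post_send(qp, &send_wr, &bad);
}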

periodically until B posts a receive, does it affect the RDMA_WRITE
performance between A and B ?

2. Extend the above to three processes: A connects to B, B connects to C, so B
has two QPs but one CQ. A posts a send to B; B does not post a receive;
rather, B and C are doing long-running RDMA_WRITEs or send/recv. But B
must send RNR NAKs periodically to A, right? So does the pending message
from A affect B's overall performance between B and C?

Neither IB nor iWARP provide any ordering guarantees between different data 
flows.  This is strictly under application control.  Hence, if a RNR NAK or 
whatever occurs on a RC between A and B, then it has no impact on what 
occurs between A and C or B and C.   It is simply outside the scope of 
either technology to address.

Mike







Re: [openib-general] Problem is routing CM REQ

2007-02-12 Thread Michael Krause
At 12:56 PM 2/12/2007, Jason Gunthorpe wrote:
On Mon, Feb 12, 2007 at 09:23:06AM -0800, Sean Hefty wrote:
  Ah, I think I missed the key step in your scheme.. You plan to query
  the local SM for SGID=remote DGID=local? (ie reversed from 'normal'. I
  was thinking only about the SGID=local DGID=remote query direction)
 
  I'm not sure that the query needs the GIDs reversed, as long as the 
 path is
  reversible.  So, the local query would be:
 
  SGID=local, DGID=remote, reversible=1   (to SA)
 
  And the remote query would be:
 
  SGID=local, DGID=remote, reversible=1,  (to SA')
  TClass  FlowLabel=from previous query response

1) What does the TClass and FlowLabel returned from SGID=local
DGID=remote mean?
Do you use it in the Node1 - Node2 direction or the Node2 - Node1 
 direction
or both?
1a) If it is Node1 - Node2 then the local SA has to query SA' to figure
 what FlowLabel to return.
1b) If it is for both directions then somehow SA, SA' and all four
 router ports need to agree on global flowlabels.
2) In the 2nd query, passing SGID=local, DGID=remote is 'reversed'
since SGID=local is the wrong subnet for SA'.
I think defining this to mean something is risky.
2b) A PR query with TClass and FlowLabel present in the query is
 currently expected to return an answer with those fields matching.
 That implies #1b..

TClass is intended to communicate the end-to-end QoS desired.   TClass is 
then mapped to a SL that is local to each subnet.   A flow label is 
intended to be much the same as in the IP world and is left, in essence, to 
routers to manage.   An endnode look up should be to find the address 
vector to the remote.   A look up may return multiple vectors.   The SLID 
would correspond to each local subnet router port that acts as a first-hop 
destination to the remote subnet.   I don't see why the router protocol 
would not simply enable all paths on the local subnet to a given remote 
subnet to be acquired.  All of the work is kept local to the SA / SM in the 
source subnet when determining a remote path to take.   Why is there any 
need to define more than just this?  Define a router protocol to 
communicate each subnet's prefix, TClass, etc. and apply KISS.   A 
management entity that wants to manage how each subnet provides router 
management in terms of route selection, etc. can be constructed by using 
the existing protocols / tools combined with a new router protocol which 
only does DGID to next hop SLID mapping.

Mike


So, here is how I see this working..

- There is a single well known 'reversible' flowlabel. When a router
   processes a GRH with that flowlabel it produces a packet that
   has a SLID that is always the same, no matter what router port is
   used (A' or B' in my example). The LRH is also reversible according
   to the rules in IBA.

   A well known value side-steps the global information problem and
   allows the GRH to be reversible.
- Whenever a PR has reversible=1 the result returns the well known flowlabel.
   The router LID is always the single shared SLID.
- To get a more optimal path the following sequence of queries are used:
   to SA: SGID=Node1 DGID=Node2
[In the background SA asks SA' what flow label to use]
   to SA': SGID=Node1 DGID=Node2 FlowLabel=(from above)
   to SA': SGID=Node2 DGID=Node1 SLID=(dlid from above)
[In the background SA' asks SA what flow label to use]
   to SA: SGID=Node2 DGID=Node1 FlowLabel=(from above)

   It is almost guaranteed that the FlowLabel will be asymmetric. This
   is to keep the flowlabel space local to each subnet.

   In the background queries, SA and SA' also examine the global route
   topology to select an optimal no-spoof needed router LID. The
   background exchange is how the disambiguation problem with
   multiple-router path is solved.

Implicit in this are five IBA affecting things:
  - that PRs with SGID=non-local mean something specific
  - PRs with DGID=non-local cause the SA to communicate with the remote
SA to learn the GRH's FlowLabel
(except in the case where reversible=1)
  - clients can communicate with remote SA's
  - Routers do the SLID spoofing you outlined.
  - SA's and routers collaborate quite closely on how the
router produces a LRH. In particular the SA controls the SLID
spoofing

A new query type or maybe some kind of modified multi-path-record
query could be defined by IBA to reduce the 6 exchanges required to
something more efficient.

Does this match what you are thinking?

 SA  SA'
  Node1 -- (LID 1) Router A ---  Router A' (LID A) --- Node2
|- (LID 2) Router A  |
|- (LID 3) Router B ---  Router B' (LID B) --|
  
  Router A and Router B are independent redundant devices, not a route
  cloud of some sort. B - A' is not a possible path.
 
  Since A' and B' connect to the same subnet, B - A' should be a valid path.

Please don't 

Re: [openib-general] Problem is routing CM REQ

2007-02-12 Thread Michael Krause
At 02:47 PM 2/12/2007, Sean Hefty wrote:
  1) What does the TClass and FlowLabel returned from SGID=local
 DGID=remote mean?
 Do you use it in the Node1 - Node2 direction or the Node2 - Node1 
 direction
 or both?

Maybe it would help if we can agree on a set of expectations.  These are 
what I
am thinking:

1. An SA should be able to respond to a valid PR query if at least one of the
GIDs in the path record is local.

2. The LIDs in a PR are relative to the SA's subnet that returned the record.

3. An IB router should not failover transparently to QPs sending traffic 
through
that router.

There is no reason for such a restriction.  APM can work with routers and 
the IB protocol will recover from any out of order packet processing just fine.


4. A PR from the local SA with reversible=1 indicates that data sent from the
remote GID to the local GID using the PR TC and FL will route locally 
using the
specified LID pair.  This holds whether the PR SGID is local or remote.

5. A PR from a remote SA with reversible=1 indicates that data sent from the
local GID to the remote GID using the PR TC and FL will route remotely 
using the
specified LID pair.  This holds whether the PR SGID is local or remote.

6. A PR with reversible=0 is relative to SA's subnet.  The SGID-DGID data 
flow
over the PR TC and FL indicates the SLID-DLID mapping for that subnet.

Do your expectations differ from these?

The use of reversible between subnets is what's concerning me.  It may be 
that
an SA could not return any paths as reversible between two subnets without 
using
some trick like what you mentioned.

These add a requirement on the SA that they must be aware of the routes 
packets
take between two GIDs using a given TC and FL, but I don't believe that this
necessarily forces SA to SA communication.  The SA may only need to exchange
information with a router...?

It should not force SA to SA communication.   Such communication is overly 
complex and will be a major issue to control and manage in the end. 
Further, security concerns, partition management, etc. are complex 
enough as it is without adding more fuel to the fire.

  Implicit in this are five IBA affecting things:
   - that PRs with SGID=non-local mean something specific

I don't think that we're changing any of the meanings of the fields though.

   - Routers do the SLID spoofing you outlined.

I'm not sure this is something that we do want now.  APM should really handle
path failover.

  There is alot of complex work in the router and SA side to make this
  kind of topology work, but it is critical that the clients use path
  queries that can provide enough data to the SA and return enough data
  to the client to support this.

I'm still deciding if the existing path record attribute is sufficient.

Our original IB router work I believe drove some of what is in the current 
records so I suspect they are fine as is.

Mike 






Re: [openib-general] Immediate data question

2007-02-08 Thread Michael Krause
At 03:41 PM 2/7/2007, Roland Dreier wrote:
 Changqing What I mean is that, is there any performance penalty
 Changqing for receiver's overall performance if RNR happens
 Changqing continuously on one of the QP ?

Not for the receiver, but the sender will be severely slowed down by
having to wait for the RNR timeouts.

RNR = Receiver Not Ready, so by definition the data flow isn't going to 
progress until the receiver is ready to receive data.   If a receive QP 
enters RNR for a RC, then it is likely not progressing as desired.   RNR 
was initially put in place to enable a receiver to create back pressure on 
the sender without causing a fatal error condition.  It should rarely be 
entered and therefore should have negligible impact on overall performance; 
however, when an RNR occurs, no forward progress will occur, so performance 
is essentially zero.

Mike 






Re: [openib-general] dapl broken for iWARP

2007-02-08 Thread Michael Krause
At 07:43 AM 2/8/2007, Kanevsky, Arkady wrote:
That is correct.
I am working with Krishna on it.
Expect patches soon.

By the way the problem is not DAPL specific
and so is a proposed solution.

There are 3 aspects of the solution.
One is APIs. We suggest that we do not augment these.
That is a connection requestor sets its QP
RDMA ORD and IRD.
When connection is established user can check the QP RDMA ORD and IRD
to see what he has now to use over the connection.
We may consider to extend QP attributes to support transport specific
parameters passing in the future.
For example, iWARP MPA CRC request.

Second is the semantic that CM provides.
The proposal is to match IBCM semantic.
That is, CM guarantees that the local IRD is >= the remote ORD.
This guarantees that incoming RDMA Read requests will not overwhelm
the QP RDMA Read capabilities.
Again, there are no changes to IBCM, only to IWCM.
Notice that as part of this IWCM will pass down to driver and extract
from driver
needed info.

The final part is iWARP CM extension to exchange RDMA ORD, IRD.
This is similar to IBTA Annex for IP Addressing.
The harder part is that this will eventually require an IETF MPA spec extension,
and the fact that MPA protocol is implemented in RNIC HW by many vendors,
and hence can not be done by IWCM itself.

We looked at this quite a bit during the creation of the 
specification.   All of the targeted usage models exchange this information 
as part of their hello or login exchanges.  As such, the hum was to 
not change MPA to communicate such information and leave it to software to 
exchange these values through existing mechanisms.   I seriously doubt 
there will be much support for modifying the MPA specification at this 
stage since the implementations are largely complete and a modification 
would have to deal with the legacy interoperability issue, which would likely 
be solved in software anyway.  It would be simpler to modify 
the underlying DAPL implementation to exchange the information and keep 
this hidden from both the application and the RNIC providers.
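
A hedged sketch (assuming the rdma_cm API) of the software-level exchange 
described above: rather than extending MPA, the application (or the DAPL 
layer beneath it) carries its ORD / IRD on the connection parameters handed 
to the CM.  The values are illustrative.

#include <rdma/rdma_cma.h>
#include <string.h>

static int connect_with_ord_ird(struct rdma_cm_id *id)
{
    struct rdma_conn_param param;

    memset(&param, 0, sizeof param);
    param.initiator_depth     = 4;  /* outbound RDMA Reads we will issue (ORD) */
    param.responder_resources = 4;  /* inbound RDMA Reads we can absorb (IRD)  */

    return rdma_connect(id, &param);
}

The passive side sets the same two fields on rdma_accept(); reconciling the 
two sides is exactly the IWCM gap being discussed here.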

Mike


Thanks,

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300


  -Original Message-
  From: Steve Wise [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, February 07, 2007 6:12 PM
  To: Arlin Davis
  Cc: openib-general
  Subject: Re: [openib-general] dapl broken for iWARP
 
  On Wed, 2007-02-07 at 15:05 -0800, Arlin Davis wrote:
   Steve Wise wrote:
  
   On Wed, 2007-02-07 at 14:02 -0600, Steve Wise wrote:
   
   
   Arlin,
   
   The OFED dapl code is assuming the responder_resources and
   initiator_depth passed up on a connection request event
  are from the
   remote peer.  This doesn't happen for iWARP.  In the
  current iWARP
   specifications, its up to the application to exchange this
   information somehow. So these are defaulting to 0 on the
  server side
   of any dapl connection over iWARP.
   
   This is a fairly recent change, I think.  We need to come up with
   some way to deal with this for OFED 1.2 IMO.
   
   
   Yes, this was changed recently to sync up with the rdma_cm changes
   that exposed the values.
  
   
   
   
   The IWCM could set these to the device max values for instance.
   
   
   That would work fine as long as you know the remote
  settings will be
   equal or better. The provider just sets the min of local device max
   values and the remote values provided with the request.
  
 
  I know Krishna Kumar is working on a solution for exchanging
  this info in private data so the IWCM can do the right
  thing.  Stay tuned for a patch series to review for this.
  But this functionality is definitely post OFED-1.2.
 
 
  So for the OFED-1.2, I will set these to the device max in the IWCM.
  Assuming the other side is OFED 1.2 DAPL, then it will work fine.
 
  Steve.
 
 
 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] Problem is routing CM REQ was: Use a GRH when appropriate for unicast packets

2007-02-08 Thread Michael Krause
At 12:39 PM 2/8/2007, Hal Rosenstock wrote:
On Thu, 2007-02-08 at 14:54, Sean Hefty wrote:
  Hum, you mean to meet the LID validation rules of 9.6.1.5? That is a
  huge PITA..
  
  [IMHO, 9.6.1.5 C9-54 is a mistake, if there is a GRH then the LRH.SLID
   should not be validated against the QP context since it makes it
   extra hard for multipath routing and QoS to work...]

If you examine the prior diagram, the packet validation is quite precise 
and intent on catching any misrouted packets as early in the validation 
process as possible.  This particular compliance statement makes it clear 
as to the type of connection and how to pattern match.  The protocol was 
designed to work within a single subnet as well as across subnets.  Hence, 
the GRH must be validated in conjunction with the LRH and the QP context in 
order to ensure an intermediate component did not misroute the 
packet.   As described, an RC QP must flow through at most a single path at 
any given time in order to ensure packet ordering is maintained (IB 
requires strong ordering, so multi-path within a single RC is not 
allowed).   As for QoS, one can arbitrate a packet for an RC QP relative to 
other flows without any additional complexity.   If one wants to segregate 
a set of RC QPs onto different paths as well as arbitration slots, that is 
allowed and supported by the architecture even if going between the same 
set of ports - simply use multiple LIDs and SLs during connection 
establishment.
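
A hedged sketch of that last point: two RC QPs between the same pair of 
ports can be segregated onto different paths by giving each a different 
DLID / SL (and, with LMC > 0, different source path bits) in its address 
vector at connection establishment.  The LID / SL values are illustrative; 
real ones come from SA path record queries.

#include <stdint.h>
#include <infiniband/verbs.h>

static void set_rc_path(struct ibv_qp_attr *rtr, uint16_t dlid, uint8_t sl,
                        uint8_t src_path_bits, uint8_t port)
{
    rtr->ah_attr.is_global     = 0;              /* add a GRH for inter-subnet */
    rtr->ah_attr.dlid          = dlid;
    rtr->ah_attr.sl            = sl;
    rtr->ah_attr.src_path_bits = src_path_bits;  /* choose among LMC-based SLIDs */
    rtr->ah_attr.port_num      = port;
}

/* e.g. set_rc_path(&rtr_a, 0x12, 0, 0, 1) and set_rc_path(&rtr_b, 0x13, 1, 1, 1)
 * place the two connections on distinct LID / SL pairs. */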

Mike

 
  Yes - this gets messy.
 
  Here is one thought on how to do this:
  To meet this rule each side of the CM must take the SLID from
  the incoming LRH as the DLID for the connection. This SLID will be
  one of the SLIDs for the local router. The other side doesn't need to
  know what it is. The passive side will get the router SLID from the
  REQ and the active side gets it from the ACK.
  
  The passive side is easy, it just path record queries the DGID and
  requests the DLID == the incoming LRH.SLID.
 
  This requires that the passive side be able to issue path record 
 queries, but I
  think that it could work for static routes.  A point was made to me 
 that the
  remote side could be a TCA without query capabilities.

Are you referring to SA query capabilities ? Would such a device just be
expected to work without change in an IB routed environment anyway ?

-- Hal

 
  There's still the issue of what value is carried in the remote port LID 
 in the
  CM REQ (12.7.21), and I haven't even gotten to APM yet...
 
  The nasty problem is with the active side - CMA will select a router
  lid it uses as the DLID and the router may select a different LID for
  it to use as the SLID when it processes the ACK. By C9-54 they have to
  be the same : So the active side might have to do another path record
  query to move its DLID and SL to match the routers choosen
  SLID. Double suck :P
 
  As long as the SA and local routers are in sync, we may be okay here 
 without a
  second path record query.
 
  - Sean
 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] BandWidth doubt

2006-11-10 Thread Michael Krause
At 02:02 AM 11/10/2006, john t wrote:
Hi,

I got following readings in one of my experiments:

Single 64-bit xeon machine (2 dual-core 3.2 GHz Intel CPUs, linux FC4, 
OFED 1.0) with two Mellanox DDR (4x) HCAs (each having two ports and each 
connected to a PCI x8 interface) is connected to a switch (all the 4 DDR 
(4x) ports are connected to the switch).

If I send data from mthca0-1 to mthca0-1 meaning from same port to the 
same port i.e. same port doing send/recv (also same cable doing send/recv) 
I get a BW of around 10 Gb/sec.

Similarly, from mthca1-1 to mthca1-1 I get same i.e. around 10 Gb/sec.

So, individual port-to-port gives 10 Gb/sec.

But when I use them together, i.e. when I send the data from mthca0-1 to 
mthca0-1 AND from mthca1-1 to mthca1-1 at the same time (simultaneously), I 
get a BW of 6.7 Gb/sec on each port. This is less than the expected 
10 Gb/sec. Note that mthca0 and mthca1 are connected to two different 
PCI-x8 interfaces, so there is no question of bandwidth splitting. What 
could be causing such behaviour?

Just to add if the same thing is done between two different hosts i.e. If 
I send data from mthca0-0 and mthca1-1 of one host to mthca0-0 and 
mthca1-1 of other host, I get expected BW i.e. 10 Gb/sec on each port/link.


You have two links pounding on a shared PCIe Root Complex / memory 
controller.  This sounds like a chipset issue when placed under load, not 
an IB / software issue.

Mike 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] ibstatus support for speed

2006-10-31 Thread Michael Krause
At 02:43 PM 10/30/2006, Hal Rosenstock wrote:
On Mon, 2006-10-30 at 17:29, Michael Krause wrote:
  At 02:05 PM 10/30/2006, Roland Dreier wrote:
   Hal So rate = speed * width ?
  
  Yes, you should see the right thing on DDR systems etc.
 
  Strange.  Bandwidth = signaling rate * width.   This of course is raw
  bandwidth prior to encoding, protocol, etc. overheads which will derate 
 the
  effective application bandwidth minimally by 20-25%.

Yes of course. It's just a simple diagnostic to display the width and
speed.

  If the goal is to
  provide a true indication of the maximum peak bandwidth that an 
 application
  might see,

That's not the goal of this simplistic tool.

   then stating 10 Gbps for an IB x4 SDR is clearly a
  misrepresentation and out of alignment with other networking links such as
  Ethernet which customers understand its bandwidth to be minimally after 
 the
  encoding, etc. is removed from the equation.   The perpetual trend by
  marketing to use 10 Gbps IB as equivalent to 10 Gbps of application 
 data is
  actually detrimental not beneficial when it comes to customers.  It
  inevitably leads to the question of why the application is not achieving
  the stated bandwidth, i.e. why it is say 700-800MB/s theoretical peak 
 for a
  x4 while a 10 GbE is 1 GB/s peak.  So much marketing hype has gone forward
  already.   I realize I'm tilting at windmills but if you are to provide a
  tool that is supposed to project the maximum bandwidth possible and given
  the goal of OFA is to provide as much conceptual commonality with existing
  network stacks / links, then it would be beneficial to have this move
  towards a much more apple-to-apple communication of information.  I 
 know it
  would certainly help with having to repeatedly explain why IB 10 Gbps is
  not the same as 10 GbE to customers and analysts.

Agreed but this is a different issue from what the tool is for.

Understood.


IMO this issue largely started when IB decided to use the signalling
rate rather than the data rate like most other networks.

Blame it on marketroids who were more concerned about their naive attempt 
to look better than other technology than about customers or the people 
who have to continually explain how their drivel is simply 
wrong.  Unfortunately, these same marketroids continue to perpetuate this 
message even now with their apple-to-orange comparisons.  It annoys customers 
who, when educated, end up with a slightly less favorable opinion of the 
technology.

Mike 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] ibstatus support for speed

2006-10-30 Thread Michael Krause
At 02:05 PM 10/30/2006, Roland Dreier wrote:
 Hal So rate = speed * width ?

Yes, you should see the right thing on DDR systems etc.

Strange.  Bandwidth = signaling rate * width.   This of course is raw 
bandwidth prior to encoding, protocol, etc. overheads which will derate the 
effective application bandwidth minimally by 20-25%.   If the goal is to 
provide a true indication of the maximum peak bandwidth that an application 
might see, then stating 10 Gbps for an IB x4 SDR is clearly a 
misrepresentation and out of alignment with other networking links such as 
Ethernet which customers understand its bandwidth to be minimally after the 
encoding, etc. is removed from the equation.   The perpetual trend by 
marketing to use 10 Gbps IB as equivalent to 10 Gbps of application data is 
actually detrimental not beneficial when it comes to customers.  It 
inevitably leads to the question of why the application is not achieving 
the stated bandwidth, i.e. why it is say 700-800MB/s theoretical peak for a 
x4 while a 10 GbE is 1 GB/s peak.  So much marketing hype has gone forward 
already.   I realize I'm tilting at windmills but if you are to provide a 
tool that is supposed to project the maximum bandwidth possible and given 
the goal of OFA is to provide as much conceptual commonality with existing 
network stacks / links, then it would be beneficial to have this move 
towards a much more apple-to-apple communication of information.  I know it 
would certainly help with having to repeatedly explain why IB 10 Gbps is 
not the same as 10 GbE to customers and analysts.
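
An illustrative calculation (not from this thread) of the derating described 
above for 4x SDR: remove 8b/10b encoding from the signaling rate, then an 
assumed protocol / header overhead.  The overhead fraction is a rough 
assumption, not a spec figure.

#include <stdio.h>

int main(void)
{
    double raw  = 4 * 2.5;           /* 4 lanes x 2.5 Gbps signaling = 10 Gbps */
    double data = raw * 8.0 / 10.0;  /* 8 Gbps after 8b/10b encoding */
    double app  = data * 0.90;       /* minus ~10% assumed protocol overhead */

    printf("raw %.1f Gbps, post-encoding %.1f Gbps, ~%.1f Gbps application peak\n",
           raw, data, app);
    return 0;
}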

Mike 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] IPoIB Question

2006-10-24 Thread Michael Krause
At 10:00 PM 10/23/2006, Greg Lindahl wrote:
On Mon, Oct 23, 2006 at 07:53:06AM -0500, Hubbell, Sean C 
Contractor/Decibel wrote:

I currently have several applications that uses a legacy IPv4 protocol
  and I use IPoIB to utilize my infiniband network which works great. I
  have completed some timing and throughput analysis and noticed that I do
  not get very much more if I use an infiniband network interface than
  using my GigE network interface.

You might want to note that different InfinBand implementations have
quite different performance of IPoIB, especially for UDP.

Another issue is that IPoIB has quite different performance with
different Linux kernels. This is especially evident for TCP, although
you can use SDP to accelerate TCP sockets and avoid this issue.

  My question is, am I using IPoIB correctly or are these the typical
  numbers that everyone is seeing?

It is certainly the case that there are some message patterns and
situations for which InfiniBand is not much of an improvement over
gigE.

Unfortunately, comparisons of IB to GbE are often apple-to-orange 
comparisons, even for IP over IB.  Until a HCA supplies the same level of 
functional off-load enabled by the IP network stack that is used with 
Ethernet, it really isn't a fair comparison.  The same is also true for 
many of the marketroids and their comparisons of IB to Ethernet based 
solutions.  Fortunately, most customers are getting a bit smarter and not 
falling for the marketing drivel these days - certainly the OEMs don't fall 
for it, though the marketroids continue to come in and try to convince 
people it isn't an apple-to-orange comparison.   The fact is both 
technologies have their pros / cons, and it is really the workload or 
production environment that determines which is the best fit instead of the 
force fit.

In any case, not really a development issue so will drop further discussion.

Mike 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] IPoIB Question

2006-10-23 Thread Michael Krause
At 10:19 AM 10/23/2006, Michael S. Tsirkin wrote:
Quoting r. Sean Hubbell [EMAIL PROTECTED]:
  I am looking at libsdp for the TCP funcationality and wanted to know if
  libsdp supports UDP as well

AFAIK, SDP can only emulate TCP sockets.

SDP is defined to work with AF_INET applications.  If using a shared 
library approach / pre-load, one can transparently enable any AF_INET 
application to utilize SDP without a recompile, etc.   The SDP Port Mapper 
specification for iWARP / service ID for IB enables the connection 
management (or whatever service it is implemented within) to discover, 
transparently to the application, the real target listen port and establish 
a SDP session, nominally during connection establishment.   Implementations 
may vary in the robustness or policies used to determine what to off-load, 
the number of off-load sessions, etc. - in other words, a lot of opportunity 
and flexibility is provided to use SDP.

Note: WinSock Direct on Windows provides an equivalent service, though it uses 
a proprietary protocol.  Vista will have SDP as defined in the specifications.

There are currently no plans to develop an equivalent for datagram 
applications.   Any datagram application (user or kernel) can already 
access the hardware directly and given RDMA is not defined for datagram, it 
was felt such a specification would provide minimal value.

Mike  



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] APM support in openib stack

2006-10-13 Thread Michael Krause
At 11:24 AM 10/13/2006, Sean Hefty wrote:
 3. in req_handler() we follow the same steps as we have done without APM..
i.e. create qpairs, change qp state to RTR and then  send REP.
 
 however, when trying to change the state to RTR using ib_modify_qp() I get
 an error (-22).
 
 Two points: the same code will work if I pass alt_path as NULL or use the
 alt_path as the primary path.
 
 I must be missing something here; I assume this basic APM feature works
 in the RHEL4 update 4 distribution
 of the openib stack.

I added code to the ib_cm to handle APM, but haven't ever tested it myself.  I
believe others have used it successfully though.

What differences are there between the primary and alternate paths?  I.e. are
just the LIDs different, or are other values also different?

The spec allows a full address vector to be specified, not just the LID.
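
A hedged sketch (libibverbs) of loading an alternate path for APM on an RC 
QP: the alternate address vector is supplied alongside the usual INIT->RTR 
attributes with IBV_QP_ALT_PATH added to the mask.  A missing or 
inconsistent alternate field is a common source of the EINVAL (-22) seen 
above.  Values are illustrative; the caller is assumed to have filled the 
rest of the RTR attributes.

#include <stdint.h>
#include <infiniband/verbs.h>

static int rtr_with_alt_path(struct ibv_qp *qp, struct ibv_qp_attr *rtr,
                             uint16_t alt_dlid, uint8_t alt_sl, uint8_t alt_port)
{
    rtr->qp_state             = IBV_QPS_RTR;
    rtr->alt_ah_attr.dlid     = alt_dlid;  /* a full AV may differ, not just the LID */
    rtr->alt_ah_attr.sl       = alt_sl;
    rtr->alt_ah_attr.port_num = alt_port;
    rtr->alt_port_num         = alt_port;
    rtr->alt_pkey_index       = 0;
    rtr->alt_timeout          = 14;

    return ibv_modify_qp(qp, rtr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER |
                         IBV_QP_ALT_PATH);
}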

Mike 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] Dropping NETIF_F_SG since no checksum feature.

2006-10-11 Thread Michael Krause
At 02:46 AM 10/11/2006, Michael S. Tsirkin wrote:
Quoting r. David Miller [EMAIL PROTECTED]:
  Subject: Re: Dropping NETIF_F_SG since no checksum feature.
 
  From: Michael S. Tsirkin [EMAIL PROTECTED]
  Date: Wed, 11 Oct 2006 11:05:04 +0200
 
   So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
   data will be copied over rather than sent directly.
   So why does dev.c have to force set NETIF_F_SG to off then?
 
  Because it's more efficient to copy into a linear destination
  buffer of an SKB than page sub-chunks when doing checksum+copy.
 

Thanks for the explanation.
Obviously its true as long as you can allocate the skb that big.
I think you won't realistically be able to get 64K in a
linear SKB on a busy system, though, is not that right?

OTOH, having large MTU (e.g. 64K) helps performance a lot since it reduces 
receive side processing overhead.

One thing to keep in mind is that while it may help performance in a 
micro-benchmark, the system performance or the QoS of other flows 
can be negatively impacted depending upon the implementation.  For example, 
consider multiple messages interleaving (heaven help implementations that 
are not able to interleave multiple messages) on either the transmit or 
receive HCA / RNIC and how the time-to-completion of any message is 
extended out in time as a result of the interleave.  The effective 
throughput in terms of useful units of work can be lower as a result.   The 
same effect can be observed when there are a significant number of connections 
in a device being simultaneously processed.

Also, if the copy-checksum is not performed on the processor where the 
application resides, then the performance can also be negatively impacted 
(want to have the right cache hot when initiated or concluded).  While the 
aggregate computational performance of systems may be increasing at a 
significant rate (set aside the per core vs. aggregate core debate), the 
memory performance gains are much less.  If you examine the longer term 
trends, there may be a flattening out of memory performance improvements by 
2009/10 without some major changes in the way controllers and subsystems 
are designed.

Mike 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] ipoib: ignores dma mapping errors on TX?

2006-10-10 Thread Michael Krause
At 10:24 AM 10/10/2006, Tom Tucker wrote:

Does anyone know what might happen if a device tries to bus master
bad_dma_address. Does it get a pci-abort, an NMI, a bus err interrupt, all
of the above?

It depends upon the platform.   Some will enter a containment mode and, for 
example, shutdown the PCI Bus or the PCIe Root Port.  Others may trigger a 
system error and shutdown the system.  These responses are in part, a 
policy of the implementation and how the system is implemented.  In future 
chipsets that contain IOMMU / Address Translation Protection Tables (ATPT) 
/ pick your favorite name, the error can be contained to a single device 
and the appropriate error recovery triggered without requiring the system 
to go down.   Again, all policy at the end of the day as to what action is 
triggered.  For most, the potential for silent data corruption is too high 
to risk that bus or Root Port from continuing to operate without a reset / 
flush so containment is used at a minimum.

Mike



On 10/9/06 1:01 PM, Roland Dreier [EMAIL PROTECTED] wrote:

  Michael It seems that IPoIB ignores the possibility that
  Michael dma_map_single with DMA_TO_DEVICE direction might return
  Michael dma_mapping_error.
 
  Michael Is there some reason that such mappings can't fail?
 
  No, it's just an oversight.  Most network device drivers don't check
  for DMA mapping errors but it's probably better to do so anyway.  I
  added this to my queue:
 
  commit 8edaf479946022d67350d6c344952fb65064e51b
  Author: Roland Dreier [EMAIL PROTECTED]
  Date:   Mon Oct 9 10:54:20 2006 -0700
 
  IPoIB: Check for DMA mapping error for TX packets
 
  Signed-off-by: Roland Dreier [EMAIL PROTECTED]
 
  diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
  b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
  index f426a69..8bf5e9e 100644
  --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
  +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
  @@ -355,6 +355,11 @@ void ipoib_send(struct net_device *dev,
    	tx_req->skb = skb;
    	addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len,
    			      DMA_TO_DEVICE);
  +	if (unlikely(dma_mapping_error(addr))) {
  +		++priv->stats.tx_errors;
  +		dev_kfree_skb_any(skb);
  +		return;
  +	}
    	pci_unmap_addr_set(tx_req, mapping, addr);
  
    	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] Multi-port HCA

2006-10-06 Thread Michael Krause



Off-line someone asked me to clarify my earlier e-mail. Given this
discussion continues, perhaps this might help explain the performance a
bit more. The Max Payload Size quoted here is what is typically
implemented on x86 chipsets though other chipsets may use a larger
value. From a pure bandwidth perspective (which is not typical of
many applications), this should be reasonably accurate. In any
case, this is just a fyi.

A x4 IB 5 GT/s is 20 Gbps raw (customers do comprehend that the marketing hype
does not translate into that bandwidth being available for applications -
I have had to explain to the press in the past that raw does not
equal application-available bandwidth). Take off 8b/10b,
protocol overheads, etc. and assuming a 2KB PMTU, then one can expect to
hit perhaps 14-15 Gbps per direction depending upon the
workload. Let's assume an aggregate of 30 Gbps of potential
application bandwidth for simplicity. The PCIe x8 2.5 GT/s is
20 Gbps raw so take off the 8b/10b, protocol overheads, control /
application overheads, etc. and given it uses at most a 256B Max Payload
Size on DMA Writes and cache line sized DMA Read Completions (64B) though
many people use PIO Writes to avoid DMA Reads when it comes to
micro-benchmarks, the actual performance is unlikely to hit what IB might
drive depending upon the direction and mix of control and application
data transactions. Add in the impacts on memory controller which in
real-world applications is servicing the processors quite a bit more than
illustrated by micro-benchmarks and the ability of a system to drive an
IB x4 DDR device at link rate is very questionable. 
The question is whether this really matters. If you examine most
workloads on various platforms, they simply cannot generate enough
bandwidth to consume the external I/O bandwidth capacity. In many
cases, they are constrained by the processor or the combination of the
processor / memory components. This isn't a bad thing when you
think about it. For many customers, it means that the attached I/O
fabrics will be sufficiently provisioned to eliminate or largely mitigate
the impacts of external fabric events, e.g. congestion, and deliver a
reasonable solution using the existing hardware (issues of topology, use
of multi-path, etc. all come into bearing as a function of fabric
diameter). In the end, customers care about whether the
application performs as expected and where the real bottlenecks
lie. For most applications, it will come down to the processor /
memory subsystems and not the I/O or external fabric. 
While I haven't seen all of the latest DDR micro-benchmark results, I
believe the x4 IB SDR numbers largely align with what I've outlined
here. 
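
An illustrative (assumed) PCIe payload-efficiency estimate for the Max 
Payload Size point above: per-TLP header / framing overhead eats a larger 
fraction of the link when payloads are small.  The overhead byte count is a 
typical figure, not taken from any table in this thread.

#include <stdio.h>

int main(void)
{
    const double raw_gbps = 20.0 * 8.0 / 10.0; /* x8 @ 2.5 GT/s after 8b/10b */
    const double overhead = 24.0;              /* assumed TLP header + framing */
    int mps;

    for (mps = 128; mps <= 4096; mps *= 2)
        printf("MPS %4d B -> ~%.1f Gbps of payload\n",
               mps, raw_gbps * mps / (mps + overhead));
    return 0;
}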
Mike


At 02:09 AM 10/6/2006, john t wrote:
Hi Shannon,

The bandwidth figures that you quoted below match with my readings for
single port Mellanox DDR HCA (both for unidirection and bidirection). So
it seems dual port SDR HCA performs as good as single port DDR HCA. It
would help if you can also tell the bandwidth that you got using one port
of your dual-port SDR HCA card. Was it half the bandwidth that you stated
below, which means having two SDR ports per HCA helps. 

In my case it seems having two ports (DDR) per HCA does not increase BW,
since PCI-e x8 limit is 16 Gb/sec per direction and each of the two HCA
ports (DDR) though capable of transferring 16 Gb/sec in each direction,
when used together can not go above 16 Gb/sec. 

Regards,
John T.

On 10/5/06, Shannon V. Davidson
[EMAIL PROTECTED]
 wrote: 


John,

In our testing with dual port Mellanox SDR HCAs, we found that not
all PCI-express implementations are equal. Depending on the PCIe
chipset, we measured unidirectional SDR dual-rail bandwidth ranging from
1100-1500 MB/sec and bidirectional SDR dual-rail bandwidth ranging from
1570-2600 MB/sec. YMMV, but had good luck with Intel and Nvidia
chipsets, and less success with the Broadcom Serverworks HT-1000 and
HT-2000 chipsets. My last report (in June 2006) was that Broadcom was
working to improve their PCI-express performance. 

Regards,

Shannon

john t wrote: 

Hi Bernard,



I had a configuration issue. I fixed it and now I get same BW (i.e.
around 10 Gb/sec) on each port provided I use ports on different HCA
cards. If I use two ports of the same HCA card then BW gets divided
between these two ports. I am using Mellanox HCA cards and doing simple
send/recv using uverbs. 



Do you think it could be an issue with Mallanox driver or could it be
due to system/PCI-E limitation.



Regards,

John T.



On 10/3/06, Bernard King-Smith
[EMAIL PROTECTED] 
wrote: 



John, 

Whose adapter (manufacturer) are you using? It is
usually an adapter implementation or driver issue that occurs when you
cannot scale across multiple links. The fact that you don't scale up from
one link, but it appears they share a fixed bandwidth across N links
means that there is a driver or stack issue. At one time I think that
IPoIB and maybe other IB drivers used only 

Re: [openib-general] Drop in performance on Mellanox MT25204 single port DDR HCA

2006-10-03 Thread Michael Krause
At 02:43 PM 10/2/2006, Roland Dreier wrote:
 Robert Yes. 1250Mbytes/sec is what we expect.  You say the 128
 Robert value comes from the BIOS ? If so, we need to discuss this
 Robert with our BIOS team to find out why they limit it to 128,
 Robert perhaps it is a BIOS bug.

Yes, I believe that the BIOS is the only place that would set that
value.  We know that resetting the device makes it go back to a
different default value, and nothing in the kernel that I know of is
going to set it down to 128.

128B is the default minimum from PCIe, so likely some BIOS engineer took a 
conservative view and chose the defaults (go figure).  Setting Max Read 
Request Size = 4096 is preferred on any implementation as it is basically 
free from a chipset perspective.  The chipset will likely return completions 
in cache line quantities, but there are some obvious optimizations to be 
achieved by issuing a single DMA Read Request.

Mike  



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] [PATCH 0/10] [RFC] Support for SilverStorm Virtual Ethernet I/O controller (VEx)

2006-10-03 Thread Michael Krause

Silverstorm is executing a usage model that the IBTA used to develop the IB 
protocols.   What is the problem with that?  If it works and integrates 
into the stack, then this seems like an appropriate bit of functionality to 
support.   The fact that one can use a standard ULP to communicate to a TCA 
as an alternative which is supported by the existing stack is a customer 
product decision at the end of the day.   If Silverstorm or any IHV can 
show value and that it works in the stack, then it seems appropriate to 
support.  Isn't that a fundamental principle of being an open source effort?


Mike


At 12:31 PM 10/3/2006, Fabian Tillier wrote:
Hi Yaron,

On 10/3/06, Yaron Haviv [EMAIL PROTECTED] wrote:
 
  I'm trying to figure out why this protocol makes sense
  As far as I understand, IPoIB can provide a Virtual NIC functionality
  just as well (maybe even better), with two restrictions:
  1. Lack of support for Jumbo Frames
  2. Doesn't support protocols other than IP (e.g. IPX, ..)

Whether to use a router or virtual NIC approach for connectivity to
Ethernet subnets is a design decision.  We could argue until we are
blue in the face about which architecture is better, but that's
really not relevant.

  I believe we should first see if such a driver is needed and if IPoIB
  UD/RC cannot be leveraged for that, maybe the Ethernet emulation can
  just be an extension to IPoIB RC, hitting 3 birds in one stone (same
  infrastructure, jumbo frames for IPoIB, and Ethernet emulation for all
  nodes not just Gateways)

You're joking right?  Are you really arguing that SilverStorm should
not develop a driver to support its existing devices?  This really
isn't complicated:

1). SilverStorm has a virtual NIC hardware device.
2). SilverStorm is committed to support OpenFabrics.

The above two statements lead to the following conclusion: SilverStorm
needs a driver for its devices that works with the OpenFabrics stack.
This is totally orthogonal to and independent of working on IPoIB RC
or any IETF efforts to define something new.

- Fab




___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] A critique of RDMA PUT/GET in HPC

2006-08-29 Thread Michael Krause


At 08:56 AM 8/25/2006, Greg Lindahl wrote:
On Fri, Aug 25, 2006 at
10:13:01AM -0500, Tom Tucker wrote:
 He does say this, but his analysis does not support this conclusion.
His
 analysis revolves around MPI send/recv, not the MPI 2.0 get/put
 services.
Nobody uses MPI put/get anyway, so leaving out analyzing that
doesn't
change reality much.
Is this due to legacy or other reasons? One reason cited in
WinSock Direct for using the bcopy vs. the RDMA zcopy operations was the
cost to register memory if done on a per-operation basis, i.e. single
use. The bcopy threshold was ~9KB. With the new verbs
developed for iWARP and then added to IB v1.2, the bcopy threshold was
reduced to ~1KB. 
Now, if I recall correctly, many MPI implementations split their buffer
usage between what are often 1KB envelopes and large
regions. One can persistently register the envelopes so their size
does not really matter and thus could use send / receive or RDMA
semantics for their update depending upon how the completions are
managed. The larger data movements can use RDMA semantics if desired,
as these are typically large in size.
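
A toy sketch of the bcopy / zcopy threshold idea above: small payloads are 
copied into pre-registered envelope buffers and sent, while large payloads 
use (persistent or on-demand) registration and RDMA.  The 1KB threshold 
mirrors the figure in the text; everything else is illustrative.

#include <stddef.h>

enum xfer_path { XFER_BCOPY_SEND, XFER_ZCOPY_RDMA };

static enum xfer_path choose_path(size_t len)
{
    const size_t bcopy_threshold = 1024;   /* ~1KB, per the new-verbs figure */
    return (len <= bcopy_threshold) ? XFER_BCOPY_SEND : XFER_ZCOPY_RDMA;
}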

 A valid
conclusion IMO is that MPI send/recv can
 be most efficiently implemented over an unconnected reliable
datagram
 protocol that supports 64bit tag matching at the data sink.
And not
 coincidentally, Myricom has this ;-)
As do all of the non-VIA-family interconnects he mentions. Since
we
all landed on the same conclusion, you might think we're on to
something. Or not.
We've had this argument multiple times and examined all of the known and
relatively volume usage models which includes the suite of MPI benchmarks
used to evaluate and drive implementations. Any interconnect
architecture is one of compromise if it is to be used in a volume
environment - the goal for the architects is to insure the compromises do
not result in a brain-dead or too diminished technology that will not
meet customer requirements. 
With respect to reliable datagram, unless one does software multiplexing
over what amounts to a reliable connection, which comes with a performance
penalty as well as complexity in terms of error recovery logic, etc., it
really does not buy one anything better than the RC model used today.
Given the application mix and the customer usage model, IB provided four
transport types to meet different application needs and allow people to
make choices. iWARP reduced this to one since the target
applications really were met with RC, and reliable datagram as defined in
IB simply was not being picked up or demanded by the targeted ISVs.
While some of us had argued for the software multiplex model, others
wanted everything to be implemented in hardware, so IB is what it is
today. In any case, it is one of a set of reasonable
compromises, and for the most part I contend it is difficult to argue
that these interconnect technologies are so compromised that they are
brain dead or broken.
However, that's
only part of the argument. Another part is that the
buffer space needed to use RDMA put/get for all data links is huge.
And there are some other interesting points.
The buffer and context differences to track RDMA vs. Send are not
significant in terms of hardware. In terms of software, memory
needs to be registered in some capacity to perform DMA to it and hence,
there is a cost from the OS / application perspective. Our goals
were to be able to use application buffers to provide zero copy data
movements as well as OS bypass. RDMA vs. Send does not
incrementally differ in terms of resource costs in the end.

 I DO agree
that it is interesting reading. :-), it's definitely got
 people fired up.
Heh. Glad you found it interesting.
The article is somewhat interesting but does not really present anything
novel in this on-going debate on how interconnects should be
designed. There will always be someone pointing out a
particular issue here and there and in the end, many of these amount to
mouse nuts when placed into the larger context. When they don't, a
new interconnect is defined or extensions are made to compensate as
nothing is ever permanent or perfect.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] basic IB doubt

2006-08-28 Thread Michael Krause


At 10:14 AM 8/23/2006, Ralph Campbell wrote:
On Wed, 2006-08-23 at 09:47
-0700, Caitlin Bestler wrote:
 [EMAIL PROTECTED] wrote:
  Quoting r. john t [EMAIL PROTECTED]:
  Subject: basic IB doubt
  
  Hi
  
  I have a very basic doubt. Suppose Host A is doing RDMA
write (say 8
  MB) to Host B. When data is copied into Host B's local
  buffer, is it guaranteed that data will be copied starting
  from the first location (first buffer address) to the last
  location (last buffer address)? or it could be in any
order?
  
  Once B gets a completion (e.g. of a subsequent send), data
in
  its buffer matches that of A, byte for byte.
 
 An excellent and concise answer. That is exactly what the
application
 should rely upon, and nothing else. With iWARP this is very
explicit,
 because portions of the message not only MAY be placed out of 
 order, they SHOULD be when packets have been re-ordered by the
 network. But for *any* RDMA adapter there is no guarantee on
 what order the adapter flushes things to host memory or
particularly
 when old contents that may be cached are invalidated or
updated.
 The role of the completion is to limit the frequency with which
 the RDMA adapter MUST guarantee coherency with application
visible
 buffers. The completion not only indicates that the entire
message
 was received, but that it has been entirely delivered to host
memory.
Actually, A knows the data is in B's memory when A gets the
completion
notice. 
This is incorrect for both iWARP and IB. A completion at A only
means that the receiving HCA / RNIC has the data and has generated an
acknowledgement. It does not indicate that B's HCA / RNIC has flushed the data
to host memory. Hence, the fault zone remains the HCA / RNIC, and
while A may free the associated buffer for other usage, it should not
rely upon the data having been delivered to host memory on B. This is one
of the fault scenarios I raised during the initial RDS transparent
recovery assertions. If A were to issue a RDMA Read to B
targeting the associated RDMA Write memory location, then it could know the
data has been placed in B's memory.
B can't rely on anything
unless A uses the RDMA write with
immediate which puts a completion event in B's CQ.
Most applications on B ignore this requirement and test for the last
memory location being modified which usually works but doesn't
guarantee that all the data is in memory.
B cannot rely on anything until a completion is seen, either through an
immediate or a subsequent Send. It is not wise to rely upon
IHV-specific behaviors when designing an application, as even an IHV can
change things over time, or due to interoperability requirements things
may not work as desired - which is definitely a customer complaint that
many would like to avoid.
BTW, the reason immediate data is 4 bytes in length is that this was what was
defined in VIA. Many within the IBTA wanted to get rid of immediate
data, but due to the requirement to support legacy VIA applications the
immediate value was left in place. The need to support a larger
value was not apparent. One needs to keep in mind where the
immediate resides within the wire protocol and its usage model. The
past usage was to signal a PID or some other unique identifier that could
be used to comprehend which thread of execution should be informed of a
particular completion event. Four bytes is sufficient to
communicate such information without significantly complicating the wire
protocol or making it too inefficient.
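
A hedged sketch (libibverbs) of the RDMA Write with Immediate pattern 
mentioned above: the 4-byte immediate generates a receive completion on B, 
which is what B should wait for before touching the written buffer.  The 
QP, SGE, rkey and remote address are assumed to have been set up already.

#include <stdint.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>

static int write_with_imm(struct ibv_qp *qp, struct ibv_sge *sge,
                          uint64_t remote_addr, uint32_t rkey, uint32_t imm)
{
    struct ibv_send_wr wr = { 0 }, *bad;

    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    wr.imm_data            = htonl(imm);   /* the 4-byte immediate discussed above */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad);
}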
Mike 

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] basic IB doubt

2006-08-28 Thread Michael Krause


At 02:58 PM 8/24/2006, Sean Hefty wrote:
We're trying to create
*inter-operable* hardware and
software in this community. So we follow the IB standard.
Atomic operations and RDD are optional, yet still part of the IB
standard. An
application that makes use of either of these isn't guaranteed to operate
with
all IB hardware. I'm not even sure that CAs are required to
implement RDMA
reads.
A TCA is not required to support RDMA Read. A HCA is
required. 
It is correct that atomics and reliable datagram are optional.
However, that does not mean they cannot be used or will not work in an
interoperable manner. The movement to software multiplexing over
a RC (a technique HP delivered to some ISVs years ago) may make RD
obsolete from an execution perspective, but that does not mean it is not
interoperable. As for atomics, well, they are part of IB and many
within MPI would like to see their support. Their usage should also
be interoperable.

 It's up to
the application to verify that the hardware that they're
 using provides the required features, or adjust accordingly,
and
 publish those requirements to the end users.

If that was being done (and it isn't), it would still be bad for
the
ecosystem as a whole.
Applications should drive the requirements. Some poll on memory
today. A lot
of existing hardware provides support for this by guaranteeing that the
last
byte will always be written last. This doesn't mean that data
cannot be placed
out of order, only that the last byte is
deferred.
Much of this debate seems really to be about how software chose to implement
polling of a CQ versus polling of memory. Changing IB or iWARP
semantics to compensate for what some might view as a sub-optimal
implementation does not seem logical, as others have been able to poll a CQ
without such overheads in other environments. In fact, during the
definition of IB and iWARP, it was with this knowledge that we felt the
need to change the semantics was not required. 
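
A minimal sketch of CQ polling (as opposed to peeking at data buffers), 
which is the model the spec text steers software toward.

#include <infiniband/verbs.h>

static int wait_one_completion(struct ibv_cq *cq, struct ibv_wc *wc)
{
    int n;

    do {
        n = ibv_poll_cq(cq, 1, wc);      /* returns 0 while the CQ is empty */
    } while (n == 0);

    if (n < 0 || wc->status != IBV_WC_SUCCESS)
        return -1;                        /* poll or completion error */
    return 0;                             /* data for this WR is now visible */
}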

Again, if a vendor
wants to work with applications written this way, then this
is a feature that should be provided. If a vendor doesn't care
about working
with those applications, or wants to require that the apps be rewritten,
then
this feature isn't important.
But I do not see an issue with a vendor adding value beyond what's
defined in the spec.
It all comes down to how much of the solution needs to be fully
interoperable and how much needs to be communicated as optional
semantics. You could always define an API for applications to
communicate their capabilities that go beyond a specification. This
is in part the logic behind an iSCSI login or SDP Hello exchange where
the capabilities are communicated in a standard way so software does the
right thing based on the components involved. Changing fundamentals
of IB and iWARP seems a bit much when it is much easier to have the ULP
provide such an exchange of capabilities if people feel they are truly
required.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] basic IB doubt

2006-08-25 Thread Michael Krause


At 11:55 AM 8/25/2006, Greg Lindahl wrote:
On Fri, Aug 25, 2006 at
10:00:50AM -0400, Thomas Bachman wrote:
 Not that I have any stance on this issue, but is this is the text in
the
 spec that is being debated? 
 
 (page 269, section 9.5, Transaction Ordering):
 An application shall not depend upon the order of data writes
to
 memory within a message. For example, if an application sets up
 data buffers that overlap, for separate data segments within a
 message, it is not guaranteed that the last sent data will
always
 overwrite the earlier.
No. The case we're talking about is different from the example.
There's text elsewhere which says, basically, that you can't access
the data buffer until seeing the completion.
 I'm assuming that the spec
authors had reason for putting this in there, so
 maybe they could provide guidance here?
We put that text there to accommodate differing memory controller
architectures / coherency protocol capabilities / etc. Basically,
there is no way to guarantee that the memory is in a usable and correct
state until the completion is seen. This was intended to guide
software to not peek at memory but to examine a completion queue entry so
that if memory is updated out of order, silent data corruption would not
occur. 

I can't speak for
the authors, but as an implementor, this has a huge impact on
implementation.
For example, on an architecture where you need to do work such as
flushing the cache before accessing DMAed data, that's done in the
completion. x86 in general is not such an architecture, but they exist.
IB is intended to be portable to any CPU
architecture.
Invalidation protocol is one concern. The other is that a completion
notification also often acts as a flush of the local I/O fabric as
well. In the case of a RDMA Write, the only way to safely determine
complete delivery was to have a RDMA Write / Send with completion
combination, or a RDMA Write / RDMA Read, depending upon which side
required such completion knowledge.

For iWarp, the
issue is that packets are frequently reordered.
Neither IP nor Ethernet reorders packets that often in practice.
The same is true for packet drop rates (the real issue for packet drop is the
impact on performance and recovery times, which is why IB was not designed
to work over long or diverse topologies where intermediate elements may
see what might be termed a high packet loss rate).
Mike

-- greg



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] basic IB doubt

2006-08-25 Thread Michael Krause


At 12:53 PM 8/25/2006, Talpey, Thomas wrote:
At 03:23 PM 8/25/2006, Greg
Lindahl wrote:
On Fri, Aug 25, 2006 at 03:21:20PM -0400, [EMAIL PROTECTED]
wrote:

 I presume you meant invalidate the cache, not flush it, before

accessing DMA'ed 
 data. 

Yes, this is what I meant. Sorry!
Flush (sync for_device) before posting.
Invalidate (sync for_cpu) before processing.
On some architectures, these operations flush and/or invalidate
i/o pipeline caches as well. As they should.
Many platforms have coherent I/O components so the explicit requirements
on software to participate are often eliminated.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] basic IB doubt

2006-08-25 Thread Michael Krause


At 10:45 AM 8/25/2006, Tom Tucker wrote:
On Fri, 2006-08-25 at 12:51
-0400, Talpey, Thomas wrote: 
 At 12:40 PM 8/25/2006, Sean Hefty wrote:
  Thomas How does an adapter guarantee
that no bridges or other
  Thomas intervening devices reorder
their writes, or for that
  Thomas matter flush them to memory at
all!?
 
 That's a good point. The HCA would have to do a read
to flush the
 posted writes, and I'm sure it's not doing that (since it
would add
 horrible latency for no good reason).
 
 I guess it's not safe to rely on ordering of RDMA writes
after all.
 
 Couldn't the same point then be made that a CQ entry may come
before the data
 has been posted?
 
 When the CQ entry arrives, the context that polls it off the
queue
 must use the dma_sync_*() api to finalize any associated data
 transactions (known by the uper layer).
 
 This is basic, and it's the reason that a completion is so
important.
 The completion, in and of itself, isn't what drives the
synchronization.
 It's the transfer of control to the processor.
This is a giant rat hole. 
On a coherent cache architecture, the CQE write posted to the bus
following the write of the last byte of data will NOT be seen by the
processor prior to the last byte of data. That is, write ordering is
preserved in bridges.
The dma_sync_* API has to do with processor cache, not transaction
ordering. In fact, per this argument at the time you called
dma_sync_*,
the processor may not have seen the reordered transaction yet, so
what
would it be syncing?
Write ordering and read ordering/fence is preserved in intervening
bridges. What you DON'T know is whether or not a write (which was
posted
and may be sitting in a bridge FIFO) has been flushed and/or
propagated
to memory at the time you submit the next write and/or interrupt the
host. 
If you submit a READ following the write, however, per the PCI bus
ordering rules you know that the data is in the target. 
Unless, of course, I'm wrong ... :-)
A PCI read following a write to the same address will validate
that all prior write transactions are flushed to host memory. This
is one way that people have used (albeit with a performance penalty) to
verify that a transaction is out of the HCA / RNIC fault zone and
therefore that an acknowledgement to the source means the data is safe, and one
can survive the HCA / RNIC failing without falling into a
non-deterministic state. PCI writes are strongly
ordered on any PCI technology offering. Relaxed ordering
needs to be taken into account w.r.t. writes vs. reads, as do read
completions, which are weakly ordered.
Mike 

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] basic IB doubt

2006-08-25 Thread Michael Krause


At 09:50 AM 8/25/2006, Caitlin Bestler wrote:

[EMAIL PROTECTED] wrote:
 Thomas How does an adapter guarantee that
no bridges or other
 Thomas intervening devices reorder their
writes, or for that
 Thomas matter flush them to memory at
all!?
 
 That's a good point. The HCA would have to do a read to
flush the
 posted writes, and I'm sure it's not doing that (since it would
add
 horrible latency for no good reason).
 
 I guess it's not safe to rely on ordering of RDMA writes after
all.
 
 Couldn't the same point then be made that a CQ entry may come
 before the data has been posted?
 
That's why both specs (IBTA and RDMAC) are very explicit that all
prior messages are complete before the CQE is given to the user.
It is up to the RDMA Device and/or its driver to guarantee this
by whatever means are appropriate. An implementation that allows
a CQE post to pass the data placement that it is reporting on the
PCI bus is in error.
The critical concept of the Work Completion is that it consolidates
guarantees and notifications. The implementation can do all sorts
of strange things that it thinks optimize *before* the work
completion,
but at the time the work completion is delivered to the user
everything
is supposed to be as expected.

Caitlin's logic is correct and the basis for why these two specifications
call out this issue. And yes, Roland, one cannot rely upon RDMA
Write ordering whether for IB or iWARP. iWARP specifically allows out of
order delivery. IB while providing in-order delivery due to its
strong ordering protocol still has no guarantees when it comes to the
memory controller and I/O technology being used. Given not
everything was expected to operate over PCI, we made sure that the
specifications pointed out these issues so that software would be
designed to accommodate all interconnect attach types and usage
models. We wanted to maximize the underlying implementation options
while providing software with a consistent operating model to enable it
to be simplified as well.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] Multicast traffic performace of OFED 1.0 ipoib

2006-08-02 Thread Michael Krause



Is the performance being measured on an identical topology and hardware
set as before? Multicast by its very nature is sensitive to
topology, hardware components used (buffer depth, latency, etc.) and
workload occurring within the fabric. Loss occurs as a function of
congestion or lack of forward progress resulting in a timeout and thus a
toss of a packet. If the hardware is different or the
settings chosen are changed, then the results would be expected to
change. 
It is not clear what you hope to achieve with such tests as there will be
other workloads flowing over the fabric which will create random HOL
blocking which can result in packet loss. Multicast workloads
should be tolerant of such loss. 
Mike

At 04:30 AM 8/2/2006, Moni Levy wrote:
Hi,
 we are doing some performance testing of multicast
traffic over
ipoib. The tests are performed by using iperf on dual 1.6G AMD PCI-X
servers with PCI-X Tavor cards with 3.4.FW. Below are the command
the
may be used to run the test.
Iperf server:
route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
/home/qa/testing-tools/iperf-2.0.2/iperf -us -B 224.4.4.4 -i 1
Iperf client:
route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
/home/qa/testing-tools/iperf-2.0.2/iperf -uc 224.4.4.4 -i 1 -b 100M
-t
400 -l 100
We are looking for the max PPT rate (100 byte packets size) without
losses, by changing the BW parameter and looking at the point where
we
get no losses reported. The best results we received were around 50k
PPS. I remember that we got some 120k-140k packets of the same size
running without losses.
We are going to look into it and try to see where is the time spent,
but any ideas are welcome.
Best regards,
Moni


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [openfabrics-ewg] OFED 1.1 release - schedule and features

2006-07-13 Thread Michael Krause


At 03:49 PM 7/12/2006, Fabian Tillier wrote:
Hi Mike,
On 7/12/06, Michael Krause [EMAIL PROTECTED] wrote:

At 09:48 AM 7/12/2006, Jeff Broughton wrote:
Modifying the sockets API is
just defining yet another RDMA API, and we have
so many already
I disagree. This effort has distilled the API to basically one for
RDMA
developers. Applications are supported over this via either MPI or
Sockets.
There's been a lot of effort to make the RDMA verbs easy to use.
With
the RDMA CM, socket-like connection semantics can be used to
establish
the connection between QPs. The connection establishment is the
hard
part - doing I/O is trivial in comparisson. This verbs and RDMA
CM
have nothing to do with MPI.
If an application is going to be RDMA aware, I don't see any reason
it
shouldn't just use the verbs directly and use the RDMA CM to
establish
the connections.
What's your point? It seems you are in agreement that there is a
single RDMA API that people can use.


 It
seems rather self limiting to think the traditional BSD synchronous
Sockets API is all the world should be able to use when it comes to
Sockets.
Sockets developers could easily incorporate the extensions into
their
applications providing them with improved designs and flexibility
without
having to learn about RDMA itself.
Wait, you want applications to be able to register memory and issue
RDMA operations, but not have to learn about RDMA? How does that
make
sense?
The Sockets API extensions allow developers to register
memory. That has been a desire by many when it comes to SDP
or copy-avoidance technology, as it optimizes the performance path by
eliminating the need to do per-op registration. Applications
which already know their working sets can use this to
enable the OS and underlying infrastructure to take advantage of that
fact to improve performance and the quality of the solution. The
extensions provide the async communications and event collection
mechanisms to also improve performance over the rather limiting select /
poll supported by Sockets today.
The extensions currently do not support explicit RDMA, but it is rather
trivial to add such calls and remove the need to interject SDP if
desired. The benefits of such new API extensions are
there for those that want to eliminate one more ULP with the unfortunate
IP cloud hanging over it.


If the couple
of calls necessary to
extend this API to support direct RDMA would allow them to eliminate
SDP
entirely, well, that has benefits that go beyond just "it's all Sockets";
For a socket implementation to support RDMA, the socket must have an
underlying RDMA QP. This means that if you want the application
to
not have to be verbs-aware, you can't really get rid of SDP - you're
just extending SDP to let the application have a part in memory
registration and RDMA, while still supporting the traditional BSD
operations. This is IMO more complex than just letting
applications
interface directly with verbs, especially since the SDP
implementation
will size the QP for its own use, without a means for negotiating
with
the user so that you don't cause buffer
overruns.
Please take a look at the API extensions. I never stated that
one gets rid of SDP unless one adds the RDMA-explicit calls.

As for complexity, well, the goal is to extend to Sockets developers the
optimal communication paradigm already available on an OS such as Windows,
without having to live with the same unfortunate constraints imposed by
the OS. The same logic applies to extending the benefits derived from MPI,
which supports async communications as well as put / get semantics
analogous to the additional RDMA interfaces I referenced.
I find it strange that people would argue against improving the Sockets
developer's tool suite when the benefits are already proven elsewhere
within the industry and even within this open source effort. Giving
the millions of Sockets developers the choice of a set of extensions that
work over both RDMA and traditional network stacks seems like a
no-brainer. Trying to force them to use a native RDMA API, even if
semantically similar to Sockets, seems like a poor path to pursue.
Leave the RDMA API to the middleware providers and those that need to be
close to the metal.


it also eliminates
the IP cloud that hovers over SDP licensing. Something
that many developers and customers would appreciate.
I believe that Microsoft's IP claims only apply to SDP over IB -- I
don't believe SDP over iWARP is affected. I don't know how the RDMA
verbs moving towards a hardware-independent model (wrt IB vs. iWARP)
affects the IP claims, but it should certainly make things interesting if
a single SDP code base can work over both IB and iWARP.
SDP is SDP and it isn't just restricted to IB. I'll leave it
to the lawyers to sort it out but having a single SDP with minor code
execution path deltas for the IB-specifics isn't that hard to
construct. It has been done on other OS already.
Mike



Re: [openib-general] OFED 1.1 release - schedule and features

2006-07-12 Thread Michael Krause


At 12:59 AM 7/12/2006, Tziporet Koren wrote:
Scott Weitzenkamp (sweitzen)
wrote:
 For SDP, I would like to see improved stability (maybe
you have this 
 in mind under beta quality), also how about AIO
support? The rest 
 of the list looks good.
 
Yes - beta quality means improved stability.
AIO is not planned for 1.1 (schedule issue). If needed we can add it to 1.2.
Would be nice if people thought about implementing the Sockets API
Extensions from the OpenGroup. They provide explicit memory
management and async communications which will allow SDP performance to
be fully exploited. The benefits go beyond what is found in
AIO or on other OS such as Windows. If one were to extend slightly
to have explicit RDMA Read and Write from the Sockets API, then it would
be quite possible to eliminate SDP entirely for new applications leaving
SDP strictly for legacy Sockets environments.
Mike

Tziporet
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [openfabrics-ewg] OFED 1.1 release - schedule and features

2006-07-12 Thread Michael Krause


At 09:48 AM 7/12/2006, Jeff Broughton wrote:

Mike,

The whole purpose of SDP
is to make sockets go faster without having to have the applications
modified. This is what the customers want. I've heard this
time and time again, across a wide spectrum of
customers.
I am well aware of this. However, Linux / Unix do not support async
communications, which severely limits the potential performance benefits
of SDP. When we wrote the SDP specification it was fully understood
that optimal performance is achieved through async communications. We
spent considerable time constructing SDP to support both synchronous and
asynchronous communication paradigms, from which many applications would
benefit.
Customers want to be able to use RDMA interconnects without recompilation,
and through the use of SDP and shared libraries this is certainly
practical to execute. Developers, however, are not the same as customers;
it is developers who would benefit from the Sockets extensions, and this
would in turn benefit customers.

Modifying the sockets API is
just defining yet another RDMA API, and we have so many already

I disagree. This effort has distilled the API to basically one for
RDMA developers. Applications are supported over this via either
MPI or Sockets. It seems rather self limiting to think
the traditional BSD synchronous Sockets API is all the world should be
able to use when it comes to Sockets. Sockets developers could
easily incorporate the extensions into their applications providing them
with improved designs and flexibility without having to learn about RDMA
itself. If the couple of calls necessary to extend this API to
support direct RDMA would allow them to eliminate SDP entirely, well,
that has benefits that go beyond just "it's all Sockets"; it also eliminates
the IP cloud that hovers over SDP licensing. Something that
many developers and customers would appreciate.
In the end, this effort could choose to progress Sockets technology and
extend the number of developers and applications that can achieve optimal
performance with only minor knowledge growth or they can live with the
limitations of the BSD Sockets API and either accept performance loss or
be forced to jump through the hoops of using other rather niche or
obscure API to accomplish what is possible with a small number of Sockets
extensions which were defined by people with years of experience
implementing Sockets and working with application developers.
Mike

-Jeff




From:
[EMAIL PROTECTED]
[
mailto:[EMAIL PROTECTED]] On Behalf Of Michael
Krause

Sent: Wednesday, July 12, 2006 9:23 AM

To: Tziporet Koren; Scott Weitzenkamp (sweitzen)

Cc: OpenFabricsEWG; openib

Subject: Re: [openfabrics-ewg] [openib-general] OFED 1.1 release
- schedule and features


At 12:59 AM 7/12/2006, Tziporet Koren wrote:

Scott Weitzenkamp (sweitzen) wrote:

 For SDP, I would like to see improved stability
(maybe you have this 

 in mind under beta quality), also how about
AIO support? The rest 

 of the list looks good.

 

Yes - beta quality means improved stability.

AIO is not planned for 1.1 (schedule issue). If needed we can add it to 1.2.

Would be nice if people thought about implementing the Sockets API
Extensions from the OpenGroup. They provide explicit memory
management and async communications which will allow SDP performance to
be fully exploited. The benefits go beyond what is found in
AIO or on other OS such as Windows. If one were to extend slightly
to have explicit RDMA Read and Write from the Sockets API, then it would
be quite possible to eliminate SDP entirely for new applications leaving
SDP strictly for legacy Sockets environments.

Mike


Tziporet

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver.

2006-06-23 Thread Michael Krause


At 10:14 AM 6/23/2006, Grant Grundler wrote:
On Fri, Jun 23, 2006 at
04:04:31PM +0200, Arjan van de Ven wrote:
  I thought the posted write WILL eventually get to adapter
memory. Not
  stall forever cached in a bridge. I'm wrong?
 
 I'm not sure there is a theoretical upper bound 
I'm not aware of one either since MMIO writes can travel
across many other chips that are not constrained by
PCI ordering rules (I'm thinking of SGI
Altix...)
It is processor / coherency backplane technology specific as to the
number of outstanding writes. There is also no guarantee that such
writes will hit the top of the PCI hierarchy in the order they were
posted in a multi-core / processor system. Hence, it is up to
software to guarantee that ordering is preserved and to not assume
anything about ordering from a hardware perspective. Once a
transaction hits the PCI hierarchy, then the PCI ordering rules apply and
depending upon the transaction type and other rules, what is guaranteed
is deterministic in nature.

 (and if it's
several msec per bridge, then you have a lot of latency
 anyway)
That's what my original concern was when I saw you point this out.
But MMIO reads here would be expensive and many drivers tolerate
this latency in exchange for avoiding the MMIO read in the
performance path.
As the saying goes, MMIO Reads are pure evil and should be
avoided at all costs if performance is the goal. Even in a
relatively flat I/O hierarchy, the additional latency is non-trivial and
can lead to a significant loss in performance for the system.
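(To illustrate the point, a typical driver doorbell path looks roughly like the sketch below, written in Linux kernel style. The struct desc layout, RING_MASK, and the doorbell register are hypothetical and for illustration only: software orders the descriptor store ahead of the doorbell with a barrier, and deliberately avoids an MMIO read in the fast path.)

    #include <linux/io.h>
    #include <linux/types.h>

    #define RING_MASK 255                 /* illustrative ring of 256 entries  */

    struct desc { __le64 addr; __le32 len; __le32 flags; };   /* hypothetical  */

    struct ring {
        struct desc  *descs;              /* coherent DMA memory               */
        u32           tail;
        void __iomem *db;                 /* ioremap()ed doorbell register     */
    };

    static void post_descriptor(struct ring *r, const struct desc *d)
    {
        r->descs[r->tail] = *d;
        r->tail = (r->tail + 1) & RING_MASK;

        /* Software must order the descriptor store ahead of the doorbell:
         * nothing guarantees the two stores reach the top of the PCI
         * hierarchy in posting order on a multi-core system.  Once inside
         * the PCI hierarchy, the PCI ordering rules take over.             */
        wmb();
        writel(r->tail, r->db);           /* posted MMIO write - cheap         */

        /* No readl() flush here: an MMIO read would stall the CPU for the
         * full round trip through the I/O hierarchy, which is exactly the
         * latency cost warned about in the performance path.               */
    }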

Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] Mellanox HCAs: outstanding RDMAs

2006-06-15 Thread Michael Krause



As one of the authors of IB and iWARP, I can say that both Roland and
Todd's responses are correct and the intent of the specifications.
The number of outstanding RDMA Reads are bounded and that is communicated
during session establishment. The ULP can choose to be aware of
this requirement (certainly when we wrote iSER and DA we were well aware
of the requirement and we documented as such in the ULP specs) and track
from above so that it does not see a stall or it can stay ignorant and
deal with the stall as a result. This is a ULP choice and has been
intentionally done that way so that the hardware can be kept as simple as
possible and as low cost as well while meeting the breadth of ULP needs
that were used to develop these technologies. 
Tom, you raised this issue during iWARP's definition and the debate was
conducted at least several times. The outcome of these debates is
reflected in iWARP and remains aligned with IB. So, unless you
really want to have the IETF and IBTA go and modify their specs, I
believe you'll have to deal with the issue just as other ULP are doing
today and be aware of the constraint and write the software
accordingly. The open source community isn't really the right forum
to change iWARP and IB specifications at the end of the day. Build
a case in the IETF and IBTA and let those bodies determine whether it is
appropriate to modify their specs or not. And yes, it is a modification of
the specs, and therefore of the hardware implementations as well, and it
would have to address any interoperability requirements that would result
(the change proposed could fragment the hardware offerings, as there are
many thousands of devices in the market that would not necessarily support
this change).
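(For reference, the bound is exactly what the connection parameters carry. A minimal sketch of how a ULP using librdmacm would advertise and honor it; the values shown are arbitrary examples.)

    #include <rdma/rdma_cma.h>
    #include <stdint.h>
    #include <string.h>

    /* Active side: advertise how many RDMA Reads we may issue toward the peer
     * (initiator_depth) and how many the peer may issue toward us
     * (responder_resources).  The CM carries these in the REQ/REP exchange.   */
    static int connect_with_read_limits(struct rdma_cm_id *id,
                                        uint8_t ord, uint8_t ird)
    {
        struct rdma_conn_param param;

        memset(&param, 0, sizeof(param));
        param.initiator_depth     = ord;   /* outbound RDMA Read limit          */
        param.responder_resources = ird;   /* inbound RDMA Read slots we offer  */
        param.retry_count         = 7;
        param.rnr_retry_count     = 7;
        return rdma_connect(id, &param);
    }

    /* Passive side: the values the peer requested arrive with the connect
     * event; accept with values no larger than the local HCA supports.        */
    static int accept_with_read_limits(struct rdma_cm_event *ev)
    {
        struct rdma_conn_param param = ev->param.conn;
        return rdma_accept(ev->id, &param);
    }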
Mike


At 12:07 PM 6/6/2006, Talpey, Thomas wrote:
Todd, thanks for the set-up. I'm
really glad we're having this discussion!
Let me give an NFS/RDMA example to illustrate why this upper layer,
at least, doesn't want the HCA doing its flow control, or resource
management.
NFS/RDMA is a credit-based protocol which allows many operations in
progress at the server. Let's say the client is currently running
with
an RPC slot table of 100 requests (a typical value).

Of these requests, some workload-specific percentage will be reads,
writes, or metadata. All NFS operations consist of one send from
client to server, some number of RDMA writes (for NFS reads) or
RDMA reads (for NFS writes), then terminated with one send from
server to client.
The number of RDMA read or write operations per NFS op depends
on the amount of data being read or written, and also the memory
registration strategy in use on the client. The highest-performing
such strategy is an all-physical one, which results in one RDMA-able
segment per physical page. NFS r/w requests are, by default, 32KB,
or 8 pages typical. So, typically 8 RDMA requests (read or write)
are
the result.
To illustrate, let's say the client is processing a multi-threaded
workload, with (say) 50% reads, 20% writes, and 30% metadata
such as lookup and getattr. A kernel build, for example. Therefore,
of our 100 active operations, 50 are reads for 32KB each, 20 are
writes of 32KB, and 30 are metadata (non-RDMA). 
To the server, this results in 100 requests, 100 replies, 400 RDMA
writes, and 160 RDMA Reads. Of course, these overlap heavily due
to the widely differing latency of each op and the highly
distributed
arrival times. But, for the example this is a snapshot of current
load.
The latency of the metadata operations is quite low, because lookup
and getattr are acting on what is effectively cached data. The reads
and writes however, are much longer, because they reference the
filesystem. When disk queues are deep, they can take many ms.
Imagine what happens if the client's IRD is 4 and the server ignores
its local ORD. As soon as a write begins execution, the server posts
8 RDMA Reads to fetch the client's write data. The first 4 RDMA
Reads
are sent, the fifth stalls, and stalls the send queue! Even when
three
RDMA Reads complete, the queue remains stalled, it doesn't unblock
until the fourth is done and all the RDMA Reads have been
initiated.
But, what just happened to all the other server send traffic? All
those
metadata replies, and other reads which completed? They're stuck,
waiting for that one write request. In my example, these number 99
NFS
ops, i.e. 654 WRs! All for one NFS write! The client operation
stream
effectively became single threaded. What good is the rapid
initiation
of RDMA Reads you describe in the face of this?
Yes, there are many arcane and resource-intensive ways around it.
But the simplest by far is to count the RDMA Reads outstanding, and
for the *upper layer* to honor ORD, not the HCA. Then, the send queue
never blocks, and the operation stream never loses parallelism. This
is what our NFS server does.
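(A minimal sketch of that upper-layer accounting, assuming a single issuing thread and the standard libibverbs calls; the structure and field names are illustrative only, not any particular server's implementation.)

    #include <infiniband/verbs.h>

    /* Illustrative per-connection accounting: 'ord' is the peer's advertised
     * inbound RDMA Read limit learned at connection establishment.            */
    struct read_throttle {
        struct ibv_qp      *qp;
        unsigned            ord;           /* negotiated limit                 */
        unsigned            outstanding;   /* RDMA Reads posted, not completed */
        struct ibv_send_wr *deferred;      /* LIFO of deferred reads; a real
                                              ULP would preserve FIFO order    */
    };

    static int post_rdma_read(struct read_throttle *t, struct ibv_send_wr *wr)
    {
        struct ibv_send_wr *bad;

        if (t->outstanding >= t->ord) {    /* would stall the send queue       */
            wr->next    = t->deferred;     /* queue it in software instead     */
            t->deferred = wr;
            return 0;
        }
        t->outstanding++;
        return ibv_post_send(t->qp, wr, &bad);
    }

    /* Call from the completion handler whenever an RDMA Read completes. */
    static void rdma_read_done(struct read_throttle *t)
    {
        struct ibv_send_wr *wr, *bad;

        t->outstanding--;
        if (t->deferred) {                 /* release one deferred read        */
            wr          = t->deferred;
            t->deferred = wr->next;
            wr->next    = NULL;
            t->outstanding++;
            ibv_post_send(t->qp, wr, &bad);
        }
    }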
As to the depth of IRD, this is a different calculation: it's a delay x
bandwidth product of the RDMA Read stream. 4 is good for local,
low-latency connections.
But over a 

Re: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI?

2006-06-12 Thread Michael Krause


At 10:44 AM 6/9/2006, Scott Weitzenkamp (sweitzen)
wrote:
While we're talking
about MTUs, is the IB MTU tunable in uDAPL and/or Intel MPI via env var
or config file?

Looks like Intel MPI
2.0.1 uses 2K for IB MTU like MVAPICH does in OFED 1.0 rc4 and rc6, I'd
like to try 1K with Intel MPI.
IB MTU should be set on a per path basis by the SM. An application
should examine the PMTU for a given path and take appropriate action -
really only applies to UD as connected mode should automatically SAR
requests. Communicating PMTU to an application should not occur
unless it is datagram based. The same is true for iWARP where
TCP / IP takes care of the PMTU on behalf of the ULP / application.
If you want to control PMTU, then do so via the SM directly which was the
intention of the architecture and specification.
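(As an illustration only: a UD consumer using librdmacm could take the PMTU from the SM-provided path record after route resolution rather than hard-coding it. A sketch, assuming the rdma_cm_id has completed rdma_resolve_route().)

    #include <rdma/rdma_cma.h>
    #include <infiniband/verbs.h>

    /* After rdma_resolve_route(), the CM holds the SM's path record for this
     * destination; path_rec[0].mtu is an enum ibv_mtu code (1 = 256 bytes up
     * to 5 = 4096 bytes).  Clamp it to the local port's active MTU.           */
    static enum ibv_mtu usable_path_mtu(struct rdma_cm_id *id)
    {
        struct ibv_port_attr port;
        enum ibv_mtu mtu = IBV_MTU_2048;            /* conservative fallback    */

        if (id->route.num_paths > 0)
            mtu = (enum ibv_mtu) id->route.path_rec[0].mtu;

        if (!ibv_query_port(id->verbs, id->port_num, &port) &&
            port.active_mtu < mtu)
            mtu = port.active_mtu;
        return mtu;
    }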
Mike

Scott



From: [EMAIL PROTECTED]
[
mailto:[EMAIL PROTECTED]] On Behalf Of Scott
Weitzenkamp (sweitzen)
Sent: Thursday, June 08, 2006 4:38 PM
To: Tziporet Koren; [EMAIL PROTECTED]
Cc: openib-general
Subject: RE: [openib-general] OFED-1.0-rc6 is
available
The MTU change undos the
changes for bug 81, so I have reopened bug 81
(
http://openib.org/bugzilla/show_bug.cgi?id=81).

With rc6, PCI-X osu_bw and
osu_bibw performance is bad, and PCI-E osu_bibw performance is bad.
I've enclosed some performance data, look at rc4 vs rc5 vs rc6 for
Cougar/Cheetah/LionMini.

Are there other benchmarks
driving the changes in rc6 (and rc4)?

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems





OSU MPI:
* Added mpi_alltoall fine tuning parameters
* Added default configuration/documentation file $MPIHOME/etc/mvapich.conf
* Added shell configuration files $MPIHOME/etc/mvapich.csh, $MPIHOME/etc/mvapich.csh
* Default MTU was changed back to 2K for InfiniHost III Ex and InfiniHost III Lx HCAs. For InfiniHost cards the recommended value is: VIADEV_DEFAULT_MTU=MTU1024

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] Re: Mellanox HCAs: outstanding RDMAs

2006-06-09 Thread Michael Krause



Whether iWARP or IB, there is a fixed number of RDMA Requests allowed to
be outstanding at any given time. If one posts more RDMA Read
requests than the fixed number, the transmit queue is stalled. This
is documented in both technology specifications. It is something
that all ULP should be aware of and some go so far as to communicate that
as part of the Hello / login exchange. This allows the ULP
implementation to determine whether it wants to stall or wants to wait
until Read Responses complete before sending another request. This
isn't something silent; this isn't something new; this is something for
the ULP implementation to decide how to deal with the issue.

BTW, this is part of the hardware and associated specifications so it is
up to software to deal with the limited hardware resources and the
associated consequences. Please keep in mind that there are a
limited number of RDMA Request / Atomic resource slots at the
receiving HCA / RNIC. These are kept in hardware thus one must know
the exact limit to avoid creating protocol problems. A ULP
transmitter may post to the transmit queue more than the allotted slots
but the transmitting (source) HCA / RNIC must not issue them to the
remote. These requests do cause the source to stall. This is
a well understood problem, and if people give the iSCSI / iSER, DA, or
SDP specs a good read they can see that this issue is comprehended. I
agree with people that ULP designers / implementers must pay close
attention to this constraint: it is in the iWARP / IB specifications for
a very good reason, and these semantics must be preserved to maintain the
ordering requirements that are used by the overall RDMA protocols
themselves.
Mike

At 05:24 AM 6/6/2006, Talpey, Thomas wrote:
At 03:43 AM 6/6/2006, Michael S.
Tsirkin wrote:
Quoting r. Talpey, Thomas [EMAIL PROTECTED]:
 Semantically, the provider is not required to provide any such
flow control
 behavior by the way. The Mellanox one apparently does, but it is
not
 a requirement of the verbs, it's a requirement on the upper
layer. If more
 RDMA Reads are posted than the remote peer supports, the
connection
 may break.

This does not sound right. Isn't this the meaning of this field:
Initiator Depth: Number of RDMA Reads & atomic operations outstanding at any time?
Shouldn't any provider enforce this limit?
The core spec does not require it. An implementation *may* enforce
it,
but is not *required* to do so. And as pointed out in the other
message,
there are repercussions of doing so.
I believe the silent queue stalling is a bit of a time bomb for upper
layers,
whose implementers are quite likely unaware of the danger. I greatly
prefer an implementation which simply sends the RDMA Read request,
resulting in a failed (but unblocked!) connection. Silence is a very
dangerous thing, no matter how helpful the intent.
Tom.

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] QoS RFC - Resend using a friendly mailer

2006-05-30 Thread Michael Krause



High-level feedback:
- An IB fabric could be used for a single ULP and still require
QoS. The issue is how to differentiate flows on a given shared
element within the fabric.
- QoS controls must be dynamic. The document references initialization as
the time when decisions are made but obviously that is just a first pass
on use of the fabric and not what it will become in potentially a short
period of time.
- QoS also involves multi-path support (not really touched upon in terms
of specifics in this document). Distributing or segregating work
even if for the same ULP should be done across multiple or distinct
paths. In one sense this may complicate the work but in another it
is simpler in that arbitration controls for shared links become easier to
manage if the number of flows is reduced. 
- IP over IB defines a multicast group which is ultimately a spanning
tree. That should not constrain what paths are used to communicate
between endnode pairs. That only defines the multicast paths which
are not strongly ordered relative to the unicast traffic. Further
IP over IB may operate using the RC mode between endnodes. It is
very simple to replicate RC and then segregate these into QoS domains
(one could just align priority with the 802.1p for simplicity and
practical execution) which can in turn flow over shared or distinct
paths.
- IB is a centrally managed fabric. Adding in SID into records and
such really isn't going to help solve the problem unless there is also a
centralized management entity well above IB that can prioritize
communication service rates for different ULP and endnode pairs.
Given most of these centralized management entities are rather ignorant
of IB at the moment, this presents a chicken-egg dilemma which is further
complicated by developing SOA technology. It might be more valuable
in one sense to examine SOA technology and how it is translating itself
to say Ethernet and then see how this can be leveraged to IB.
- QoS needs to examine the sums of the consumers of a given path and
their service rate requirements. It isn't just about setting a
priority level but also about the packet injection rate to the fabric on
that priority. This needs to be taken into account as
well.
Overall, it is not clear to me what the end value of this document is.
The challenge for any network admin is to translate SOA-driven
requirements into fabric control knob settings. Without such
translation algorithms / understanding, it is not clear that there is
anything truly missing in the IBTA spec suite or that this RFC will
really advance the integration of IB into the data center in a truly
meaningful manner.
Mike

At 07:53 AM 5/30/2006, Eitan Zahavi wrote:
To: OPENIB
openib-general@openib.org
Subject: QoS RFC - Resend using a friendly mailer
--text follows this line--
Hi All 
Please find the attached RFC describing how QoS policy support could be
implemented in the OpenFabrics stack.
Your comments are welcome.
Eitan

RFC: OpenFabrics Enhancements for QoS Support

===
Authors: Eitan Zahavi [EMAIL PROTECTED]
Date:  May 2006.
Revision: 0.1
Table of contents:
1. Overview
2. Architecture
3. Supported Policy
4. CMA functionality
5. IPoIB functionality
6. SDP functionality
7. SRP functionality
8. iSER functionality
9. OpenSM functionality
1. Overview

Quality of Service requirements stem from the realization of I/O
consolidation over an IB network: as multiple applications and ULPs share
the same fabric, means to control their use of the network resources are
becoming a must. The basic need is to differentiate the service levels
provided to different traffic flows, such that a policy can be enforced
to control each flow's utilization of the fabric resources.
The IBTA specification defined several hardware features and management
interfaces to support QoS:
* Up to 15 Virtual Lanes (VLs) can carry traffic in a non-blocking manner
* Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high-priority credits to be processed before low priority is served
* Packets carry a class-of-service marking in the range 0 to 15 in their header SL field
* Each switch can map an incoming packet by its SL to a particular output VL based on a programmable table: VL = SL-to-VL-MAP(in-port, out-port, SL)
* The Subnet Administrator controls each communication flow's parameters by providing them in response to a Path Record query
The IB QoS features provide the means to implement a DiffServ-like
architecture. The DiffServ architecture (IETF RFC 2474 and RFC 2475) is
widely used today in highly dynamic fabrics.
This proposal provides the detailed functional definition for the various
software elements that are required to enable a DiffServ-like
architecture over the OpenFabrics software stack.
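(To make the control point concrete: the per-flow knob a consumer ultimately touches is the SL placed in the QP's address vector, while the SM programs the SL-to-VL tables and VL arbitration. A minimal, illustrative sketch of applying an SL to an RC QP during the INIT-to-RTR transition; the other attribute values are placeholders, not part of the RFC.)

    #include <infiniband/verbs.h>
    #include <string.h>

    /* Apply the SL chosen by the QoS policy (e.g. taken from the SA path
     * record) to an RC QP during the INIT-to-RTR transition.  Switches map
     * the SL to a VL using the SM-programmed SL-to-VL tables.                */
    static int rc_qp_to_rtr_with_sl(struct ibv_qp *qp, uint16_t dlid,
                                    uint32_t dest_qpn, uint32_t rq_psn,
                                    uint8_t port, uint8_t sl)
    {
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.qp_state           = IBV_QPS_RTR;
        attr.path_mtu           = IBV_MTU_2048;
        attr.dest_qp_num        = dest_qpn;
        attr.rq_psn             = rq_psn;
        attr.max_dest_rd_atomic = 1;
        attr.min_rnr_timer      = 12;
        attr.ah_attr.dlid       = dlid;
        attr.ah_attr.sl         = sl;          /* class-of-service mark, 0..15 */
        attr.ah_attr.port_num   = port;

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
    }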


2. Architecture

This proposal split 

Re: [openib-general] re RDS missing features

2006-05-03 Thread Michael Krause


At 10:42 AM 5/1/2006, Ranjit Pandit wrote:
On 5/1/06, Or Gerlitz
[EMAIL PROTECTED] wrote:
Can you elaborate on each of the
features, specifically the following
points are of interest to us:
+1 so you running Oracle Loopback traffic over RDS sockets? if yes,
what
the issue here?
the openib CMA supports listen/connect on loopback addresses (eg
127.0.0.1 or IPoIB local address)
Yes.
There is no issue. It's just next in line for me to implement.

+2 by failover, are you referring to APM? that is failover between
IB
pathes to/from the same HCA
over which the original connection/QP was established or you are
talking
on failover between HCAs
Failover within and across HCAs. APM does not work for failover across
HCAs.
That is because two different types of failover are being discussed. APM
is completely transparent to the IB RC connections, thus there is no
disruption or loss of data. Failover across HCAs is in effect replaying
ULP transactions across a new RC connection. Without an application /
ULP level acknowledgement, there is still a hole in the RDS proposal, one
that has been raised and acknowledged as recently as the Sonoma
get-together.
I still have not seen a response to my inquiry about this ULP and its API
changes being at least comprehended beyond the Oracle usage model, and
perhaps being reviewed within the IETF given that it represents changes
in API and communication semantics. If the goal is to have RDS be a
generic service, then it should be reviewed and validated by other
potential consumers as well as those subsystems that may be impacted.
Mike


+3 is the no
support for /proc like for RDS an issue to run crload or
demo Oracle (that is specific tuning
and usage of non defaults is needed for any/optimal
operation)
No, this does not affect core functionality. You should be able to
run
Oracle or crload without this feature.
That was a list of things that still need to be implemented for GA
and
not just demo

Or.
[openfabrics-ewg] Before we can start testing - we needto ensure
that
RDS is fully ported.
Pandit, Ranjit rpandit at silverstorm.com
Following features are yet to be implemented in OpenFabric Rds:
1. Failover
2. Loopback connections
3. support for /proc fs like Rds config,
stats and info.

Ranjit



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] re RDS missing features

2006-05-01 Thread Michael Krause



Given this is an extension to Sockets, should it not also be reviewed by
the Sockets owners?
What about the API itself? Are there any plans to make this portable to
other OS / endnodes, or to have a spec and associated wire protocol that
is reviewed, perhaps in the IETF, so it is applicable to more than just
Oracle? It seems this really should be standardized within the IETF
to gain broad adoption and ensure it will be interoperable across all
implementations, not just OpenFabrics'.

At 10:42 AM 5/1/2006, Ranjit Pandit wrote:
On 5/1/06, Or Gerlitz
[EMAIL PROTECTED] wrote:
Can you elaborate on each of the
features, specifically the following
points are of interest to us:
+1 so you running Oracle Loopback traffic over RDS sockets? if yes,
what
the issue here?
the openib CMA supports listen/connect on loopback addresses (eg
127.0.0.1 or IPoIB local address)
Yes.
There is no issue. It's just next in line for me to implement.

+2 by failover, are you referring to APM? that is failover between
IB
pathes to/from the same HCA
over which the original connection/QP was established or you are
talking
on failover between HCAs
Failover within and across HCAs. APM does not work for failover across
HCAs.
For OpenFabrics, one would need to have this work across RNICs as
well. APM is not part of iWARP, so it can't be relied upon.


+3 is the no
support for /proc like for RDS an issue to run crload or
demo Oracle (that is specific tuning
and usage of non defaults is needed for any/optimal
operation)
No, this does not affect core functionality. You should be able to
run
Oracle or crload without this feature.
That was a list of things that still need to be implemented for GA
and
not just demo

Or.
[openfabrics-ewg] Before we can start testing - we needto ensure
that
RDS is fully ported.
Pandit, Ranjit rpandit at silverstorm.com
Following features are yet to be implemented in OpenFabric Rds:
1. Failover
2. Loopback connections
3. support for /proc fs like Rds config,
stats and info.

Ranjit



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] mthca FMR correctness (and memory windows)

2006-03-29 Thread Michael Krause


At 05:10 PM 3/20/2006, Fabian Tillier wrote:
On 3/20/06, Talpey, Thomas
[EMAIL PROTECTED] wrote:
 Ok, this is a longer answer.

 At 06:08 PM 3/20/2006, Fabian Tillier wrote:
 As to using FMRs to create virtually contiguous regions, the
last data
 I saw about this related to SRP (not on OpenIB), and resulted in
a
 gain of ~25% in throughput when using FMRs vs the full
frontal DMA
 MR. So there is definitely something to be gained by
creating
 virutally contiguous regions, especially if you're doing a lot
of RDMA
 reads for which there's a fairly low limit to how many can be
in
 flight (4 comes to mind).

 25% throughput over what workload? And I assume, this was with
the
 lazy deregistration method implemented with the current
fmr pool?
 What was your analysis of the reason for the improvement - if it
was
 merely reducing the op count on the wire, I think your issue lies
elsewhere.
This was a large block read workload (since HDDs typically
give
better read performance than write). It was with lazy
deregistration,
and the analysis was that the reduction of the op count on the wire
was the reason. It may well have to do with how the target chose
to
respond, though, and I have no idea how that side of things was
implemented. It could well be that performance could be
improved
without going with FMRs.
Quite often performance is governed by the target more than the
initiator, as the target is in turn governed by its local cache and disk
mechanism performance / capacity. Large data movements typically are a
low op count from the initiator's perspective; therefore it seems a bit
odd to state that performance can be dramatically impacted by the op
count on the wire.

 Also, see
previous paragraph - if your SRP is fast but not safe, then only
 fast but not safe applications will want to use it. Fibre channel
adapters
 do not introduce this vulnerability, but they go fast. I can show
you NFS
 running this fast too, by the way.
Why can't Fibre Channel adapters, or any locally attached hardware for
that matter, DMA anywhere in memory? Unless the chipset somehow protects
against it, doesn't locally attached hardware have free rein over DMA?
As a general practice, future volume I/O chipsets across multiple market
segments will implement an IOMMU to restrict where DMA is allowed.
Both AMD and Intel have recently announced specifications to this effect
which reflect what has been implemented in many non-x86 chipset
offerings. Whether a given OS always requires this protection to be
enabled is implementation-specific but it is something that many within
the industry and customer base require.
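(In driver terms, the way an OS lets such an IOMMU police device access is by routing all device-visible addresses through the DMA mapping API; a minimal Linux-style sketch, where the device and buffer are placeholders.)

    #include <linux/dma-mapping.h>

    /* With an IOMMU enabled, the bus address handed back here is a translated
     * window: the device can only reach memory the driver has mapped, rather
     * than having free rein over physical memory.                             */
    static dma_addr_t map_for_device(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t bus = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

        return dma_mapping_error(dev, bus) ? 0 : bus;
    }

    static void unmap_for_device(struct device *dev, dma_addr_t bus, size_t len)
    {
        dma_unmap_single(dev, bus, len, DMA_TO_DEVICE);  /* revokes access */
    }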
Mike

Also, please don't take my anecdotal benchmark results as an
endorsement of the Mellanox FMR design - the data was presented to me
by Mellanox as a reason to add FMR support to the Windows stack
(which
currently uses the full frontal approach due to limitations
of the
verbs API and how it needs to be used for storage). I never had
a
chance to look into why the gains where so large, and it could be
either the SRP target implementation, a hardware limitation, or a
number of other issues, especially since a read workload results in
RDMA Writes from the target to the host which can be pipelined much
deeper than RDMA Reads.
- Fab


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] Re: RFC: e2e credits

2006-03-24 Thread Michael Krause


At 03:31 AM 3/24/2006, Hal Rosenstock wrote:
On Thu, 2006-03-23 at 12:34,
Michael S. Tsirkin wrote:
 Quoting r. Hal Rosenstock [EMAIL PROTECTED]:
 Sean, just to wrap it
up, the API at the verbs layer will look
 like the below, and then
ULPs just put the value they want in
 the CM and CM will pass
it in to low level.
 
 So this is our question, right?
 
 
 CM REQ and REP messages include the following field:
 
 ---
 12.7.26 END-TO-END FLOW CONTROL
Signifies whether the local CA actually implements End-to-End Flow Control (1), or instead always advertises "invalid credits" (0). See section 9.7.7.2 End-to-End (Message Level) Flow Control for more detail.
 ---
 
 Consider and implementation that advertises valid credits for
 connections, and always advertises invalid credits for other
connections.
 This is compliant since the IB spec says (end-to-end (message level)
flow
 control, Requester Behaviour):
 Even a responder which does generate end-to-end credits may
choose to send the
 'invalid' code in the AETH
I did some spec reading to find this and found the following which I
think makes the current requirement clear:
p.347 line 37 states HCA receive queues must generate
end-to-end
credits (except for QPs associated with a SRQ), but TCA receive
queues
are not required to do so. This appears to be informative text. I
first
found the following:
p. 348 has the compliances for this for both HCAs and TCAs:
C9-150.2.1: For QPs that are not associated with an SRQ, each HCA receive queue shall generate end-to-end flow control credits. If a QP is associated with an SRQ, the HCA receive queue shall not generate end-to-end flow control credits.
o9-95.2.1: Each TCA receive queue may generate end-to-end credits except for QPs that are associated with an SRQ. If a TCA supports SRQ, the TCA must not generate End-to-End Flow Control Credits for QPs associated with an SRQ.
C9-151: If a TCA's given receive queue generates End-to-End credits,
then the corresponding send queue shall receive and respond to those
credits. This is a requirement on each send queue of a CA.
The above informative text also references the CA requirements in
chapter 17 and on p. 1026 line 25 there is a row in the table for end
to
end flow control for RC consistent with the above. p.1028 has the
compliances for this.
 Is it compliant for CM implementations to set/clear the End-to-End
Flow Control
 field accordingly, taking it to mean
 
 whether the local CA actually implements End-to-End Flow
Control
 (1), or instead always advertises 'invalid credits'(0)
 *for the specific connection*
So IMO the intent of what was written is clear (on a per CA basis)
and
this is a spec change which is OK to propose but needs a different
writeup.
The spec was written to minimize the impact to TCAs in terms of e2e
credits. HCAs are expected to use e2e credits all the time, sans SRQ,
which is a special multiplexing case where e2e isn't that beneficial /
logical to support. Each connection must negotiate this per CA pair,
as no single policy was deemed practical across all usage models. I
don't think there would be much support for changing the spec.
BTW, iWARP does not support e2e credits as it relies upon the ULP to
advertise and track its buffer usage. It was therefore deemed
unnecessary.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] mthca FMR correctness (and memory windows)

2006-03-23 Thread Michael Krause


At 04:30 PM 3/20/2006, Talpey, Thomas wrote:
At 06:00 PM 3/20/2006, Sean
Hefty wrote:
Can you provide more details on this statement? When are you
fencing the send 
queue when using memory windows?
Infiniband 101, and VI before it. Memory windows fence later
operations
on the send queue until the bind completes. It's a misguided attempt
to
make upper layers' job easier because they can post a bind
and then
immediately post a send carrying the rkey. In reality, it introduces
bubbles
in the send pipeline and reduces op rates
dramatically.
The requirement / semantics were derived from the ULP being used to
construct the technology. The combination of a bind-n-send
operation was to reduce the software interactions with the device by
consolidating this into a combo operation. I do not follow your logic
that this creates a bubble in the send pipeline, as there were also
ordering and correctness issues w.r.t. subsequent operations posted to
the send queue. The bind-n-send is a single operation, and its fence
semantics were required to allow the bind to complete before informing
the remote of the subsequent information, in order to avoid race
conditions.

I argued against
them in iWARP verbs, and lost. If Linux could introduce
a way to make the fencing behavior optional, I would lead the
parade.
I fear most hardware is implemented otherwise.
Hardware generally implements operations in the order they are posted to
a given QP, i.e. it is a serial execution flow that allows pipelined
operations to be posted and executed by the hardware. Scaling is
achieved by executing across a set of QP and thus a set of
resources. The ordering domain requirements are kept simple to
allow low-cost hardware implementations. This does not preclude
software from executing across a set of QP in any order that it
desires. 
Yes, I know about
binding on a separate queue. That doesn't work, because windows are
semantically not fungible (for security
reasons).
You could always simply allow a region to be accessible across multiple
operations but then again storage argued that it must only be accessible
for a single op thus things like FMR, bind-n-send, etc. were all
created. To say that storage was not listened to or their needs
were not met or balanced against what is practical to implement in either
the creation of IB or iWARP is simply incorrect. 
Mike

Tom.


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] IPoIB and lid change

2006-02-10 Thread Michael Krause


At 09:43 AM 2/10/2006, Grant Grundler wrote:
On Fri, Feb 10, 2006 at
11:05:34AM -0500, Hal Rosenstock wrote:
  Hi, Roland!
  One issue we have with IPoIB is that IPoIB may cache a remote
node path
  for a long time. Remote LID may get changed e.g. if the SM is
changed,
  and IPoIB might lose connectivity.
I wonder if this is why when I reload the IB drivers on one node
I sometimes have to reload them on other nodes too. Otherwise
ping over IPoIB doesn't work.
If endnodes are not periodically refreshing their caches or are not
subscribing to event management to be informed that a refresh is in
order, then endnodes will fall out of sync and would need to be restarted
to re-establish communication. This is a classic problem that was
illustrated in various early router protocols and is why today's
protocols implement a two-pronged approach in many cases - limited
cache lifetime and proactive cache event updates.

 The remote LID
may get changed for other reasons too without an SM
 change (SM merge of 2 separate subnets). How can this be handled
?
Isn't this just another case of the SM changing for one of the
subnets?
An SM merge that involves updating LIDs is a non-trivial event. It
requires connections to be effectively restarted, as one cannot otherwise
ascertain whether all packets are flushed from the fabric - that can
cause silent data corruption. For a subsystem such as IP over IB, a
LID update should result in an unsolicited ARP / ND exchange, which will
cause all remote endnodes to receive the new information.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal

2006-02-09 Thread Michael Krause


At 03:36 PM 2/8/2006, Arlin Davis wrote:
Roland Dreier wrote:
Michael> So, here we have a long discussion on attempting to perpetuate a
Michael> concept that is not universal across transports and was deemed to
Michael> have minimal value that most wanted to see removed from the
Michael> architecture.
But this discussion is being driven by an application developer who
does see value in immediate data.
Arlin, can you quantify the benefit you see from RDMA write with
immediate vs. RDMA write followed by a send?

We need speed and simplicity.
A very latency sensitive application that requires immediate notification
of RDMA write completion on the remote node without ANY latency penalties
associated with combining operations, HCA priority rules across QPs, wire
congestion, etc. An application that has no requirement for messaging
outside of remote rdma write completion notifications. The application
would not have to register and manage additional message buffers on
either side, we can just size the queues accordingly and post zero byte
messages. We need something that would be equivalent to sitting there
polling on the last byte of inbound data. But, since data ordering within
an operation is not guaranteed, that is not an option. So, RDMA with
immediate data is the most optimal and simplistic method for indication
of RDMA-write completion that we have available today. In fact, I would
like to see it increased in size to make it even more
useful.
RDMA Write with Immediate is part of the IB Extended Transport
Header. It is a fixed-sized quantity and not one subject to change,
i.e. increasing its size.
Your argument above reinforces that the particular application need is
IB-specific and thus should not be part of a general API but a
transport-specific API. If the application will only operate
optimally using immediate data, then it is only suitable for an IB
fabric. This reinforces the need for a transport-specific
API.
Those applications that simply want to enable completion notification
when a RDMA Write has occurred can use a general purpose API that is
interconnect independent and whose code is predicated upon a RDMA Write -
Send set of operations. This will enable application portability
across all interconnect types.
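(For reference, the two patterns look like this in verbs terms. A sketch only: it assumes the remote address and rkey were already exchanged by the ULP and that completions are reaped elsewhere; both variants consume a receive work request at the remote side.)

    #include <infiniband/verbs.h>
    #include <arpa/inet.h>
    #include <string.h>

    /* Portable pattern: an RDMA Write carrying the payload, followed by a
     * zero-byte Send whose arrival tells the peer the data has landed.        */
    static int write_then_send(struct ibv_qp *qp, struct ibv_sge *payload,
                               uint64_t raddr, uint32_t rkey)
    {
        struct ibv_send_wr wr[2], *bad;

        memset(wr, 0, sizeof(wr));
        wr[0].opcode              = IBV_WR_RDMA_WRITE;
        wr[0].sg_list             = payload;
        wr[0].num_sge             = 1;
        wr[0].wr.rdma.remote_addr = raddr;
        wr[0].wr.rdma.rkey        = rkey;
        wr[0].next                = &wr[1];

        wr[1].opcode     = IBV_WR_SEND;           /* zero-byte notification    */
        wr[1].send_flags = IBV_SEND_SIGNALED;
        return ibv_post_send(qp, wr, &bad);
    }

    /* IB-only shortcut: a single work request; the 32-bit immediate arrives
     * in the remote side's receive completion.                                */
    static int write_with_imm(struct ibv_qp *qp, struct ibv_sge *payload,
                              uint64_t raddr, uint32_t rkey, uint32_t token)
    {
        struct ibv_send_wr wr, *bad;

        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
        wr.imm_data            = htonl(token);
        wr.sg_list             = payload;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = raddr;
        wr.wr.rdma.rkey        = rkey;
        wr.send_flags          = IBV_SEND_SIGNALED;
        return ibv_post_send(qp, &wr, &bad);
    }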
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA

2006-02-08 Thread Michael Krause


At 11:04 AM 2/8/2006, Michael S. Tsirkin wrote:
Quoting r. Steve Wise
[EMAIL PROTECTED]:
 Subject: Re: [openib-general] Re: [PATCH] [RFC] - example user
moderdmaping/pongprogram using CMA
 
 On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote:
  Quoting r. Sean Hefty [EMAIL PROTECTED]:
   Subject: RE: [openib-general] Re: [PATCH] [RFC] - example
user mode rdmaping/pongprogram using CMA
   
   Steve, looks like you have at most a single receive
work request posted at the
   receive workqueue at all times.
   If true, this is *really* not a good idea,
performance-wise, even if you
   actually have at most 1 packet in flight.
   
   Can you provide some more details on this?
  
  See 9.7.7.2 end-to-end (message level) flow control
  
 
 I just read this section in the 1.2 version of the spec, and I
still
 don't understand what the issue really is? 9.7.7.2 talks about
IBA
 doing flow control based on the RECV WQEs posted. rping always
ensures
 that there is a RECV posted before the peer can send. This is
ensured
 by the rping protocol itself (see the comment at the front of
rping.c
 describing the ping loop).
 
 I'm only ever sending one outstanding message via SEND/RECV. I
would
 rather post exactly what is needed, than post some number of RECVs
just
 to be safe. Sorry if I'm being dense. What am I
missing here?
 
 Steve.
 
As far as I know, the credits are only updated by the ACK messages.
If there is a single work request outstanding on the RQ,
the ACK of the SEND message will have the credit field value 0
(since exactly one receive WR was outstanding, and that is now
consumed).
As a result the remote side will think that there are no receive WQEs and
will slow down (what the spec refers to as "limited WQE").
Correct. The ACK / NAK protocol used by IB is what returns credits. In
order to pipeline and improve performance, you must post multiple receive
work requests to account for the expected round trip time of the fabric
and the associated CA processing.
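(A minimal sketch of that practice with the standard verbs calls; buffer pool management is omitted, and 'depth' is simply whatever covers the round trip for the workload.)

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    static int post_one_recv(struct ibv_qp *qp, void *buf, uint32_t len,
                             uint32_t lkey, uint64_t wr_id)
    {
        struct ibv_sge sge = { .addr = (uintptr_t) buf, .length = len, .lkey = lkey };
        struct ibv_recv_wr wr, *bad;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id   = wr_id;
        wr.sg_list = &sge;
        wr.num_sge = 1;
        return ibv_post_recv(qp, &wr, &bad);
    }

    /* Keep 'depth' receives posted at all times; size depth to cover the
     * fabric round trip plus CA processing, and repost from the completion
     * handler so the advertised credits never drop to the limited state.    */
    static int prime_receives(struct ibv_qp *qp, char *pool, uint32_t msg_size,
                              uint32_t lkey, int depth)
    {
        for (int i = 0; i < depth; i++)
            if (post_one_recv(qp, pool + (uint64_t) i * msg_size, msg_size, lkey, i))
                return -1;
        return 0;
    }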
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA

2006-02-08 Thread Michael Krause


At 11:35 AM 2/8/2006, Steve Wise wrote:
  
  I just read this section in the 1.2 version of the spec, and I
still
  don't understand what the issue really is? 9.7.7.2 talks
about IBA
  doing flow control based on the RECV WQEs posted. rping always
ensures
  that there is a RECV posted before the peer can send.
This is ensured
  by the rping protocol itself (see the comment at the front of
rping.c
  describing the ping loop).
  
  I'm only ever sending one outstanding message via
SEND/RECV. I would
  rather post exactly what is needed, than post some number of
RECVs just
  to be safe. Sorry if I'm being dense. What am
I missing here?
  
  Steve.
  
 
 As far as I know, the credits are only updated by the ACK
messages.
 If there is a single work request outstanding on the RQ,
 the ACK of the SEND message will have the credit field value 0
 (since exactly one receive WR was outstanding, and that is now
consumed).
 
As a result the remote side will think that there are no receive WQEs and
will slow down (what the spec refers to as "limited WQE").
Oh. I understand now. This is an issue with only 1 RQ WQE
posted and
how IB tries to inform the peer transport of the WQE count. For
iWARP,
none of this transport-level flow control happens (and I'm more
familiar
with iWARP than IB).
For iWARP, we decided not to implement application receiver-based flow
control due to two items: TCP provides transport-level flow control (IB
does not provide the equivalent per se), and upon examination of the
majority of the ULPs, they exchange and track the number of receive
buffers allowed to be processed, so there is no need to replicate this
in iWARP. There are some subtleties as well between a message-based
transport and a byte stream such as TCP that go into the equation, but
these are not that important for most application writers to deal with.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

RE: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal

2006-02-08 Thread Michael Krause


At 09:16 PM 2/6/2006, Sean Hefty wrote:
The requirement is to
provide an API that supports RDMA writes with immediate
data. A send that follows an RDMA write is not immediate data,
and the API
should not be constructed around trying to make it so.
To be clear, I believe that write with immediate should be part of the
normal
APIs, rather than an extension, but should be designed around those
devices that
provide it natively.

One thing to keep in mind is that the IBTA workgroup responsible for the
transport wanted to eliminate immediate data support entirely but it was
retained solely to enable VIA application migration (even though the
application base was quite small). If that requirement could have
been eliminated, then it would have been gone in a heartbeat.
Given that an RDMA Write followed by a Send provides the same application
semantics based on the use models, iWARP chose not to support immediate
data.
So, here we have a long discussion on attempting to perpetuate a concept
that is not universal across transports and was deemed to have minimal
value that most wanted to see removed from the architecture. One
has to question the value of trying to develop any API / software to
support immediate data instead of just enabling the preferred method
which is RDMA WRITE - SEND. I agree with those who have contended
that this is difficult to do in a general purpose fashion. When all
of this is taken into account, it seems the only good engineering answer
is to eliminate immediate data support in the software and focus on the
method that works across all interconnects.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB

2005-11-15 Thread Michael Krause


At 12:49 PM 11/14/2005, Nitin Hande wrote:
Michael Krause wrote:
At 01:01 PM 11/11/2005, Nitin
Hande wrote:
Michael Krause wrote:
At 10:28 AM 11/9/2005, Rick
Frank wrote:
Yes, the application is
responsible for detecting lost msgs at the application level - the
transport can not do this.

RDS does not guarantee that a message has been delivered to the
application - just that once the transport has accepted a msg it will
deliver the msg to the remote node in order without duplication - dealing
with retransmissions, etc due to sporadic / intermittent msg loss over
the interconnect. If after accepting the send - the current path fails -
then RDS will transparently fail over to another path - and if required
will resend / send any already queued msgs to the remote node - again
insuring that no msg is duplicated and they are in order. This is
no different than APM - with the exception that RDS can do this across
HCAs.

The application - Oracle in this case - will deal with detecting a
catastrophic path failure - either due to a send that does not arrive and
or a timedout response or send failure returned from the transport. If
there is no network path to a remote node - it is required that we remove
the remote node from the operating cluster to avoid what is commonly
termed as a split brain condition - otherwise known as a
partition in time.

BTW - in our case - the application failure domain logic is the same
whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc.
Basically, if we can not talk to a remote node - after some defined
period of time - we will remove the remote node from the cluster. In this
case the database will recover all the interesting state that may have
been maintained on the removed node - allowing the remaining nodes to
continue. If later on, communication to the remote node is restored - it
will be allowed to rejoin the cluster and take on application load.

Please clarify the following which was in the document provided by
Oracle.
On page 3 of the RDS document, under the section RDP
Interface, the 2nd and 3rd paragraphs state:
 * RDP does not guarantee that a datagram is delivered to the
remote application.
 * It is up to the RDP client to deal with datagrams lost due
to transport failure or remote application failure.
The HCA is still a fault domain with RDS - it does not address flushing
data out of the HCA fault domain, nor does it sound like it ensures that
CQE loss is recoverable.
I do believe RDS will replay all of the sendmsg's that it believes are
pending, but it has no way to determine if already sent sendmsgs were
actually successfully delivered to the remote application unless it
provides some level of resync of the outstanding sends not completed from
an application's perspective as well as any state updated via RDMA
operations which may occur without an explicit send operation to flush to
a known state. 
If RDS could define a mechanism that the application could use to inform
the sender to resync and replay on catastrophic failure, is that a
correct understanding of your suggestion ?
I'm not suggesting anything at this point. I'm trying to reconcile the
documentation with the e-mail statements made by its proponents.
I'm still trying to ascertain
whether RDS completely
recovers from HCA failure
(assuming there is another HCA / path available) between the two
endnodes
Reading the doc and the thread, it looks like we need src/dst ports for
multiplexing connections, seq/ack numbers for resyncing, and some kind of
window availability for flow control. Aren't we very close to a TCP
header?
TCP does not provide an end-to-end acknowledgement to the application as
implemented by most OSes. Unless one ties the TCP ACK to the
application's consumption of the receive data, there is no method to
ascertain that the application really received the data. The application
would be required to send its own application-level acknowledgement. I
believe the intent is for applications to remain responsible for the
end-to-end receipt of data and that RDS and the interconnect are simply
responsible for the exchange at the lower levels.
Yes, a TCP ack only implies that the remote stack has received the data,
and means nothing to the application. It is the application which has to
send an application-level ack to its peer.
TCP ACK was intended to be an end-to-end ACK but implementations took it
to a lower level ACK only. A TCP stack linked into an application
as demonstrated by multiple IHVs and researchers does provide an end-to-end
ACK and considerable performance improvements over the traditional
network stack implementations. Some claim it is more than good
enough to eliminate the need for protocol off-load / RDMA which is true
for many applications (certainly for most Sockets, etc.) but not
true when one takes advantage of the RDMA comms paradigm which has
benefit for a number of applications.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org

Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB

2005-11-11 Thread Michael Krause


At 10:28 AM 11/9/2005, Rick Frank wrote:

Yes, the application is responsible for detecting lost msgs at the
application level - the transport can not do this.

RDS does not guarantee that a message
has been delivered to the application - just that once the transport has
accepted a msg it will deliver the msg to the remote node in order
without duplication - dealing with retransmissions, etc due to sporadic /
intermittent msg loss over the interconnect. If after accepting the send
- the current path fails - then RDS will transparently fail over to
another path - and if required will resend / send any already queued msgs
to the remote node - again ensuring that no msg is duplicated and they
are in order. This is no different than APM - with the exception
that RDS can do this across HCAs. 

The application - Oracle in this case -
will deal with detecting a catastrophic path failure - either due to a
send that does not arrive and/or a timed-out response or send failure
returned from the transport. If there is no network path to a remote node
- it is required that we remove the remote node from the operating
cluster to avoid what is commonly termed as a split brain
condition - otherwise known as a partition in time.

BTW - in our case - the application
failure domain logic is the same whether we are using UDP / uDAPL /
iTAPI / TCP / SCTP / etc. Basically, if we can not talk to a remote node
- after some defined period of time - we will remove the remote node from
the cluster. In this case the database will recover all the interesting
state that may have been maintained on the removed node - allowing the
remaining nodes to continue. If later on, communication to the remote
node is restored - it will be allowed to rejoin the cluster and take on
application load. 
Please clarify the following which was in the document provided by
Oracle. 
On page 3 of the RDS document, under the section RDP
Interface, the 2nd and 3rd paragraphs state:
 * RDP does not guarantee that a datagram is delivered to the
remote application.
 * It is up to the RDP client to deal with datagrams lost due
to transport failure or remote application failure.
The HCA is still a fault domain with RDS - it does not address flushing
data out of the HCA fault domain, nor does it sound like it ensures that
CQE loss is recoverable.
I do believe RDS will replay all of the sendmsgs that it believes are
pending, but it has no way to determine whether already-sent sendmsgs were
actually delivered to the remote application unless it provides some level
of resync covering both the outstanding sends not completed from the
application's perspective and any state updated via RDMA operations, which
may occur without an explicit send operation to flush to a known
state. I'm still trying to ascertain whether RDS completely
recovers from HCA failure (assuming there is another HCA / path
available) between the two endnodes.
Mike


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB

2005-11-10 Thread Michael Krause


At 02:09 PM 11/9/2005, Greg Lindahl wrote:
On Wed, Nov 09, 2005 at
01:57:06PM -0800, Michael Krause wrote:
 What you indicate above is that RDS 
 will implement a resync of the two sides of the association to
determine 
 what has been successfully sent.
More accurate to say that it could implement that. I'm
just
kibitzing on someone else's proposal.
 This then implies that the reliability of the underlying
 interconnect isn't as critical per se as the end-to-end RDS
protocol
 will assure that data is delivered to the RDS components in the
face
 of hardware failures. Correct?
Yes. That's the intent that I see in the proposal. The
implementation
required to actually support this may not be what the proposers had
in
mind.
If it is to be reasonably robust, then RDS should be required to support
the resync between the two sides of the communication. This aligns
with the stated objective of implementing reliability in one location in
software and one location in hardware. Without such a resync being
required in the ULP, one ends up with a ULP that falls short of its
stated objectives and pushes complexity back up to the application, which
is where the advocates have stated it is too complex or expensive to get
it correct.
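A minimal sketch of what such a required resync could look like, assuming a hypothetical sender-side queue and a peer-reported delivery point (none of this is taken from the RDS code):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical resync after failover: the peer reports the highest
     * sequence number it has delivered to its application; the sender then
     * replays only what lies beyond that point, so nothing is duplicated. */
    struct pending_send {
            uint32_t seq;
            struct pending_send *next;
    };

    static void resync_and_replay(struct pending_send *txq, uint32_t peer_delivered)
    {
            struct pending_send *ps;

            for (ps = txq; ps; ps = ps->next) {
                    if ((int32_t)(ps->seq - peer_delivered) > 0)
                            printf("retransmit seq %u\n", (unsigned)ps->seq);  /* not yet delivered */
                    else
                            printf("complete   seq %u\n", (unsigned)ps->seq);  /* delivered before the failure */
            }
    }

    int main(void)
    {
            struct pending_send c = { 12, NULL }, b = { 11, &c }, a = { 10, &b };

            resync_and_replay(&a, 10);   /* peer reports it delivered up to seq 10 */
            return 0;
    }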

This sort of
message service, by the way, has a long history in distributed
computing.
Yep. 
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

RE: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB

2005-11-10 Thread Michael Krause


At 10:48 AM 11/10/2005, Caitlin Bestler wrote:


Mike Krause wrote in response to Greg Lindahl:


If it is to
be reasonably robust, then RDS should be required to
support
 the resync between the two sides of the communication. This
aligns
with the
 stated objective of implementing reliability in one location in
software and
 one location in hardware. Without such resync being required
in the
ULP,
 then one ends up with a ULP that falls shorts of its stated
objectives
and
 pushes complexity back up to the application which is where the
advocates
 have stated it is too complex or expensive to get it correct.

I haven't reread all of RDS fine print to double-check this, but my
impression is that RDS semantics exactly match the subset of MPI
point-to-point communications where the receiving rank is required
to have pre-posted buffers before the send is allowed.

My concern is the requirement that RDS resync the structures in the face
of failure and know whether to re-transmit or to deal with
duplicates. Having pre-posted buffers will help enable the resync
to be accomplished, but pre-posting should not be equated with being able
to deal with duplicates or to verify that duplicates do not
occur.
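To make that distinction concrete, here is a hypothetical receiver-side check that pre-posted buffers alone do not give you; a retransmitted datagram may well land in a posted buffer, and only a sequence check keeps it from being delivered twice:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical receive-side filter. A pre-posted buffer happily accepts a
     * retransmitted datagram; only this sequence check keeps the duplicate
     * from being handed to the application a second time. */
    static bool deliver_if_new(uint32_t *expected_seq, uint32_t seq)
    {
            if (seq != *expected_seq)
                    return false;        /* duplicate or gap: drop or hold it */
            (*expected_seq)++;           /* in order: deliver and advance */
            return true;
    }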
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB

2005-11-09 Thread Michael Krause


At 12:37 PM 11/8/2005, Hal Rosenstock wrote:
On Tue, 2005-11-08 at 15:33,
Ranjit Pandit wrote:
 Using APM is not useful because it doesn't provide failover across
HCA's.
Can't APM be made to work across HCAs ?
No. It requires state that is only within the HCA and there are
other aspects that prevent this, e.g. no single unified QP space across
all HCAs, etc.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB

2005-11-09 Thread Michael Krause


At 12:33 PM 11/8/2005, Ranjit Pandit wrote:
 Mike wrote:
 - RDS does not solve a set of failure models. For
example, if a RNIC / HCA
 were to fail, then one cannot simply replay the operations on
another RNIC /
 HCA without extracting state, etc. and providing some end-to-end
sync of
 what was really sent / received by the application. Yes, one
can recover
 from cable or switch port failure by using APM style recovery but
that is
 only one class of faults. The harder faults either result in
the end node
 being cast out of the cluster or see silent data corruption
unless
 additional steps are taken to transparently recover - again app
writers
 don't want to solve the hard problems; they want that done for
them.
The current reference implementation of RDS solves the HCA failure case
as well.
Since applications don't need to keep connection states, it's easier
to handle cases like HCA and intermediate path failures.
As far as the application is concerned, every sendmsg 'could' result in
a
new connection setup in the driver.
If the current path fails, RDS reestablishes a connection, if
available, on a different port or a different HCA, and replays the
failed messages.
Using APM is not useful because it doesn't provide failover across
HCA's.
I think others may disagree about whether RDS solves the problem.
You have no way of knowing whether something was received or not into the
other node's coherency domain without some intermediary or the application's
involvement to see that the data arrived. As such, you might see many
hardware-level acks occur and not know there is a real failure. If
an application takes any action assuming that send complete means it is
delivered, then it is subject to silent data corruption. Hence, RDS
can replay to its heart's content, but until there is an application or
middleware level of acknowledgement, you have not solved the fault domain
issues. Some may be happy with this as they just cast out the
endnode from the cluster / database, but others see the loss of a server
as a big deal so may not be happy to see this occur. It really
comes down to whether you believe losing a server is worthwhile just
for a local failure event which is not fatal to the rest of the
server.
APM's value is the ability to recover from link failure. It has the
same value for any other ULP in that it recovers transparently to the
ULP.
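A minimal sketch of the application/middleware-level acknowledgement being argued for here, using plain sockets and hypothetical message formats; the point is that neither a send completion nor a transport ACK marks the request as done, only the peer application's reply does:

    #include <stdint.h>
    #include <sys/socket.h>

    /* Hypothetical application-level protocol: each request carries an id and
     * the peer application echoes that id back only after it has consumed the
     * data. A send completion (or transport ACK) alone never marks it done. */
    struct app_request { uint64_t id; char payload[256]; };
    struct app_reply   { uint64_t id; };

    static int send_and_confirm(int fd, const struct app_request *req)
    {
            struct app_reply reply;

            if (send(fd, req, sizeof(*req), 0) != (ssize_t)sizeof(*req))
                    return -1;                      /* transport-level failure */
            if (recv(fd, &reply, sizeof(reply), MSG_WAITALL) != (ssize_t)sizeof(reply))
                    return -1;                      /* no app-level ack yet: unconfirmed */
            return reply.id == req->id ? 0 : -1;    /* done only on a matching ack */
    }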
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB

2005-11-09 Thread Michael Krause


At 11:42 AM 11/9/2005, Greg Lindahl wrote:
On Tue, Nov 08, 2005 at
01:08:13PM -0800, Michael Krause wrote:
 If an application takes any action assuming that send complete
means
 it is delivered, then it is subject to silent data
corruption.
Right. That's the same as pretty much all other *transport* layers.
I
don't think anyone's asserting RDS is any different: you can't
assume
the other side's application received and acted on your message
until
the other side's application tells you that it did.
So, things like HCA failure are not transparent and one cannot simply
replay the operations since you don't know what was really seen by the
other side unless the application performs the resync itself.
Hence, while RDS can attempt to retransmit, the application must deal
with duplicates, etc. or note the error, resync, and retransmit to avoid
duplicates. 
BTW, host-based transport implementations can transparently recover from
device failure on behalf of applications since their state is in the host
and not in the failed device - this is true for networking, storage,
etc. HCA / RNIC / TOE / FC / etc. all lose state or cannot be
trusted and thus must rely upon upper level software to perform the recovery,
resync, retransmission, etc. Unless RDS has implemented its own
state checkpoint between endnodes, this class of failures must be solved
by the application since it cannot be solved in the hardware.
Hence, RDS may push some of its reliability requirements to the
interconnect but it does not eliminate all reliability requirements from
the application or RDS itself.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB

2005-11-09 Thread Michael Krause


At 10:28 AM 11/9/2005, Rick Frank wrote:

Yes, the application is responsible for detecting lost msgs at the
application level - the transport can not do this.

RDS does not guarantee that a message
has been delivered to the application - just that once the transport has
accepted a msg it will deliver the msg to the remote node in order
without duplication - dealing with retransmissions, etc due to sporadic /
intermittent msg loss over the interconnect. If after accepting the send
- the current path fails - then RDS will transparently fail over to
another path - and if required will resend / send any already queued msgs
to the remote node - again ensuring that no msg is duplicated and they
are in order. This is no different than APM - with the exception
that RDS can do this across HCAs. 

The application - Oracle in this case -
will deal with detecting a catastrophic path failure - either due to a
send that does not arrive and/or a timed-out response or send failure
returned from the transport. If there is no network path to a remote node
- it is required that we remove the remote node from the operating
cluster to avoid what is commonly termed as a split brain
condition - otherwise known as a partition in time.

BTW - in our case - the application
failure domain logic is the same whether we are using UDP / uDAPL /
iTAPI / TCP / SCTP / etc. Basically, if we can not talk to a remote node
- after some defined period of time - we will remove the remote node from
the cluster. In this case the database will recover all the interesting
state that may have been maintained on the removed node - allowing the
remaining nodes to continue. If later on, communication to the remote
node is restored - it will be allowed to rejoin the cluster and take on
application load. 
One could be able to talk to the remote node across another HCA, but that
does not mean one has an understanding of the state at the remote node
unless the failure is noted and a resync of state occurs, or the remote is
able to deal with duplicates, etc. This has nothing to do
with the API or the transport involved but is, as Caitlin noted, the
difference between knowing a send buffer is free vs. knowing that the
application received the data requested. Therefore, one has only reduced the
reliability / robustness problem space to some extent but has not solved
it by the use of RDS.
Mike



Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB

2005-11-09 Thread Michael Krause


At 01:24 PM 11/9/2005, Greg Lindahl wrote:
On Wed, Nov 09, 2005 at
12:18:28PM -0800, Michael Krause wrote:
 So, things like HCA failure are not transparent and one cannot
simply 
 replay the operations since you don't know what was really seen by
the 
 other side unless the application performs the resync
itself.
I think you are over-stating the case. On the remote end, the kernel
piece of RDS knows what it presented to the remote application,
ditto
on the local end. If only an HCA fails, and not the sending and
receiving kernels or applications, that knowledge is not lost.
Perhaps you were assuming that RDS would be implemented only in
firmware on the HCA, and there is no kernel piece that knows what's
going on. I hadn't seen that stated by anyone, and of course there
are
several existing and contemplated OpenIB devices that are
considerably
different from the usual offload engine. You could also choose to
implement RDS using an offload engine and still keep enough state in
the kernel to recover.
I hadn't assumed anything. I'm simply trying to understand the
assertions concerning availability and recovery. What you indicate
above is that RDS will implement a resync of the two sides of the
association to determine what has been successfully sent. It will
then retransmit what has not arrived, transparently to the application. This
then implies that the reliability of the underlying interconnect isn't as
critical per se as the end-to-end RDS protocol will assure that data is
delivered to the RDS components in the face of hardware
failures. Correct?
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [swg] Re: [openib-general] round 2 - proposal for socket based connectionmodel

2005-10-25 Thread Michael Krause



Just to correct one comment:
A ULP written to TCP/IP can use RDMA transport without change. An
example is SDP (not that the ULP must use what SDP uses). Also,
please keep in mind that SDP on iWARP uses the port mapper protocol to
obtain the IP address and port to target for the connection
request. So, the TCP connection establishment is to the RDMA listen
endpoint from the start and the SDP hello exchange then fills in the rest
of the parameters required to determine whether the connection should
proceed and what resources should be configured when the response is
generated. 
I will also re-iterate what another person stated and that is to separate
out the interface from the wire protocol. IBTA defines wire
protocols / semantics while OpenIB is defining its API to communicate the
wire protocol and associated semantics. I agree with that person on
this point and their other point on the need for the IBTA to construct a
solid spec for the wire protocol and associated semantics. OpenIB
will then determine how best to implement but these are separate efforts
and it would be more productive for all to table the discussion for
now. The original request was whether something would break if the
private data size was changed. It was noted that one cannot know
what will or will not break, thus the requirement is to provide a method
for software to note the difference in the layout. How to do so is for the
IBTA to specify.
Just a thought..
Mike

At 03:43 PM 10/25/2005, Sean Hefty wrote:
Kanevsky, Arkady wrote:
What are you trying to
achieve?
I'm trying to define a connection *service* for Infiniband that uses
TCP/IP addresses as its user interface. That service will have its
own protocol, in much the same way that SDP, SRP, etc. do today.
I am trying to define an IB REQ
protocol extension that
support IP connection 5-tuple exchange between connection
requestor and responder.
Why? What need is there for a protocol extension to the IB
CM? To me, this is similar to setting a bit in the CM REQ to
indicate that the private data format looks like SDP's private
data. The format of the _private_ data shouldn't be known to the
CM; that's why it's private data.
And define mapping between IP
5-tuple and IB entities.
No mapping between IP - IB addresses was defined in the
proposal. Defining this mapping is required to make this
work. Right now, the mapping is the responsibility of every
user.
That way ULP which was written
to TCP/IP, UDP/IP, CSTP/IP (and so on)
can use RDMA transport without change.
A ULP written to TCP/IP can use an RDMA transport without change.
They use SDP. However, an application that wants to take advantage
of QP semantics must change. (And if they want to take full
advantage of RDMA, they'll likely need to be re-architected as
well.) The goal in that case becomes to permit them to establish
connections using TCP/IP addresses.
To meet this goal, we need to define how to map IP address to and from IB
addresses. That mapping is part of the protocol, and is missing
from the proposal. And if the application isn't going to know that
they're running on Infiniband, then the mapping must also include mapping
to a destination service ID.
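Purely for illustration (this mapping is exactly what the proposal has not yet defined, so every field name here is an assumption), an entry produced by such a mapping would need to carry at least:

    #include <netinet/in.h>
    #include <stdint.h>

    /* Hypothetical resolver output: the pieces needed to turn an IP-addressed
     * endpoint into IB addressing for the CM REQ and the QP's address vector. */
    struct ip_to_ib_mapping {
            struct in6_addr ip;           /* IPv6 or IPv4-mapped address */
            uint16_t        port;         /* TCP/UDP port the ULP uses */
            uint8_t         dgid[16];     /* destination GID on the fabric */
            uint16_t        pkey;         /* partition key */
            uint64_t        service_id;   /* IB CM service ID derived from the port */
    };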
To modify ULP to know that it
runs on top of IB vs. iWARP
vs. (any other RDMA transport) is bad idea.
It is one thing to choose proper port to connect.
Completely different to ask ULP to parse private data
in transport specific way.
The same protocol must support both user level ULPs
and kernel level ULPs.
Defining an interface that allows a ULP to use either iWarp, IB, or some
other random RDMA transport is an implementation issue. However, it
requires something that maps IP to IB addresses (including service
IDs).
To be more concrete, you've gone from having source and destination
TCP/IP addresses to including them in a CM REQ. What translated the
source and destination IP addresses into GIDs and a PKey? Who
converted those into IB routing information? How was the
destination of the CM REQ determined? What service ID was
selected?
- Sean

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

RE: [openib-general] TCP/IP connection service over IB

2005-10-24 Thread Michael Krause


At 12:50 PM 10/21/2005, Fab Tillier wrote:
 From: James Lentini
[
mailto:[EMAIL PROTECTED]]
 Sent: Friday, October 21, 2005 12:38 PM
 
 On Fri, 21 Oct 2005, Sean Hefty wrote:
 
    sean version(8) | reserved(8) | src port (16)
   version(1) | reserved(1) | src port (2)
    sean src ip (16)
    sean dst ip (16)
    sean user private data (56)   /* for version 1 */
   
    Are the numbers in parens in bytes or bits? It looks like a mixture to me.
  
   Uhm.. they were a mix. Changed above to bytes.
  
  Ok. I assume that your 1 byte of version information is broken into 2
  4-bit pieces, one for the protocol version and one for the IP version.
 Doesn't leading-zero-padding the IPv4 addresses to 16 bytes eliminate the
 need for an IP version field?
Not really. The same logic was used in the SDP port mapper for
iWARP where there was still an IP version provided so that the space
remained constant while the end node would know how to parse the
message.
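Restated as a sketch (this only transcribes the version-1 layout quoted above; packing and byte order are left aside, and the struct name is made up):

    #include <stdint.h>

    /* Sketch of the proposed version-1 private data layout. The high and low
     * nibbles of 'version' carry the protocol and IP versions, per the
     * discussion above; IPv4 addresses are zero-padded to 16 bytes. */
    struct cm_req_ip_private_data {
            uint8_t  version;                 /* protocol version | IP version */
            uint8_t  reserved;
            uint16_t src_port;                /* source port */
            uint8_t  src_ip[16];
            uint8_t  dst_ip[16];
            uint8_t  user_private_data[56];   /* left for the ULP */
    };

That accounts for all 92 bytes of private data carried in a CM REQ.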
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[openib-general] Re: [swg] Re: private data...

2005-10-20 Thread Michael Krause



This is really an IBTA issue to resolve and to ensure that backward
compatibility with existing applications is maintained. Hence, this
exercise of who is broken or not is inherently flawed in that one cannot
comprehend all implementations that may exist. Therefore, the spec should
use either a new version number or a reserved bit to indicate that there
is a defined format to the private data portion or not. This
is no different than what is done in other technologies such as
PCIe. Those applications that require the existing semantics will
be confined to the existing associated infrastructure. Those that
want the new IP semantics set the bit / version and operate within the
restricted private data space available. It is that
simple.
Mike

At 07:31 AM 10/20/2005, Jimmy Hill wrote:
A Linux
uDAPL-based system infrastructure application I am working on at IBM
currently depends on 64-bytes of Private Data for Connect and Accept as
well. 
-- jimmy 


Oracle currently depends on 64 bytes of private data
for connect and accept. 
 
- Original Message -
From: Kanevsky, Arkady
To: Davis, Arlin R; [EMAIL PROTECTED]; Grant Grundler
Cc: [EMAIL PROTECTED]; openib-general@openib.org
Sent: Wednesday, October 19, 2005 11:31 AM
Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol
Arlin, 
just to clarify, Intel MPI will not have problems with using less than
64 bytes of private data.

If a solution will provide you with 48 bytes of private data, will it be
sufficient?
Arkady 
 
 
Arkady Kanevsky            email: [EMAIL PROTECTED]
Network Appliance          phone: 781-768-5395
375 Totten Pond Rd.        Fax: 781-895-1195
Waltham, MA 02451-2010     central phone: 781-768-5300
 
-Original Message-
From: Davis, Arlin R [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 19, 2005 11:30 AM
To: [EMAIL PROTECTED]; Grant Grundler
Cc: [EMAIL PROTECTED]; openib-general@openib.org
Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol

Arkady, 
 
Intel MPI (real consumer of uDAPL)
has no problem with this change. 
 
-arlin 
 




From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Kanevsky, Arkady
Sent: Wednesday, October 19, 2005 6:40 AM
To: Grant Grundler; Caitlin Bestler
Cc: Roland Dreier; [EMAIL PROTECTED]; [EMAIL PROTECTED]; openib-general@openib.org
Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol
 
Grant,
The developers of the application(s) in question are aware of the
discussion. I will leave it to them to respond.
I bring the discussion points up at the weekly DAT Collaborative meeting,
which we have every Wednesday.
I apologize that the DAT Collaborative charter does not allow
submitting contributions without joining the DAT Collaborative.
But this is no different from Linux not accepting any contributions
without a proper license.
But rest assured that as Chair I bring the concerns
and suggestions stated in the email discussion to the DAT meetings.
Arkady
Arkady Kanevsky            email: [EMAIL PROTECTED]
Network Appliance          phone: 781-768-5395
375 Totten Pond Rd.        Fax: 781-895-1195
Waltham, MA 02451-2010     central phone: 781-768-5300

 -Original Message-
 From: Grant Grundler
[mailto:[EMAIL PROTECTED]
] 
 Sent: Tuesday, October 18, 2005 8:02 PM
 To: Caitlin Bestler
 Cc: Grant Grundler; Roland Dreier; Kanevsky, Arkady; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 openib-general@openib.org
 Subject: Re: [openib-general] Re: iWARP emulation protocol
 
 
 On Tue, Oct 18, 2005 at 04:40:54PM -0700, Caitlin Bestler
wrote:
   Roland (and the rest of us) would like to see someone name
a
   real consumer of the proposed interface. ie who depends on

   this change?
   Then the dependency for that use/user can be discussed and

   appropriate tradeoffs made. Make sense?
  
  Unfortunately not every application that is under 
 development, or even 
  deployed, can be discussed in a google-searchable public 
 forum. That 
  especially applies to user-mode development.
 
 Well, this is open source. While I don't want to preclude 
 closed source developement, it's usually necessary to have an 
 open source consumer that any open source developer can test
with.
 
  So I could have actually tested such applications and still not
be 
  free to cite them here.
 
 Understood. I'm not asking *you* to cite one unless you
 happen to own one of the consumers. 
 
  With any luck some of them
  are following the discussion and will jump in on their own.

  Unfortunately, since they are developing to uDAPL they are

 unlikely to 
  be following this discussion.
 
 It doesn't help that the DAT yahoo-groups.com mailing list is 
 rejecting my replies. It would be helpful if someone 
 following this forum could share Roland's question with DAT 
 mailing list if it didn't make it there already and possibly 
 explain why naming a consumer is necessary.
 
 hth,
 grant
 



Re: [openib-general] I/O controllers

2005-10-19 Thread Michael Krause


At 10:41 PM 10/18/2005, Mohit Katiyar, Noida wrote:
Hi all,
Can anyone tell me whether there are any specific I/O controllers for the
connection between the TCA and SCSI devices, or will any I/O controller
work between the TCA and SCSI devices?
See various IB vendors for their offerings which include attachment to
various I/O device types. Their web pages contain plenty of
appropriate information.
Mike


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] IB and FC

2005-10-17 Thread Michael Krause



These types of discussions should be taken up with IB technology / OEM
vendors directly as they have nothing to do with development.
Mike

At 06:28 AM 10/15/2005, Mohit Katiyar, Noida wrote:
Hi all,
Sorry, the previous mail got scrapped due to HTML pictures, so now it is
with text pictures.
I just can't clear up a doubt about IB.
In the first figure given below, the max speed that can be obtained
between the client and the I/O storage is 2Gb/s.

[Figure 1: each client is connected to both FC switches; the FC switches
attach over FC cables to the I/O storage.]

In the second figure, the client to IB-FC gateway speed is 10Gb/s, and from
the gateway to the I/O storage it is 2Gb/s, or 4Gb/s if port aggregation is
applied at the gateway. So the total effective speed from client to I/O
storage can at most reach 4Gb/s.

[Figure 2: each client is connected over IB cables to an IB-FC
gateway/router, which connects through the FC switches and FC cables to
the I/O storage.]

So can anyone tell me whether I am correct in my approach? Are there any
other advantages in shifting from the figure 1 architecture to the figure 2
architecture? It does not seem advantageous to shift from an FC SAN to an
IB-FC SAN through such a pattern. Can anyone help me in deciding about this?



Thanks in advance

Mohit Katiyar
___
openib-general mailing list
openib-general@openib.org

http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit

http://openib.org/mailman/listinfo/openib-general


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

RE: [openib-general] [RFC] IB address translation using ARP

2005-10-13 Thread Michael Krause


At 03:14 PM 10/12/2005, Caitlin Bestler wrote:

 -Original Message-
 From: [EMAIL PROTECTED] 

[
mailto:[EMAIL PROTECTED]] On Behalf Of Sean
Hefty
 Sent: Wednesday, October 12, 2005 2:36 PM
 To: Michael Krause
 Cc: openib-general@openib.org
 Subject: Re: [openib-general] [RFC] IB address translation using
ARP
 
 Michael Krause wrote:
  1. Applications want to use existing API to identify remote

 endnodes / 
  services.
 
 To clarify, the applications want to use IP based addressing 
 to identify remote endnotes. The connection API is under
development.
 

No, I think Mike's comment was dead on. Applications want to
use the existing API. They want to use the existing API even
when the API is clearly defective. Note that there are several
generations of host-resolution APIs for the IP world, with the
earlier ones clearly being heavily inferior (not thread safe,
not IPv4/IPv6 neutral, etc). But they have not been eliminated.
Why, because applications want to use the existing API.
If application developers were rational and totally open to
adopt new ideas instantly then the active side would ask to
make a connection to a *service*, not to a host with a service
qualifier.
A new API may be under development to meet new needs. But keep in
mind that the application developers expect it to be as close to
what they are used to as possible, and will grumble that it is
not 100% compatible. 
This all comes down to economics, which is why ULPs such as SDP are
created. Let's examine SDP for a moment. The purpose of SDP
is to enable synchronous and asynchronous Sockets applications to
transparently run unmodified over an RDMA-capable
interconnect. Unmodified means no source code changes and no
recompile required (this is possible if the Sockets library is a shared
library and dynamically linked). The first part of unmodified
means that the existing address / service resolution API calls work
(further, no change to the address family, etc. is required to make this
work either). Hence, pick any of the get* API calls that are in use
today and they should just work. 
How does this work? The SDP implementation takes on the burden for
the application developer. For iWARP, there really isn't anything
special that has to be done as these calls all should provide the
necessary information. The port mapper protocol would be invoked
which would map to the actual RDMA listen QP and target RNIC. For
IB, there is some additional work both in using SID as well as resolving
the IP address to the IB address vector but the work isn't that hard
to implement (we know this because this has all been
implemented on various OS within the industry). The same will be
true for NFS/RDMA and iSER - again all use the existing interfaces to
identify the address / service and map to an address vector (and again,
all of this has been implemented on various OS within the
industry).
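As an example of what "just work" means here, a client like the sketch below uses only the standard resolution and connection calls; a dynamically linked SDP library can map it onto an RDMA transport with no source change or recompile. The host and port arguments are placeholders.

    #include <netdb.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Ordinary sockets client: nothing here is IB- or iWARP-specific, which is
     * what lets SDP take on the transport mapping underneath it. */
    static int connect_to_service(const char *host, const char *port)
    {
            struct addrinfo hints, *res, *ai;
            int fd = -1;

            memset(&hints, 0, sizeof(hints));
            hints.ai_family   = AF_UNSPEC;
            hints.ai_socktype = SOCK_STREAM;

            if (getaddrinfo(host, port, &hints, &res) != 0)
                    return -1;

            for (ai = res; ai; ai = ai->ai_next) {
                    fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
                    if (fd < 0)
                            continue;
                    if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
                            break;                  /* connected */
                    close(fd);
                    fd = -1;
            }
            freeaddrinfo(res);
            return fd;                              /* -1 if no address worked */
    }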
The above makes ISV and customers very happy as they can take advantage
of RDMA technologies without having to go through the lengthy and
expensive qualification process that comes when any application is
modified / recompiled. This keeps costs low and improves
TTM. As for the RDMA connection API, that is simply attempting to
abstract to a common interface that any ULP implementation can use to
access either iWARP or IB. The RDMA connection API should not
be viewed as something end application developers will use but towards
middleware developers. This allows everyone to use IP addresses,
port spaces, etc. through the existing application API while allowing
RDMA to transparently add some intelligence to the process and eventually
enable new capabilities like policy management (e.g. how best to map ULP
QoS needs to a given path, service rate, etc.) without permuting
everything above. Keeping things transparent is best for all.
Attempting to require end application developers to modify their code
will result in slower adoption and reduced utilization of RDMA
technologies within the industry. It really is all about economics
and re-using the existing ecosystem / infrastructure.
Mike


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

RE: [openib-general] [RFC] IB address translation using ARP

2005-10-12 Thread Michael Krause


At 09:59 AM 10/12/2005, Caitlin Bestler wrote:




From:
[EMAIL PROTECTED]
[
mailto:[EMAIL PROTECTED]] On Behalf Of Michael
Krause

Sent: Wednesday, October 12, 2005 8:24 AM

To: Hal Rosenstock; Sean Hefty

Cc: Openib

Subject: RE: [openib-general] [RFC] IB address translation using
ARP


At 07:45 AM 10/10/2005, Hal Rosenstock wrote:

On Sun, 2005-10-09 at 10:19, Sean Hefty wrote: 

 I think iWARP can be on top of TCP or SCTP. But why wouldn't
it care ?

 

 I'm referring to the case that iWarp is running over TCP.
I know that it can

 run over SCTP, but I'm not familiar with the details of that
protocol. With

 TCP, this is an end-to-end connection, so layering iWarp over
it, only the

 endpoints need to deal with it. I believe the same is true
for SCTP.

Yes, SCTP is similar in those regards.

SCTP creates a connection and then multiplexes a set of sessions over
it. You can conceptually think of it as akin to IB RD but where all
QP are bound to the same EEC.


SCTP preserves all QP to
QP semantics, including buffers posted to specific
buffers and credits. So SCTP will allows multiple in-flight messages for
each
RDMA stream in the association.
Yep. This is where iWARP differs from IB RD in that IB restricts
this to a single in-flight message per EEC at a time while iWARP allows
multiple in-flight over either transport type supported. The logic
behind why IB RD was constructed the way it was is somewhat complex but
one of the core requirements was to enable a QP to communicate across
multiple EEC while preserving an ordering domain within an EEC.
Given all of this needed to be implemented in hardware, i.e. without host
software intervention, for both main data path and error management, the
restriction to a single message was required. I and several others
had created a proprietary RDMA RC followed by a RD implementation 10+
years ago so we had a reasonable understanding of the error / complexity
trade-offs. Given the distances were within a usec of each other
and one could support multiple EEC per endnode pair, the performance /
scaling impacts were not seen as overly restrictive and met the software
application usage models quite nicely. Anyway, there are
differences between iWARP / SCTP and IB RD so people cannot equate them
beyond some base conceptual level aspects.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] IRQ sharing on PCIe bus

2005-10-11 Thread Michael Krause


At 02:05 PM 10/10/2005, Roland Dreier wrote:
 Roland
BTW, for INTx emulation on PCI Express, there are no
 Roland physical interrupt lines -- interrupts are
asserted and
 Roland deasserted with messages. So PCI
Express interrupts are
 Roland unshared.
 Michael They are messages upstream that any
device.

^ sent
Sorry. Insert sent above.
That doesn't parse
for me. Was what I said wrong?
No. Just clarifying that they are not unique per device. INTx
being a message does not change the fundamental semantics of a
wire being asserted. Hence, if the wire was shared
before, then there is no reason why this would not be the same with PCIe,
sans the physical wire. It really is an OS issue as to how INTx interrupts
are assigned to different processors and to what extent they end up being
shared. The host bridge can play some tricks as well, as you
noted. Again, the goal within the PCI-SIG is to move people to
MSI-X and to eliminate INTx long-term. In fact, one area under
development is asking the SIG's members whether INTx can be eliminated
entirely which would go a long ways to simplifying designs both in
hardware and software.
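For reference, a driver-side sketch of requesting multiple MSI-X vectors on a Linux kernel of roughly that era; exact prototypes varied across kernel versions, so treat the names and error handling here as illustrative rather than definitive:

    #include <linux/pci.h>

    #define MY_NVEC 3   /* e.g. one vector per completion queue */

    static struct msix_entry my_msix[MY_NVEC];

    static int my_setup_msix(struct pci_dev *pdev)
    {
            int i, err;

            for (i = 0; i < MY_NVEC; i++)
                    my_msix[i].entry = i;          /* request table entries 0..2 */

            err = pci_enable_msix(pdev, my_msix, MY_NVEC);
            if (err)
                    return err;                    /* fall back to (possibly shared) INTx */

            /* Each my_msix[i].vector is now a distinct, unshared interrupt that
             * can be handed to request_irq() with its own handler. */
            return 0;
    }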
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module

2005-10-11 Thread Michael Krause


At 01:09 PM 10/10/2005, Christoph Hellwig wrote:
On Mon, Oct 10, 2005 at
12:53:29PM -0700, Michael Krause wrote:
 standards. There are also the new standard Sockets extension
API available 
 today that might be extended sometime in the future to include
explicit 
which is never going to get into linux. one more of these
braindead
standards people masturbating in a dark room and coming up with a
frankenstein bastard cases. 
Everyone is free to have an opinion. Sockets extensions are not
braindead nor created using whatever methods you envision. The
extensions were created by Sockets engineers with 20+ years
experience. But, hey, why put any faith into people who develop and
implement Sockets for a living? One day perhaps you'll learn a bit
of professionalism and perhaps open your mind that there are people out
in the world besides yourself you don't take a NIH approach to the world
and are actually qualified engineers who have a clue. All you get
with these constant unprofessional diatribes is a continual loss in
credibility. But, hey, that is just an opinion.
BTW, do you feel the same way about the people who created IB? How
about iWARP? How about PCIe? Are all of the engineers who
work on trying to accelerate technology, its performance, etc. who take
into account and try to find a balanced approach to problem solving
simply all in dark little rooms? All of these specs are created by
companies. Those same companies who fund open source efforts and many of
the people working here. 
One last thing, I'm not the only person who feels this way about your
unprofessional behavior. There are many others who simply
don't want to bother writing or have simply written you off as
whatever. Sad state to be in and I suspect you don't care since you
view them all as in dark little rooms anyway. Just something you
might want to keep in mind. There is a much larger world out there
where people value other people's professional opinions and ideas.
They don't simply discount what they produce because it was not done in
whatever form you prefer. It is called reality. Get used to
it.
Mike 

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

RE: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module

2005-10-10 Thread Michael Krause


At 12:13 PM 10/10/2005, Fab Tillier wrote:
 From: Sean Hefty
[
mailto:[EMAIL PROTECTED]]
 Sent: Monday, October 10, 2005 11:16 AM
 
 Michael S. Tsirkin wrote:
  Maybe rdma_connection (these things encapsulate connectin
state)?
  Or, rdma_sock or rdma_socket, since people are used to the fact
that
  connections are sockets?
 
 Any objection to rdma_socket?
I don't like rdma_socket, since you can't actually perform any I/O
operations on
the rdma_socket, unlike normal sockets. We're dealing only with the
connection
part of the problem, and the name should reflect that. So
rdma_connection,
rdma_conn, or rdma_cid seem more appropriate.
Naming should not involve sockets as that is part of existing
standards. There is also the new standard Sockets extensions API
available today that might be extended sometime in the future to include
explicit RDMA support should people decide to bypass SDP and go straight
to a more robust API definition. The Sockets Extensions already
comprehend explicit memory management, async comms, etc. making a
significant improvement over the existing sync Sockets as well as going
further in solving areas like memory management beyond what was done in
Winsocks.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [RFC] IB address translation using ARP

2005-10-10 Thread Michael Krause


At 10:40 AM 10/10/2005, Sean Hefty wrote:
Hal Rosenstock wrote:
What about the case of iWARP
- IB ?
Crossing IB shouldn't matter. iWarp should simply cross the IB
subnet using IPoIB. You could build a gateway to make the transfer
across IB more efficient, but it's not required.
I don't understand this statement. iWARP is RDMA based and if
someone wanted to build a gateway with IB in between, it should be mapped
to an IB RC connection 1:1. Going through IPoIB is a waste and
would result in a very poor performing solution (not that such a solution
would deliver stellar performance to start with. Prior similar
solutions used ULP over IB and the gateway then provided ULP over TOE and
would then be easily extended to do iWARP. In general, you would
want to have defined domains for each interconnect and not try to add
poor ROI superset functionality of one over the other - waste of time and
money.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] IRQ sharing on PCIe bus

2005-10-10 Thread Michael Krause


At 09:22 AM 10/10/2005, Roland Dreier wrote:
 yipee Hi,
My setup is a 3GHz Xeon (x86_64) with a 2.6.13.2
 yipee kernel. A Mellanox memfree PCIe ddr HCA is
connected. Why
 yipee do I see IRQ sharing although I'm using
msi_x and PCIe? 
 yipee Doesn't IRQ sharing only happen on older non
PCIe busses?
I think the messages you see are coming from the ACPI interrupt
routing that is done when the driver calls pci_enable_device().
However, if you use MSI-X then that interrupt won't actually be
used.
If you check /proc/interrupts you should see ib_mthca using 3
non-shared interrupts.
BTW, for INTx emulation on PCI Express, there are no
physical
interrupt lines -- interrupts are asserted and deasserted with
messages. So PCI Express interrupts are unshared.

They are messages upstream that any device.
However, the PCI
Express host bridge turns those interrupts into real interrupts to the
system's interrupt controller, and for that part of the story, it's
entirely possible for two different PCI Express devices to end up sharing
the same interrupt line.
Correct, the host bridge may map them to a monarch processor
and thus any or all devices can share the same interrupt. This is
why within the PCI-SIG we recommend using MSI-X and long-term, many of us
would simply like to drop INTx and make MSI-X mandatory.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] [RFC] IB address translation using ARP

2005-10-10 Thread Michael Krause


At 01:59 PM 10/10/2005, Sean Hefty wrote:
Michael Krause wrote:


What about the case of iWARP
- IB ?
Crossing IB shouldn't matter. iWarp should simply cross the IB
subnet using IPoIB. You could build a gateway to make the transfer
across IB more efficient, but it's not required.I don't
understand this statement. iWARP is RDMA based and if someone

I was referring to the case where both endpoints are running over iWarp,
with IB being one of the subnets being crossed. I believe that
you're referring to one side running over iWarp, and the other running
over IB, with an application level gateway in between.
For the latter case, I would think that the gateway needs to establish
iWarp connections for any IP addresses that reside on the IB subnet
behind it, with a separate IB connection on the back-end. It seems
to me that this would occur transparently to the application using
iWarp.
iWARP with IB in between seems like a waste of time to do (very small if
any market for such a beast). IB HCA on a host with an iWARP edge
device may be reasonable but again seems like a waste to construct.
These types of corner usage models, while of interest to comprehend to see
if there are any architectural issues and to ensure they are not precluded,
really are just that, corner cases, and little time or effort should be
spent on their support.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

RE: [openib-general] [RFC] IB address translation using ARP

2005-10-07 Thread Michael Krause


At 06:38 AM 9/30/2005, Caitlin Bestler wrote:

 -Original Message-
 From: [EMAIL PROTECTED] 

[
mailto:[EMAIL PROTECTED]] On Behalf Of Roland
Dreier
 Sent: Thursday, September 29, 2005 6:50 PM
 To: Sean Hefty
 Cc: Openib
 Subject: Re: [openib-general] [RFC] IB address translation using
ARP
 
 Sean Can you explain how RDMA works in
this case? This is simply
 Sean performing IP routing, and not IB
routing, correct? Are you
 Sean referring to a protocol running on
top of IP or IB directly?
 Sean Is the router establishing a second
reliable connection on
 Sean the backend? Does it simply
translate headers as packets
 Sean pass through in this case?
 
 I think the usage model is the following: you have some magic 
 device that has an IB port on one side and something
else 
 on the other side. Think of something like a gateway that

 talks SDP on the IB side and TCP/IP on the other side.
 
 You configure your IPoIB routing so that this magic device is 
 the next hop for talking to hosts on the IP network on the other
side.
 
 Now someone tries to make an SDP connection to an IP address 
 on the other side of the magic device. Routing tables + ARP

 give it the GID of the IB port of this magic device. It 
 connects to the magic device and run SDP to talk to the magic 
 device, and the magic device magically splices this into a 
 TCP connection to the real destination.
 
 Or the same idea for an NFS/RDMA - NFS/UDP gateway,
etc.
 
Those examples are all basically application level gateways.
As such they would have no transport or connection setup
implications. The application level gateway simply offers
a service on network X that it fulfills on network Y. But
as far as network X is concerned the gateway IS the
server.
It must be viewed as such. The crossover point between the two
domains represents independent management domains, trust domains,
reliable delivery domains, etc. 
I do not believe it
is possible to construct a transport
layer gateway that bridges RDMA between IB and iWARP while
appearing to be a normal RDMA endpoint on both networks.
Higher level gateways will be possible for many
applications, but I don't see how that relates to
connection establishment. That would require having
an end-to-end reliable connection, complete with flow
control semantics, that bridged the two networks by
some method other than encapsulation or tunneling.
We took steps to ensure that both IB and iWARP could transmit packets in
the main data path very efficiently between the two interconnects but it
was never envisioned that a connection was truly end-to-end transparent
across the gateway component. I think most of the architects would
not support such an effort to define such a beast. There are many
issues in attempting such an offering. Just examine all of the
problems with the existing iSCSI to FC solutions; they ignore a number of
customer issues and hence have been relegated in many customer minds as
TTM, play toys not ready for prime time. This is one of the many
reasons why iSCSI has not taken off as the hype portrayed.
It would be best to define a CM architecture that enabled communication
between like endpoints and avoid the gateway dilemma. Let the gateway
provider work out such issues as there are many requirements already on
each side of these interconnects.
Mike


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

RE: [openib-general] [RFC] IB address translation using ARP

2005-10-07 Thread Michael Krause


At 06:24 AM 9/30/2005, Yaron Haviv wrote:
 -Original
Message-
 From: Roland Dreier
[
mailto:[EMAIL PROTECTED]]
 Sent: Thursday, September 29, 2005 9:50 PM
 To: Sean Hefty
 Cc: Yaron Haviv; Openib
 Subject: Re: [openib-general] [RFC] IB address translation using
ARP
 
 I think the usage model is the following: you have some magic
device
 that has an IB port on one side and something else on
the other
 side. Think of something like a gateway that talks SDP on the
IB side
 and TCP/IP on the other side.
 
Also applicable to two IB ports, e.g. forwarding SDP traffic from one
IB
partition to SDP on another partition (may even be the same port
with
two P_Keys), and doing some load-balancing or traffic management in
between, overall there are many use cases for that. 
While I can envision how an endpoint could communicate with another in
separate partitions, doing so really violates the spirit of the
partitioning where endpoints must be in the same partition in order to
see one another and communicate. Attempting to create an
intermediary who has insights into both and then somehow is able to
communicate how to find one another using some proprietary (can't be
through standards that I can think of) method, seems like way too much
complexity to be worth it.
Mike

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

RE: [openib-general] [RFC] libibverbs completion event handling

2005-09-22 Thread Michael Krause


At 03:33 PM 9/21/2005, Caitlin Bestler wrote:
I'm not sure I follow what a
completion channel is.
My understanding is that work completions are stored in
user-accessible memory (typically a ring buffer). This 
enables fast-path reaping of work completions. The OS
has no involvement unless notifications are enabled.
The completion vector is used to report completion
notifications. So is the completion vector a *single*
resource used by the driver/verbs to report completions,
where said notifications are then split into user
context dependent completion channels?
The RDMAC verbs did not define callbacks to userspace
at all. Instead it is assumed that the proxy for user
mode services will receive the callbacks, and how it
relays those notifications to userspace is outside
the scope of the verbs.
Correct.

Both uDAPL and
ITAPI define relays of notifications
to AEVDS/CNOs and/or file descriptors. Forwarding
a completion notification to userspace in order to
make a callback in userspace so that it can kick
an fd to wake up another thread doesn't make much
sense. The uDAPL/ITAPI/whatever proxy can perform
all of these functions without any device dependencies
and in a way that is fully optimal for the usermode
API that is being used. 
Exactly. This was the intention. It does not really matter what the
API is, but that there be an API that does this work on behalf of the
consumer.
For kernel clients,
I don't see any need for anything beyond the already defined
callbacks direct from the device-dependent code.
This was the intention when we designed the verbs.
Even in the typical
case where the usermode application
does an evd_wait() on the DAT or ITAPI endpoint, the
DAT/ITAPI proxy will be able to determine which thread
should be woken and could even do so optimally. It
also allows the proxy to implemenet Access Layer features
such as EVD thresholding without device-specific support.

Correct.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Roland Dreier
Sent: Wednesday, September 21, 2005 12:22 PM
To: openib-general@openib.org
Subject: [openib-general] [RFC] libibverbs completion event handling

While thinking about how to handle some of the issues raised by Al Viro in http://lkml.org/lkml/2005/9/16/146, I realized that our verbs interface could be improved to make delivery of completion events more flexible. For example, Arlin's request for using one FD for each CQ can be accommodated quite nicely.

The basic idea is to create new objects that I call completion vectors and completion channels. Completion vectors refer to the interrupt generated when a completion event occurs. With the current drivers, there will always be a single completion vector, but once we have full MSI-X support, multiple completion vectors will be possible.
When I proposed the use of multiple completion handlers, it was based on the operating assumption that either MSI or MSI-X would be used by the underlying hardware. Either is possible - MSI limits it to a single address with 32 data values, which allows different handlers to be bound to each value though targeting a single processor. MSI-X builds upon technology we've been shipping for nearly 20 years now and allows up to 2048 different addresses, which may target one or multiple processors. Any API should be able to deal with both approaches and thus should not assume anything about whether one or more handlers are bound to a given processor.
Orthogonal to this is the notion of a completion channel. This is an FD used for delivering completion events to userspace.

Completion vectors are handled by the kernel, and userspace cannot change the number of vectors that are available. On the other hand, completion channels are created at the request of a userspace process, and userspace can create as many channels as it wants.

Every userspace CQ has a completion vector and a completion channel. Multiple CQs can share the same completion vector and/or the same completion channel. CQs with different completion vectors can still share a completion channel, and vice versa.

The exact API would be something like the below. Thoughts?
Why wouldn't it just be akin to the verbs interface - here are the event handler and callback routines to associate with a given CQ? The handler might be nothing more than an index into a set of functions that are stored within the kernel - these functions are either device-specific (i.e. supplied by the IHV) or OS-specific, such as dealing with error events (which might also have a device-specific component). When the routine is invoked, it basically has three parameters: the CQ to target, the number of CQEs to reap, and the address at which to store the CQEs. I do not see what more is required.
Mike
 
Thanks,
  Roland

struct ibv_comp_channel {
        int fd;
};

/**
 * ibv_create_comp_channel - Create a completion event channel
 */
extern struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context
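
A minimal sketch of how a consumer might drive the completion-channel flow proposed above - create a channel, attach a CQ to it, then block until an event arrives. It is written against the shape of the API as proposed in this thread, not against any particular shipping implementation; error handling is omitted and an already-opened device context is assumed.

    /* Minimal sketch, assuming an already-opened verbs context 'ctx'. */
    #include <infiniband/verbs.h>

    void wait_one_completion(struct ibv_context *ctx)
    {
        struct ibv_comp_channel *chan;
        struct ibv_cq *cq, *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc;

        chan = ibv_create_comp_channel(ctx);          /* one fd; many CQs may share it */
        cq   = ibv_create_cq(ctx, 64, NULL, chan, 0); /* last argument: completion vector 0 */

        ibv_req_notify_cq(cq, 0);                     /* arm the CQ for the next event */
        ibv_get_cq_event(chan, &ev_cq, &ev_ctx);      /* blocks on the channel's fd */
        ibv_ack_cq_events(ev_cq, 1);
        ibv_req_notify_cq(ev_cq, 0);                  /* re-arm before draining */

        while (ibv_poll_cq(ev_cq, 1, &wc) > 0)
            ;                                         /* reap completions from the ring */

        ibv_destroy_cq(cq);
        ibv_destroy_comp_channel(chan);
    }

The fd inside the channel is what an application (or a uDAPL/ITAPI proxy) would hand to poll()/select(), which is how the one-fd-per-CQ request mentioned above can be satisfied without device-specific code.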

Re: [openib-general][PATCH][RFC]: CMA IB implementation

2005-09-22 Thread Michael Krause


At 05:30 PM 9/21/2005, Caitlin Bestler wrote:

On 9/21/05, Sean Hefty [EMAIL PROTECTED] wrote:

Caitlin Bestler wrote:

That's certainly an acceptably low overhead for iWARP IHVs, provided there are applications that want this control and *not* also need even more IB-specific CM control. I still have the same skepticism I had for the IT-API's exposing of paths via a transport neutral API. Namely, is there really any basis to select amongst multiple paths from transport neutral code? The same applies to caching of address translations on a transport neutral basis. Is it really possible to do in any way that makes sense? Wouldn't caching at a lower layer, with transport/device specific knowledge, make more sense?

I guess I view this API slightly differently than being just a transport neutral connection interface. I also see it as a way to connect over IB using IP addresses, which today is only possible if using ib_at. That is, the API could do both.

Given that purpose I can envision an IB-aware application that needed to use IP addresses and wanted to take charge of caching the translation. But viewing this in a wider scope raises a second question. Shouldn't iSER be using the same routines to establish connections?
While many applications do use IP addresses, unless one goes the route of defining an IP address per path (something that iSCSI does comprehend today), IB multi-path (and I suspect eventually Ethernet's multi-path support) will require interconnect-specific interfaces. Ideally, applications / ULPs define the destination and QoS requirements - what we used to call an address vector. Middleware maps those to an interconnect-specific path on behalf of the application / ULP. This is done underneath the API as part of the OS / RDMA infrastructure. Such an approach works quite well for many applications / ULPs; however, it should not be the only one supported, as it assumes that the OS / RDMA infrastructure is sufficiently robust to apply policy management decisions in conjunction with the fabric management being deployed. Given that IB SMs will vary in robustness, there must also exist APIs that allow applications / ULPs to comprehend the set of paths and select accordingly. I can envision how to construct such knowledge in an interconnect-independent way, but it requires more standardization about what defines the QoS requirements - latency, bandwidth, service rate, no single point of failure, etc. What I see so far does not address these issues.
Mike
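
For readers unfamiliar with the CMA under discussion, the sketch below shows the active-side flow it enables - resolve an IP address to an RDMA device, resolve a route, create the QP, then connect - using the librdmacm calls this work eventually settled on. The timeouts, the NULL protection domain, and the event-loop comments are illustrative assumptions rather than anything specified in the patch itself.

    /* Active-side sketch; error handling and the event loop are elided. */
    #include <rdma/rdma_cma.h>

    int connect_by_ip(struct sockaddr *dst, struct ibv_qp_init_attr *qp_attr)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *id;
        struct rdma_conn_param param = { 0 };

        rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);
        rdma_resolve_addr(id, NULL, dst, 2000);  /* IP address -> local RDMA device */
        /* ... wait for RDMA_CM_EVENT_ADDR_RESOLVED on ec ... */
        rdma_resolve_route(id, 2000);            /* path record lookup on IB */
        /* ... wait for RDMA_CM_EVENT_ROUTE_RESOLVED ... */
        rdma_create_qp(id, NULL, qp_attr);       /* NULL pd: let the CMA pick a default */
        rdma_connect(id, &param);
        /* ... wait for RDMA_CM_EVENT_ESTABLISHED ... */
        return 0;
    }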


Re: [openib-general] Re: RDMA Generic Connection Management

2005-08-31 Thread Michael Krause


At 07:46 AM 8/31/2005, James Lentini wrote:

On Tue, 30 Aug 2005, Roland Dreier wrote:

I just committed this SRP fix, which should make sure we don't use a device after it's gone. And it actually simplifies the code a teeny bit...

The device could still be used after it's gone. For example:

- the user is configuring SRP via sysfs. The thread in srp_create_target() has just called ib_sa_path_rec_get() [srp.c line 1209] and is waiting for the path record query to complete in wait_for_completion()
- the SA callback, srp_path_rec_completion(), is called. This callback thread will make several verb calls (ib_create_cq, ib_req_notify_cq, ib_create_qp, ...) without any coordination with the hotplug device removal callback, srp_remove_one

Notice that if the SA client's hotplug removal function, ib_sa_remove_one(), ensured that all callbacks had completed before returning, the problem would be fixed. This would protect all ULPs from having to deal with hotplug races in their SA callback function. The fix belongs in the SA client (the core stack), not in SRP.

All the ULPs are deficient with respect to their hotplug synchronization. Given that there is a common problem, doesn't it make sense to try and solve it in a generic way instead of in each ULP?
There are two approaches to device removal to consider - both are required to have a credible solution:

(1) Inform all entities that a planned device removal is to occur and allow them to close gracefully or migrate to alternatives. Ideally, the OS comprehends whether the removal will result in the loss of any critical resources and does not inform or take action unless it knows the removal is something the system can survive. Doing this requires the ULP to register interest with the OS in a particular hardware resource. This also allows the OS to construct a resource analysis tool to determine whether the removal of a device is a good idea or not. This is really outside the scope of an RDMA infrastructure and should be done by the OS through an OS-defined API which is applicable to all types of hardware resources and subsystems.

(2) Design all ULPs to handle surprise removal, e.g. device failure, from the start and allow them to close gracefully or migrate to alternatives. The OS would inform the device driver of the failure if the device driver has not already discovered the problem. The OS would also inform interested parties of the device failure. The device driver would simply error out all users of the device instance - there are already error codes defined for IB and iWARP for this purpose. The associated verbs resources should be released as the ULP closes out its resources through the verbs API (we did define the verbs to clean up resources that the infrastructure may allocate on behalf of the ULP). Activities such as listen entries would be released just like what is done for Sockets, etc. today. (A sketch of the corresponding per-device add / remove hooks follows this message.)

Device addition is simply a matter of informing policy or whatever service management within the OS determines what services should be available on a given device. The device driver really does not need to do anything special. One area to consider is whether a planned migration of a service needs to be supported. This is generally best handled by the ULP, with only a small set of services required of the infrastructure, e.g. get / set of QP / LLP context, and then coordinating any other aspects with the appropriate SM or network services, such as updating address vectors or fabric management / configuration.

In general, the ULPs should already be designed to handle the error condition, and whether they support a managed / planned removal or migration is perhaps the only potential area of deficiency.
Mike 
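
To make the surprise-removal path in (2) concrete, here is a minimal sketch of the per-device add / remove hooks a kernel ULP registers with the IB core so its resources can be torn down when a device disappears. The client name and the two helper functions are hypothetical, and the synchronization detail that matters in this thread - waiting for outstanding callbacks before returning from remove - is only indicated by a comment.

    /* Sketch of a kernel ULP's hotplug hooks, assuming the ib_core client API of that era. */
    #include <rdma/ib_verbs.h>

    static struct ib_client my_ulp_client;

    extern void *my_ulp_alloc_state(struct ib_device *dev);   /* hypothetical helper */
    extern void my_ulp_teardown_state(void *state);           /* hypothetical helper */

    static void my_ulp_add_one(struct ib_device *device)
    {
        /* Allocate per-device state (CQs, QPs, listens, ...) and stash it. */
        void *state = my_ulp_alloc_state(device);
        ib_set_client_data(device, &my_ulp_client, state);
    }

    static void my_ulp_remove_one(struct ib_device *device)
    {
        void *state = ib_get_client_data(device, &my_ulp_client);

        /* Error out users and wait for in-flight callbacks before freeing. */
        my_ulp_teardown_state(state);
    }

    static struct ib_client my_ulp_client = {
        .name   = "my_ulp",
        .add    = my_ulp_add_one,
        .remove = my_ulp_remove_one,
    };

    /* Registered at module init with ib_register_client(&my_ulp_client). */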


Re: [openib-general] Re: uverbs comp events

2005-08-22 Thread Michael Krause


At 11:11 AM 8/19/2005, Roland Dreier wrote:

Arlin> Yes, this is certainly another option; albeit one that requires more system resources. Why not take full advantage of the FD resource we already have? It's your call, but uDAPL and other multi-thread applications could make good use of a wakeup feature with these event interfaces. An event model that allows users to create events and get events but requires them to use side band mechanisms to trigger the event seems incomplete to me.

I disagree. Right now the CQ FD is a pretty clean concept: you read CQ events out of it. If you want to trigger a CQ event, then you could post a work request to a QP that generates a completion event. Adding a new system call for queuing synthetic events seems like growing an ugly wart to me.

If we look at the analogous design of a multi-threaded network server, where a thread might block waiting for input on a socket, we see that there's no system call to inject synthetic data into a network socket.

I'd rather fix the uDAPL design instead of adding ugliness to the kernel to work around it.
Please take a look at the Sockets API Extensions standard that was published quite a while back to ensure that the infrastructure can support this API as well. The API was developed by a set of Sockets developers and addresses a number of concerns for async communications, event management, explicit memory management, etc. It is also well suited to having SDP transparently implemented underneath it.
Mike
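
Roland's point about not injecting synthetic CQ events can be illustrated with an ordinary poll() loop: the CQ's completion-channel fd and a separate, application-owned wakeup fd (the read end of a pipe, for example) sit side by side, so a proxy such as uDAPL can wake a blocked thread without any new verbs system call. A minimal sketch under those assumptions, not code from this thread:

    /* Sketch: block on a CQ's completion channel fd and an app-level wakeup fd. */
    #include <poll.h>
    #include <unistd.h>
    #include <infiniband/verbs.h>

    void wait_cq_or_wakeup(struct ibv_comp_channel *chan, int wake_fd)
    {
        struct pollfd fds[2] = {
            { .fd = chan->fd, .events = POLLIN },  /* CQ notification arrives here */
            { .fd = wake_fd,  .events = POLLIN },  /* another thread writes here to wake us */
        };

        poll(fds, 2, -1);

        if (fds[0].revents & POLLIN) {
            struct ibv_cq *cq;
            void *ctx;
            ibv_get_cq_event(chan, &cq, &ctx);     /* will not block now */
            ibv_ack_cq_events(cq, 1);
            /* re-arm with ibv_req_notify_cq() and drain the CQ with ibv_poll_cq() */
        }
        if (fds[1].revents & POLLIN) {
            char buf[8];
            read(wake_fd, buf, sizeof buf);        /* consume the wakeup token */
        }
    }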


RE: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator

2005-08-19 Thread Michael Krause


At 08:04 AM 8/19/2005, Yaron Haviv wrote:

-Original Message-
From: Christoph Hellwig [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 19, 2005 10:22 AM
To: Roland Dreier
Cc: Yaron Haviv; Christoph Hellwig; Grant Grundler; open-[EMAIL PROTECTED]; openib-general@openib.org
Subject: Re: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator

On Thu, Aug 18, 2005 at 09:24:24PM -0700, Roland Dreier wrote:

Yaron> Not everyone wants to keep on doing target discovery with Python scripts,

Come on, this is just a stupid statement. The whole point of putting device management in userspace is so that everybody has the flexibility to use whatever discovery mechanism they want.

And just FYI, if you ever want an iSER implementation merged it will have to work the same way. Look at how the open-iscsi TCP initiator does it.

Good point - the high-level functionality in iSER is all done in Open-iSCSI and its userspace extensions; iSER just deals with the data transfer and is layered under Open-iSCSI. By the way, can you point me to the iSCSI HBA that delivers better performance, latency, and memory consumption? And what about the price of that HBA and the attached 10GbE switch?

Is any of this really relevant? The focus here is open source and creating an RDMA infrastructure for ULPs to use. The market will decide whether a given technology survives or not. It isn't up to the open source community. Please take personal opinions about whether a technology will succeed elsewhere.
Mike


Re: [openib-general] Re:[ULP] how to choose appropriate ULPs for application

2005-07-13 Thread Michael Krause


At 06:39 AM 7/13/2005, James Lentini wrote:

On Tue, 12 Jul 2005, xg wang wrote:

Frankly speaking, I can not distinguish the functions of SDP and DAPL. Since Lustre is a file system, it runs in the kernel. So I think maybe kDAPL is better.

SDP stands for the Sockets Direct Protocol. The protocol is designed to support the Berkeley Sockets API. This allows code already using the Sockets API to easily use InfiniBand by simply changing the socket type.

One clarification: SDP supports both synchronous and asynchronous sockets. The OpenGroup ICSC released the sockets extensions API a number of months back that enables the full performance provided by SDP to be tapped. The SDP specifications can be found at the IBTA web site for IB and at the RDMAC web site for RNICs.

kDAPL is the kernel Direct Access Provider Library. It is an API that supports RDMA networks (InfiniBand, iWARP, etc.).

But for a ULP application, what are the advantages and disadvantages of SDP and DAPL? When you implement an application, will you use SDP or DAPL, and why? I just wonder about the difference between them from the application's point of view.

First off, SDP is a protocol and kDAPL is an API. Since SDP is a protocol, you will only be able to communicate with other nodes that implement SDP.

Another thing to consider is the differences in the APIs. SDP is accessed with the traditional Sockets API. This makes porting applications to it easy, but doesn't give you much fine-grained control over how the RDMA network is used.
Sockets by definition is interconnect and topology independent. Network controls are managed separately. The best that an application should do is signal its requirements, e.g. using DiffServ or a similar standard.
kDAPL was designed specifically for RDMA networks with lots of features that allow you to control how the network is used. This is good if you are writing new code, but means that old code needs substantial porting.

Ideally, applications stay out of such decisions. Middleware's job is to handle application optimization, etc., so that the end consumer stays as ignorant as possible and thus focused on their application's needs, not the network's. The middleware API - whether DAPL, IT API, RNIC PI, whatever - can provide the hooks needed to manage the usage from a given endnode's perspective. But even here, the real network management - what routes are actually used, the arbitration for QoS, etc. - should be outside of the middleware's control. It simply manages a set of local resources and allows the fabric management to do the rest. There is more to this than that, but that is how IB was constructed, which is no different in many respects from how IP works as well.
Mike
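
To make the "simply changing the socket type" point concrete, here is a sketch of the convention commonly used by Linux SDP stacks: the application opens the socket with an SDP address family instead of AF_INET, and the rest of its Sockets code is untouched. The AF_INET_SDP constant, its value, and whether the sockaddr family must also change are implementation assumptions, not part of any Sockets standard; the same effect is often obtained transparently by preloading an SDP library under an unmodified binary.

    /* Sketch: a TCP client becomes an SDP client by changing the address family. */
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #ifndef AF_INET_SDP
    #define AF_INET_SDP 27          /* assumed value; check the SDP stack's headers */
    #endif

    int sdp_connect_example(const char *ip, unsigned short port)
    {
        struct sockaddr_in addr;
        int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);   /* AF_INET here for plain TCP */

        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;                      /* some stacks expect AF_INET_SDP here too */
        addr.sin_port   = htons(port);
        inet_pton(AF_INET, ip, &addr.sin_addr);

        return connect(fd, (struct sockaddr *)&addr, sizeof addr);
    }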


Re: [openib-general] Re:[ULP] how to choose appropriate ULPs for application

2005-07-13 Thread Michael Krause


At 11:18 AM 7/13/2005, James Lentini wrote:

On Wed, 13 Jul 2005, Michael Krause wrote:

At 06:39 AM 7/13/2005, James Lentini wrote:

kDAPL was designed specifically for RDMA networks with lots of features that allow you to control how the network is used. This is good if you are writing new code, but means that old code needs substantial porting.

Ideally, applications stay out of such decisions. Middleware's job is to handle application optimization, etc., so that the end consumer stays as ignorant as possible and thus focused on their application's needs, not the network's. The middleware API - whether DAPL, IT API, RNIC PI, whatever - can provide the hooks needed to manage the usage from a given endnode's perspective. But even here, the real network management - what routes are actually used, the arbitration for QoS, etc. - should be outside of the middleware's control. It simply manages a set of local resources and allows the fabric management to do the rest. There is more to this than that, but that is how IB was constructed, which is no different in many respects from how IP works as well.

Let me clarify: kDAPL users can specify exactly how data is transferred (SEND, RDMA write, RDMA read), how completion events are processed, how memory is registered, etc. This is the network control I was referring to. In retrospect, it would be more correct to refer to this as adapter control.
Just to nit-pick this a bit. The ULP determines what type of operation to use - SEND, RDMA Write, RDMA Read, or Atomic (where supported by the interconnect). The middleware API or the verbs API provides an interface that abstracts the IHV hardware specifics from the ULP, allowing the ULP to be implemented across a variety of technologies. Thus, the API is just an abstraction of the underlying semantics and does not make decisions or provide much in the way of controls in any regard, unless additional value-add beyond the underlying hardware semantics is transparently implemented within the API implementation itself. For example, SDP defines specifically when to use a SEND for control operations and when to use an RDMA for a zcopy operation. SDP itself does not care what underlying API is used to access the associated hardware resources, but it does define what resources and associated services are used. SDP can be implemented directly on the verbs API (just like MPI) and operate quite nicely without an additional middleware API in the execution path.

Apologies for the nit-pick, but the API does not provide any type of control other than to act as an abstraction funnel between the ULP / application and the underlying hardware. There are opportunities to provide transparent value-add controls, but I don't believe this open source effort is focused on those at this time.
Mike
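
As a concrete illustration of "the ULP determines what type of operation to use" while the verbs API merely carries that choice to the hardware, here is a sketch using the libibverbs work-request structure. The buffer, lkey, remote address and rkey are assumed to have been registered and exchanged elsewhere; this is not code from the SDP or MPI implementations mentioned above.

    /* Sketch: the opcode field is where the ULP's SEND vs RDMA Write choice lands. */
    #include <stdint.h>
    #include <infiniband/verbs.h>

    int post_one(struct ibv_qp *qp, uint64_t laddr, uint32_t len, uint32_t lkey,
                 uint64_t raddr, uint32_t rkey, int use_rdma_write)
    {
        struct ibv_sge sge = { .addr = laddr, .length = len, .lkey = lkey };
        struct ibv_send_wr wr = { 0 }, *bad;

        wr.wr_id      = 1;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;

        if (use_rdma_write) {
            wr.opcode              = IBV_WR_RDMA_WRITE;   /* zero-copy data placement */
            wr.wr.rdma.remote_addr = raddr;
            wr.wr.rdma.rkey        = rkey;
        } else {
            wr.opcode = IBV_WR_SEND;                      /* untagged / control message */
        }
        return ibv_post_send(qp, &wr, &bad);
    }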


Re: [openib-general] Re: IP addressing on InfiniBand networks (Caitlin Bestler)

2005-07-11 Thread Michael Krause


At 06:37 AM 7/11/2005, James Lentini wrote:

On Tue, 5 Jul 2005, Michael Krause wrote:

The intention was to allow one to manage the fabric by having mapping functions from traditional IP management applications to IB GIDs to minimize the amount of work to enable IB within a solution.

I was unaware of this. What happened to the mapping functions? What did the API look like and how was it going to be implemented?

The IBTA specs are not API specifications. They define semantics and wire protocols. As for what was envisioned and guided the spec: most data center management applications understand a variety of management objects that represent various IP-based attributes. A GID is close enough to an IPv6 address that many of these objects could have been easily modified via a plug-in such that IB would have been easily slid into the associated management applications. Investigations occurred, e.g. into providing the necessary plug-ins for OpenView, Tivoli, etc. This is one area where IB is insufficient in terms of a viable ecosystem, since there really isn't any way to manage IB in the enterprise without requiring significant amounts of training and a new tool chain to be deployed.
Mike
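
The "close enough to an IPv6 address" observation is easy to see in code: a GID is a 128-bit value, so the standard IPv6 text-presentation routine formats it directly, even though it is not a routable IP address. A small sketch, assuming a libibverbs context and port 1:

    /* Sketch: render a port GID with the standard IPv6 presentation routine. */
    #include <stdio.h>
    #include <arpa/inet.h>
    #include <infiniband/verbs.h>

    void print_port_gid(struct ibv_context *ctx)
    {
        union ibv_gid gid;
        char buf[INET6_ADDRSTRLEN];

        if (ibv_query_gid(ctx, 1, 0, &gid))    /* port 1, GID index 0 */
            return;

        /* Display only: a GID is 128 bits, but it is not an Internet IPv6 address. */
        inet_ntop(AF_INET6, gid.raw, buf, sizeof buf);
        printf("port 1 GID[0] formats as %s\n", buf);
    }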



Re: [openib-general] vendor_id, vendor_part_id, hw_ver

2005-07-11 Thread Michael Krause


At 04:20 PM 7/8/2005, Kevin Reilly wrote:

Mike,
Ideally you're right: the ULP shouldn't care what HCA it's running on. There are some practical reasons why a ULP might want to know the vendor and part number it is using, such as for debug or for taking advantage of a performance nuance of a particular HCA.

Create a private interface, since it is unique per HCA. This type of information was rejected during the IB and iWARP verbs creation as something best handled out of band / out of spec. Please do not push for this to be in a ULP itself or in any standard RDMA API, since it isn't required for all - it is most likely very rare, given it was rejected by all companies involved in creating these technologies to date.
Mike


Re: [openib-general] vendor_id, vendor_part_id, hw_ver

2005-07-08 Thread Michael Krause


At 02:00 PM 7/8/2005, Roland Dreier wrote:

Kevin> Is OpenIB going to do anything to enumerate the vendor_id, vendor_part_id, and hw_ver in a common header file, or is it the responsibility of the ULP running on top of the lib to understand these values?

I don't have anything planned to enumerate those values. I would guess that only a very few ULPs should even look at them.

Why would this ever have to be examined by a ULP? This seems like a low-level driver issue. Given that the verbs semantics abstract the hardware, aside from the HCA-specific driver, the rest of the stack should be unaware of anything below.
Mike
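
For completeness, the fields being debated are already reachable through the device-attributes query, as sketched below with libibverbs names; reading them is something a diagnostic tool might do, which is exactly the kind of use best kept out of ULPs and standard APIs.

    /* Sketch: where vendor_id / vendor_part_id / hw_ver surface in the verbs API. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    void dump_device_ids(struct ibv_context *ctx)
    {
        struct ibv_device_attr attr;

        if (ibv_query_device(ctx, &attr))
            return;

        /* Useful for bug reports or out-of-band tooling, not for ULP logic. */
        printf("vendor 0x%x part 0x%x hw_ver 0x%x fw %s\n",
               attr.vendor_id, attr.vendor_part_id, attr.hw_ver, attr.fw_ver);
    }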


RE: [openib-general] [iser]about the target

2005-07-06 Thread Michael Krause


At 06:14 AM 7/6/2005, Rimmer, Todd wrote:

I would like to clarify the comment on SRP. There are companies presently shipping and demonstrating SRP native IB storage, for example: Engenio (formerly LSI), Raytheon, Data Direct, Mellanox.

SRP was designed for highly optimized storage access across an RDMA-capable transport, and hence is capable of very high performance.

Longer term, the storage vendors anticipate that iSCSI will be the focus for long haul (remote backup, etc.) type solutions, while IB Native Storage and FC will be the focus for data center high performance storage solutions.

iSCSI within the data center is quite real for many applications, especially blades where there is already an Ethernet interface. The reason for iSER was to take advantage of the iSCSI ecosystem while providing an RDMA-focused data mover. SRP is primarily a data mover and does not define the rest of the management, etc. interfaces. When used to move data to FC, the FC infrastructure is leveraged, but SRP by itself does nothing really in this regard. No one stated that SRP could not deliver performance, only that the rest of the infrastructure is not defined and must rely upon other standards / plug-ins / etc. I do not want to get into a vision / marketing debate - I was just explaining why we created iSER instead of just enhancing SRP.
Mike


Re: [openib-general] Re: IP addressing on InfiniBand networks (Caitlin Bestler)

2005-07-05 Thread Michael Krause


At 10:49 AM 6/30/2005, Roland Dreier wrote:

Michael> Being the person who led the addressing definition for IB, I can state quite clearly that GIDs are NOT IPv6 addresses. They were intentionally defined to have a similar look-n-feel since they were derived in large part from Future I/O, which had them as real IPv6 addresses. But again, they are NOT IPv6 addresses.

The IBA spec seems to have a different idea. In fact chapter 4 says:

    A GID is a valid 128-bit IPv6 address (per RFC 2373)

I wrote the original spec text here. It was supposed to be updated to clarify that the rest of the sentence, i.e. the additional rules, etc., makes a GID not a real IPv6 address from the IETF's perspective but something quite close. The intention was to allow one to manage the fabric by having mapping functions from traditional IP management applications to IB GIDs to minimize the amount of work to enable IB within a solution.
Mike


Re: [openib-general] [iser]about the target

2005-07-05 Thread Michael Krause


At 06:07 PM 7/4/2005, Ian Jiang wrote:

Hi!
I am new to the iSER. On https://openib.org/tiki/tiki-index.php?page=iSER, it is said that iSER currently contains an initiator only (no target). Will the target come out later? How did they test the iSER initiator without an iSER target? Could you give some explanation?

From a practical perspective, there are very few iSCSI targets shipping today. Most people had envisioned iSER over IB to a gateway Ethernet device, since native IB storage is also quite rare in terms of real product. For many of us, our push for iSER over IB was to replace SRP, which has a deficient ecosystem and thus is not really used beyond some basic Fibre Channel gateway cards.

Mike
Thanks!

Ian Jiang
[EMAIL PROTECTED]

Computer Architecture Laboratory
Institute of Computing Technology
Chinese Academy of Sciences
Beijing,P.R.China
Zip code: 100080
Tel: +86-10-62564394(office)
_
ÓëÁª»úµÄÅóÓѽøÐн»Á÷£¬ÇëʹÓà MSN Messenger:

http://messenger.msn.com/cn  

Re: [openib-general] Re: IP addressing on InfiniBand networks (Caitlin Bestler)

2005-06-30 Thread Michael Krause


At 10:39 PM 6/29/2005, Bill Strahm wrote:

Date: Wed, 29 Jun 2005 09:00:37 -0700
From: Roland Dreier [EMAIL PROTECTED]
Subject: Re: [openib-general] IP addressing on InfiniBand networks
To: Caitlin Bestler [EMAIL PROTECTED]
Cc: Lentini, James [EMAIL PROTECTED], Christoph Hellwig [EMAIL PROTECTED], openib-general openib-general@openib.org

Caitlin> An assigned GID meets all of the requirements for an IA Address. I think taking advantage of that existing capability is just one of many options that can be done by the IB CM rather than forcing IB specific changes up to the application layer.

Just to be clear, the IBA spec is very clear that a GID _is_ an IPv6 address.

- R.

Just to be REALLY clear - IANA has not allocated IPv6 address space to any InfiniBand entities, so they are not Internet IPv6 addresses. GIDs are formatted like IPv6 addresses but in no sense should they EVER be used as an IP layer 3 address.

From an IPoIB stance, IB is just a very over-engineered Layer 2 that has a singularly large MAC address.

From an IB ULP point of view, how you get to the layer 2 address that is needed to perform communications that do not include IP isn't a problem - but let me tell you, the leadership of the IETF is scared of IB because of statements like "IB GIDs _ARE_ IPv6 addresses."
Being the person who led the addressing definition for IB, I can state quite clearly that GIDs are NOT IPv6 addresses. They were intentionally defined to have a similar look-n-feel since they were derived in large part from Future I/O, which had them as real IPv6 addresses. But again, they are NOT IPv6 addresses.

For IP over IB, it is unfortunate that we could not have simply used a raw datagram service, as that would have made life very simple, but we are in the state we are in, so that means there is a UD transport providing a layer 2 Ethernet equivalent of functionality.
Mike


Re: [openib-general] mapping between IP address and device name

2005-06-28 Thread Michael Krause


At 10:30 AM 6/24/2005, Roland Dreier wrote:

Thomas> As I said - I am not attached to ATS. I would welcome an alternative.

Sure, understood. I'm suggesting a slight tweak to the IB wire protocol. I don't think there's a difference in the security provided, and carrying the peer address in the CM private data avoids a lot of the conceptual and implementation difficulties of ATS.

Thomas> But in the absence of one, I like what we have. Also, I do not want to saddle the NFS/RDMA transport with carrying an IP address purely for the benefit of a missing transport facility. After all, NFS/RDMA works on iWARP too.

I'm not sure I understand this objection. We wouldn't be saddling the transport with anything -- simply specifying in the binding of NFS/RDMA to IB that certain information is carried in the private data fields of the CM messages used to establish a connection. Clearly iWARP would use its own mechanism for providing the peer address. This would be exactly analogous to the situation for SDP -- obviously SDP running on iWARP does not use the IB CM to exchange IP address information in the same way that SDP over IB does.
Actually, SDP on iWARP uses the SDP port mapper protocol to comprehend the IP address / port tuples used on both sides of the communication before the connection is established (this protocol could be used by any mapping service since it is implemented on top of UDP, so it could be re-used by other subsystems like NFS). The TCP transport then connects normally, and one can ask it for the IP address / port tuple that is really being used. The port mapper may be viewed as akin to the SID protocol defined for IB. The SDP hello is then exchanged in the byte stream, as opposed to via the IB CM.

The port mapper supports both centrally managed and distributed usage models, supports the ability to return a different IP address than requested, supports multiple IP addresses per port, etc. One can construct a very flexible infrastructure that supports nearly any type of mapping one desires, to the same or different hardware or endnodes. It is fairly lightweight and can support caching of data for a period of time or even a one-shot connection attempt.
Mike 
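
The "ask it for the IP address / port tuple that is really being used" step maps onto ordinary Sockets calls, sketched below in a generic form; nothing here is specific to an SDP or port-mapper implementation.

    /* Sketch: after connect(), recover the local and peer tuples actually in use. */
    #include <stdio.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    void show_tuples(int fd)
    {
        struct sockaddr_in local, peer;
        socklen_t llen = sizeof local, plen = sizeof peer;
        char lbuf[INET_ADDRSTRLEN], pbuf[INET_ADDRSTRLEN];

        getsockname(fd, (struct sockaddr *)&local, &llen);  /* local IP / port */
        getpeername(fd, (struct sockaddr *)&peer, &plen);   /* remote IP / port */

        printf("%s:%u <-> %s:%u\n",
               inet_ntop(AF_INET, &local.sin_addr, lbuf, sizeof lbuf), ntohs(local.sin_port),
               inet_ntop(AF_INET, &peer.sin_addr, pbuf, sizeof pbuf), ntohs(peer.sin_port));
    }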


Re: [openib-general] How about ib_send_page() ?

2005-06-07 Thread Michael Krause


At 12:13 PM 6/3/2005, Sean Hefty wrote:

Fab Tillier wrote:

Ok, so this question is from a noob, but here goes anyway. Why can't IPoIB advertise a larger MTU than the UD MTU, and then just fragment large IP packets if they need to go over the IB UD transport? Is there any reason this couldn't work? If it does, it allows IPoIB to expose a single MTU to the OS, and take care of the rest under the covers. Just a thought.

I don't remember seeing a response to this. Something like this could work. I guess one disadvantage is that it can be less efficient if you lose a lot of packets. TCP would resend an entire MTU (as seen by TCP) of data if only a single IB packet were lost.

Why not just use the IETF draft for RC / UC based IP over IB and not worry about creating something new?
Mike


RE: [openib-general] How about ib_send_page() ?

2005-06-07 Thread Michael Krause


At 09:28 AM 6/7/2005, Fab Tillier wrote:

From: Roland Dreier [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 07, 2005 8:38 AM

Michael> Why not just use the IETF draft for RC / UC based IP over IB and not worry about creating something new?

I think we've come full circle. The original post was a suggestion on how to handle the fact that the connected-mode IPoIB draft requires a network stack to deal with different MTUs for different destinations on the same logical link.

That's right - by implementing IP segmentation in the IPoIB driver when going over UD, the driver could expose a single MTU to the network stack, thereby removing all the issues related to having per-endpoint MTUs. Keeping a 2K MTU for RC mode doesn't really take advantage of IB's RC capabilities. I'd probably target 64K as the MTU.

The draft should state a minimum for all RC / UC, which should be the TCP MSS. Whether one does SAR over a UD endpoint independent of the underlying physical MTU can be done, but it should not require end-to-end understanding of the operation, i.e. the send side tells its local stack that the TCP MSS is X while the receive side only posts 2-4 KB buffers. This has been done over Ethernet for years.
Mike
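
For readers who want to see what "expose a single, larger MTU to the network stack" eventually looked like in practice, the sketch below switches an IPoIB interface into connected mode and raises its MTU. The interface name, the sysfs path, and the 65520-byte figure are assumptions about later Linux IPoIB implementations of this draft, not something defined in this thread.

    /* Sketch: flip an IPoIB interface to connected mode, then raise its MTU. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <unistd.h>

    int ipoib_connected_mode(const char *ifname, int mtu)
    {
        char path[128];
        struct ifreq ifr;
        FILE *f;
        int fd;

        /* Assumed sysfs knob exposed by the IPoIB driver. */
        snprintf(path, sizeof path, "/sys/class/net/%s/mode", ifname);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fputs("connected\n", f);
        fclose(f);

        /* Standard MTU ioctl; e.g. mtu = 65520 once connected mode is active. */
        fd = socket(AF_INET, SOCK_DGRAM, 0);
        memset(&ifr, 0, sizeof ifr);
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_mtu = mtu;
        ioctl(fd, SIOCSIFMTU, &ifr);
        close(fd);
        return 0;
    }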


Re: [Rdma-developers] Re: [openib-general] OpenIB and OpenRDMA: Convergence on common RDMAAPIs and ULPs for Linux

2005-05-31 Thread Michael Krause


At 06:47 AM 5/28/2005, Christoph Hellwig wrote:

On Sat, May 28, 2005 at 05:17:54AM -0700, Sukanta ganguly wrote:

That's a pretty bold statement. Linux grew up to be popular via mass acceptance. Seems like that charter has changed and a few have control over Linux and its future. The "my way or the highway" philosophy has gotten embedded in the Linux way of life. Life is getting tough.

You're totally missing the point. Linux is successful exactly because it's looking for the right solution, not something the business people need short-term.
Hence why some of us contend that the end-game, i.e. the right solution,
is not necessarily the short-term implementation that is present today
that just evolves creating that legacy inertia that I wrote about
earlier. I think there is validity to having an implementation to
critique - accept, reject, modify. I think there is validity to
examining industry standards as the basis for new work /
implementation. If people are unwilling to discuss these standards
and only stay focused on their business people's short-term needs, then
some might contend as above that Linux is evolving to be much like the
dreaded Pacific NW company in the end. Not intending to offend
anyone but if there can be no debate without implementation on what is
the right solution, then people might as well just go off and implement
and propose their solution for incorporation into the Linux kernel.
It may be that OpenIB wins in the end or it may be that it
does not. Just having OpenIB subsume control of anything iWARP or
impose only DAPL for all RDMA infrastructure because it just happens to
be there today seems rather stifling. Just stating that some OpenIB
steering group is somehow empowered to decide this for Linux is also
rather strange. Open source is about being open and not under the
control of any one entity in the end. Perhaps that is no
longer the case.
Mike


Re: [Rdma-developers] Re: [openib-general] OpenIB and OpenRDMA: Convergence on common RDMA APIs and ULPs for Linux

2005-05-27 Thread Michael Krause


At 06:40 AM 5/27/2005, Sukanta ganguly wrote:

Venkata,
How will that work? If the RNIC offloads RDMA and TCP completely from the Operating System and does not share any state information, then the application running on the host will never be in a position to utilize the socket interface to use the communication logic to send and receive data between the remote node and itself. Some information needs to be shared. How much of it and what exactly needs to be shared is the question.

OK. It all depends upon what level of integration / interaction a TOE, and thus an RNIC, will have with the host network stack. For example, if a customer wants to have TCP and IP stats kept for the off-loaded stack even if it is just being used for RDMA, then there needs to be a method defined to consolidate these stats back into the host network stack tool chain. Similarly, if one wants to maintain a single routing table to manage, etc. on the host, then the RNIC needs to access / update that information accordingly. One can progress through other aspects of integration, e.g. connection management, security interactions (e.g. DoS protection), and so forth. What is exposed again depends upon the level of integration and how customers want to manage their services. This problem also exists for IB, but most people have not thought about this from a customer perspective and how to integrate the IB semantics into the way customers manage their infrastructures, do billing, etc. For some environments, they simply do not care, but if IB is to be used in the enterprise space, then some thought will be required here since most IT shops don't see anything as being free or self-managed.

Again, Sockets is an application API and not how one communicates to a TOE or RDMA component. The RNIC PI has been proposed as an interface to the RDMA functionality. The PI supports all of the iWARP and IB v1.2 verbs.
Mike

Thanks
SG
--- Venkata Jagana [EMAIL PROTECTED] wrote:

[EMAIL PROTECTED] wrote on 05/25/2005 09:47:00 PM:

Venkata,
Interesting coincidence: I was talking with someone (at HP) today who knows substantially more than I do about RNICs. They indicated RNICs need to manage TCP state on the card from userspace. I suspect that's only possible through a private interface (e.g. ioctl() or /proc) or the non-existent (in kernel.org) TOE implementation. Is this correct?

Not correct.

Since RNICs are offloaded adapters with RDMA protocols layered on top of a TCP stack, they do maintain the TCP state internally, but they do not expose it to the host. RNICs expose only the RNIC Verbs interface to the host, not a TOE interface.

Thanks
Venkat

hth,
grant
 
 
 


Re: [Rdma-developers] Re: [openib-general] OpenIB and OpenRDMA: Convergence on common RDMA APIs and ULPs for Linux

2005-05-27 Thread Michael Krause


At 09:29 AM 5/27/2005, Grant Grundler wrote:

On Fri, May 27, 2005 at 07:24:44AM -0700, Michael Krause wrote:
...
Again, Sockets is an application API and not how one communicates to a TOE or RDMA component.

Mike,
What address family is used to open a socket over iWARP? AF_INET? Or something else?

TCP = AF_INET. Address family != Sockets. Sockets is an API that can operate over multiple address families. An application can be coded to Sockets, IT API, DAPL, or a verbs interface like the RNIC PI. It is a matter of choice as well as what is trying to be accomplished. The RNIC PI is an acceptable interface for any RDMA-focused ULP. There are pros / cons to using such a verbs interface directly, but I do not believe anyone can deny that a general-purpose verbs API is a good thing at the end of the day, as it works for the volume verbs definition. Whether one applies further hardware semantics abstraction such as IT API / DAPL should be a choice for the individual subsystem, as there is no single right answer across all subsystems. Attempting to force fit isn't practical.

I understand most of what you wrote but am still missing one bit: how is the RNIC told what the peer IP is it should communicate with?

The destination address (IB GID or IP) is derived from the CM services. This is where the two interconnects differ in what is required to physically inject a packet on the wire. This is why I call it out as separate from the verbs interface and something that could be abstracted to some extent but, at the end of the day, really requires the subsystem to understand the underlying fabric type to make some intelligent choices. Given this effort is still nascent, most of the issues beyond basic bootstrap have not really been discussed as yet.

The RNIC PI has been proposed as an interface to the RDMA functionality. The PI supports all of the iWARP and IB v1.2 verbs.

That's good. Folks from the RDMA Consortium will have to look at openib implementations and see what's missing/wrong. Then submit proposals to fill in the gaps. I'm obviously not the first one to say this.

There are two open source efforts. The question is whether to move to a single effort (I tried to get this to occur before OpenIB was formally launched, but it seemed to fall on deaf ears for TTM marketing purposes) or whether to just coordinate on some of the basics. My preference remains that the efforts stay strictly focused on the RDMA infrastructure and interconnect-specific components and leave the ULPs / services as separate efforts which will make their own decisions on how best to interface with the RDMA infrastructure.

I expect most of the principals involved with openib.org do NOT have time to browse through the RNIC PI at this point. They are struggling to get openib.org filled in sufficiently so it can go into a commercial distro (RH/SuSE primarily).

Hence why OpenRDMA needs to get source being developed to enable the RNIC community. If people find value in the work, then people can look at finding the right solution for both IB and iWARP when it makes sense.

Revenue for them comes from selling IB equipment. Having openib.org code in kernel.org is a key enabler for getting into commercial distros. I expect the same is true for RNIC vendors as well. RNIC vendors (and related switch vendors) will have to decide which path is the right one for them to get the support into kernel.org. Several openib.org people have suggested one (like I have). RNIC folks need to listen and decide if the advice is good or not. If RNIC folks think they know better, then please take another look at where openib.org is today and where the RDMA Consortium is. I'm certain openib.org would be dead now if policy and direction changes had not been made last year, as demanded by several key Linux developers and users (Gov Labs).

Understood.
Mike

