How smart is the SM when it generates path records?  I keep running into issues 
where the SM can't handle the PR query traffic when running large MPI jobs.  In 
many cases, it even falls over and dies.  Not a pretty situation.  SM 
transaction rate (or lack thereof) is what causes IBAT to fail, and sometimes 
causes IPoIB to fail.  The failure sequence for IPoIB goes something like this:

1. IPoIB gets an ARP request, reports it to Windows.
2. Windows sends an ARP response to IPoIB.
3. IPoIB needs to create an AV to send the ARP response.  It queries the SM for 
a PR so that it can fill the AV attributes.
4. PR query times out, IPoIB tells Windows it's hung.
5. Windows resets IPoIB.
6. IPoIB does a port info query to the SA.
7. Port info query times out.
8. IPoIB gives up, logs an event to the event log, and goes into a 'cable 
disconnected' state.  It remains in this state until it gets an SM reregister 
request.

At this point, the node is unusable for IB traffic.  You have to either:
a) restart OpenSM so that every node gets an SM reregister request, or
b) disable/enable IPoIB so that it tries again (hopefully with better luck).

So I've been making some changes to minimize the dependency on the SM/SA at 
runtime, beyond initial configuration.

Phase 1 of the change is to eliminate path record queries from IPoIB.  Using 
the information from an ARP request or a work completion, along with 
information from the broadcast group, I can create address vectors without 
having to go chat with the SM.  This has shown good results so far.

Here's where I get the various AV attribute parameters:
Service Level: broadcast group
DLID: work completion
GRH Hop Limit: broadcast group
GRH Flow Label: broadcast group
GRH Traffic Class: broadcast group
GRH destination GID: endpoint (from ARP request)
GRH source GID: create subnet local GID using port GUID
Static Rate: broadcast group

The source and destination GIDs are the only fields that are used identically 
to the current path-record-based mechanism.
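To make the mapping concrete, here's a rough sketch of filling the AV
attributes from the values above.  The structs and names are illustrative
stand-ins, not the real IBAL ib_av_attr_t / member record definitions, and
byte ordering is glossed over; the only real constant is the default subnet
prefix used to build the link-local source GID.

    /* Illustrative stand-ins only -- not the real IBAL types, just the fields
     * relevant to the mapping above.  Byte ordering is ignored for clarity. */
    #include <stdint.h>
    #include <string.h>

    typedef struct { uint64_t prefix; uint64_t interface_id; } gid128_t;

    typedef struct {                /* values taken from the broadcast group */
        uint8_t  sl, hop_limit, traffic_class, rate;
        uint32_t flow_label;
    } bcast_info_t;

    typedef struct {                /* the AV attributes IPoIB needs to fill */
        uint8_t  sl, static_rate, hop_limit, traffic_class;
        uint32_t flow_label;
        uint16_t dlid;
        gid128_t src_gid, dest_gid;
    } av_params_t;

    #define IB_LINK_LOCAL_PREFIX 0xfe80000000000000ULL  /* default subnet prefix */

    static void
    fill_av_without_sm( const bcast_info_t *bc, uint16_t wc_remote_lid,
                        const gid128_t *arp_dest_gid, uint64_t port_guid,
                        av_params_t *av )
    {
        memset( av, 0, sizeof(*av) );
        av->sl            = bc->sl;             /* broadcast group */
        av->dlid          = wc_remote_lid;      /* work completion */
        av->hop_limit     = bc->hop_limit;      /* broadcast group */
        av->flow_label    = bc->flow_label;     /* broadcast group */
        av->traffic_class = bc->traffic_class;  /* broadcast group */
        av->static_rate   = bc->rate;           /* broadcast group */
        av->dest_gid      = *arp_dest_gid;      /* endpoint, from the ARP request */
        /* Subnet-local source GID: default subnet prefix + port GUID. */
        av->src_gid.prefix       = IB_LINK_LOCAL_PREFIX;
        av->src_gid.interface_id = port_guid;
    }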

Can anyone find any pitfalls in doing this?  I understand that bandwidth will 
be limited to the rate of the broadcast group, and I think I'm OK with that 
because the fabric is generally homogeneous, and MC group rate problems are 
indicative of some configuration or fabric issue.  What happens to the MC rate 
when you have IB routers in the mix?

Now on to Phase 2: making IPoIB generate path records for IBAT clients rather 
than going to the SM.  The problem here is that if you have enough clients all 
going to the SM at the same time, they all end up in a situation where their 
queries time out and they retry.  I've gone to an exponential backoff for 
retries, and even with a maximum retry interval of 2 minutes, things never got 
past querying for path records.  A local PR cache would help here, and that's 
another option, but then you have issues with stale entries, etc.  So I'd 
rather generate path records in IPoIB, where stale information is less likely 
(since the system will resend an ARP if a sufficient time interval has gone 
by).
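For illustration, a capped exponential backoff along these lines (the 2-minute 
ceiling matches what I tried; the 4-second starting interval is just an 
example, not what the code actually uses):

    #include <stdint.h>

    #define INITIAL_RETRY_MS  4000      /* assumed starting interval */
    #define MAX_RETRY_MS      120000    /* 2-minute ceiling */

    /* Capped exponential backoff: double the wait on every failed SA query. */
    static uint32_t
    next_retry_interval( uint32_t retry_count )
    {
        uint32_t interval = INITIAL_RETRY_MS;
        while( retry_count-- && interval < MAX_RETRY_MS )
            interval <<= 1;
        return (interval > MAX_RETRY_MS) ? MAX_RETRY_MS : interval;
    }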

To create a path record, IPoIB needs the following values (in addition to the 
ones it has access to for the AV creation):
SLID: Can be stored in IPoIB port object (__endpt_mgr_add_local gets it)
Reversible: Hard code to 1
NumbPath: Hard code to 1
PKey: Same as IPoIB port object
MTU: broadcast group
Rate: broadcast group
Packet Life: broadcast group
Preference: 0
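As with the AV sketch above, here's a rough illustration of the idea using 
stand-in types (not the real ib_path_rec_t layout, and ignoring byte order and 
the MTU/rate/lifetime selector bits):

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint64_t prefix; uint64_t interface_id; } gid128_t;

    typedef struct {        /* stand-in for the path record fields IPoIB fills */
        gid128_t dgid, sgid;
        uint16_t dlid, slid, pkey;
        uint8_t  sl, mtu, rate, pkt_life;
        uint8_t  reversible, num_path, preference;
    } path_rec_t;

    static void
    build_path_rec_without_sa( const gid128_t *sgid, const gid128_t *dgid,
                               uint16_t slid, uint16_t dlid, uint16_t pkey,
                               uint8_t bc_sl, uint8_t bc_mtu, uint8_t bc_rate,
                               uint8_t bc_pkt_life, path_rec_t *pr )
    {
        memset( pr, 0, sizeof(*pr) );
        pr->sgid       = *sgid;        /* same GIDs used for the AV */
        pr->dgid       = *dgid;
        pr->slid       = slid;         /* stored in the IPoIB port object */
        pr->dlid       = dlid;         /* from the work completion */
        pr->pkey       = pkey;         /* same as the IPoIB port object */
        pr->reversible = 1;            /* hard coded */
        pr->num_path   = 1;            /* hard coded */
        pr->sl         = bc_sl;        /* broadcast group */
        pr->mtu        = bc_mtu;       /* broadcast group */
        pr->rate       = bc_rate;      /* broadcast group */
        pr->pkt_life   = bc_pkt_life;  /* broadcast group */
        pr->preference = 0;
    }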

Again, any glaring issues with doing this?

Thanks,
-Fab

