Re: [openib-general] SA cache design

2006-01-16 Thread Sean Hefty

Eitan Zahavi wrote:

[EZ] Having N^2 messages is not a big problem if they do not all go to one
target...
CM is distributed and this is good. Only the PathRecord section of the
connection establishment goes today to one node (the SA), and you are
about to fix it...


I expect that we'll start having issues scaling when the number of nodes starts 
to exceed the size of the CM's QP.  Your idea below should help.



During initial connection setup you will not have anything in the SA
cache, and thus the SA will need to answer N^2 PathRecords. Smart
exponential back-off can resolve that DoS attack on the SA at bring-up.


I'll post the code for the cache once I complete my testing, but it issues a 
single query to fill the cache.  The SA will only see O(n) requests.  The cache 
also supports an update delay, or settle time, and a minimum update time to 
prevent spamming the SA with back-to-back requests.
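
Roughly, that throttling could work like the following sketch; the names and 
the plain wall-clock timer are assumptions for illustration, not the actual 
sa_cache code:

    #include <stdbool.h>
    #include <time.h>

    struct sa_cache_timing {
        time_t settle_time;   /* delay after a change before refreshing */
        time_t min_update;    /* minimum interval between refreshes */
        time_t last_update;   /* when the cache was last refreshed */
        time_t last_change;   /* when the last fabric change was seen */
    };

    /* Called periodically; returns true when a refresh query may be issued. */
    static bool sa_cache_may_update(struct sa_cache_timing *t)
    {
        time_t now = time(NULL);

        if (now - t->last_change < t->settle_time)
            return false;   /* let back-to-back changes settle first */
        if (now - t->last_update < t->min_update)
            return false;   /* don't spam the SA */
        t->last_update = now;
        return true;
    }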



[EZ] We might need a little more in the key for QoS support (to come).


This would need to be exposed through our APIs as well.  Alternate paths are 
also not yet supported.



[EZ] I would try and make sure the connections are not done in a manner
such that all nodes try to establish connections to a single node at the
same time. This is an application issue but can be easily resolved.


I agree.


[EZ] I think a centralized CM is only going to make things worse.


It can reduce the number of messages on the network from O(n^2) to O(n).  The 
idea is that instead of all nodes sending connection requests to all other 
nodes, they send a single connection request -- containing an array of QP 
information -- to one node.  (The array could be sent over an established 
connection, rather than in MADs.)  The amount of traffic to that one node should 
be only slightly worse than the all-to-all case.
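
To make the idea concrete, the batched request might carry a payload along 
these lines; the field names and layout are invented for illustration, not a 
defined protocol:

    #include <stdint.h>

    struct qp_conn_info {
        uint32_t qpn;           /* QP number to connect to */
        uint32_t starting_psn;
        uint16_t pkey;
        uint8_t  port_num;
        uint8_t  reserved;
    };

    struct batch_conn_req {
        uint64_t src_guid;              /* requesting node */
        uint32_t count;                 /* number of entries that follow */
        struct qp_conn_info qp_info[];  /* one per remote process */
    };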


- Sean



RE: [openib-general] SA cache design

2006-01-16 Thread Rimmer, Todd
> From: Eitan Zahavi [mailto:[EMAIL PROTECTED]
> What was I thinking ...
> for (target = (myRank + 1) % numNodes; target != myRank;
>      target = (target + 1) % numNodes) {
>     /* establish connection to node target */
> }
This can be even simpler for MPI.

Given some nodes must listen and others must connect, have an approach such as 
higher rank processes connect to lower rank processes.

Then it's simply:

    initiate listen on my endpoint  /* could omit this for the highest rank in the job */

    for (target = my_rank - 1; target >= 0; target--)
        initiate connect to target

For even greater efficiency, the "initiate connect to target" could be done in 
parallel batches.  E.g., start 50 outbound connects, wait for some or all of them 
to complete, then start the next batch.  Such as:

    for (target = my_rank - 1; target >= 0; target--) {
        while (num_outstanding > limit)
            wait
        num_outstanding++
        initiate connect to target
    }

Then the callback for completing a connection sequence could decrement 
num_outstanding and wake up the waiter (or the waiter could be a sleep/poll type 
loop).
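
A self-contained sketch of that batching, assuming a hypothetical asynchronous 
initiate_connect() whose completion callback invokes connect_done():

    #include <pthread.h>

    #define CONN_LIMIT 50

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int num_outstanding;

    extern void initiate_connect(int target);  /* assumed async connect call */

    /* Invoked from the CM callback when a connection sequence completes. */
    void connect_done(void)
    {
        pthread_mutex_lock(&lock);
        num_outstanding--;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    void connect_to_lower_ranks(int my_rank)
    {
        int target;

        for (target = my_rank - 1; target >= 0; target--) {
            pthread_mutex_lock(&lock);
            while (num_outstanding >= CONN_LIMIT)
                pthread_cond_wait(&cond, &lock);
            num_outstanding++;
            pthread_mutex_unlock(&lock);
            initiate_connect(target);
        }
    }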

We have been successfully using the algorithms above for about 2-3 years now 
and they work very well.

Todd Rimmer


RE: [openib-general] SA cache design

2006-01-16 Thread Eitan Zahavi
What was I thinking ...
for (target = (myRank + 1) % numNodes; target != myRank;
     target = (target + 1) % numNodes) {
    /* establish connection to node target */
}

> [EZ] I would try and make sure the connections are not done in a manner
> such that all nodes try to establish connections to a single node at the
> same time. This is an application issue but can be easily resolved. Do
> the MPI connection in a loop like:
> 
> for (target = (myRank + 1) % numNodes; target != myRank;
>      (target++) % numNodes) {
>     /* establish connection to node target */
> }
> 



RE: [openib-general] SA cache design

2006-01-16 Thread Eitan Zahavi
Hi Sean

> Eitan Zahavi wrote:
> > [EZ] The scalability issues we see today are what I most worry about.
> 
> One issue that I see is that the CMA, IB CM, and DAPL APIs support only
> point-to-point connections.  Trying to layer a many-to-many connection
> model over these is leading to the inefficiencies.  For example, the CMA
> generates one SA query per connection.  Another issue is that even if the
> number of queries were reduced, the fabric will still see O(n^2)
> connection messages.
[EZ] Having N^2 messages is not a big problem if they do not all go to one
target...
CM is distributed and this is good. Only the PathRecord section of the
connection establishment goes today to one node (the SA), and you are
about to fix it...
During initial connection setup you will not have anything in the SA
cache, and thus the SA will need to answer N^2 PathRecords. Smart
exponential back-off can resolve that DoS attack on the SA at bring-up.
> 
> Based on the code, the only SA query of interest to most users will be a
> path record query by gids/pkey.  To speed up applications written to the
> current CMA, DAPL, and Intel's MPI (hey, I gotta eat), my actual
> implementation has a very limited path record cache in the kernel.  The
> cache uses an index with O(1) insertion, removal, and retrieval.  (I plan
> on re-using the index to help improve the performance of the IB CM as
> well.)
[EZ] We might need a little more in the key for QoS support (to come).
> 
> I'm still working on ideas to address the many-to-many connection model.
[EZ] I would try and make sure the connections are not done in a manner
such that all nodes try to establish connections to a single node at the
same time. This is an application issue but can be easily resolved. Do
the MPI connection in a loop like:

for (target = (myRank + 1) % numNodes; target != myRank;
     (target++) % numNodes) {
    /* establish connection to node target */
}

> One idea is to have a centralized connection manager to coordinate the
> connections between the various endpoints.  The drawback is that this
> requires defining a proprietary protocol.  Any implementation work in
> this area will be deferred for now though.
[EZ] I think a centralized CM is only going to make things worse.
> 
> - Sean


Re: [openib-general] SA cache design

2006-01-16 Thread Sean Hefty

Eitan Zahavi wrote:

[EZ] The scalability issues we see today are what I most worry about.


I think that we have a couple scalability issues at the core of this problem.  I 
think that a cache can solve part of the problem, but to fully address the 
issues, we eventually may need to extend our APIs and underlying protocols.


One issue that I see is that the CMA, IB CM, and DAPL APIs support only 
point-to-point connections.  Trying to layer a many-to-many connection model 
over these is leading to the inefficiencies.  For example, the CMA generates one 
SA query per connection.  Another issue is that even if the number of queries 
were reduced, the fabric will still see O(n^2) connection messages.


Based on the code, the only SA query of interest to most users will be a path 
record query by gids/pkey.  To speed up applications written to the current CMA, 
DAPL, and Intel's MPI (hey, I gotta eat), my actual implementation has a very 
limited path record cache in the kernel.  The cache uses an index with O(1) 
insertion, removal, and retrieval.  (I plan on re-using the index to help 
improve the performance of the IB CM as well.)
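
For flavor, an O(1) index of this kind could look roughly like the sketch 
below; this is illustrative only, not the actual kernel code:

    #include <stdint.h>
    #include <string.h>

    #define CACHE_BUCKETS 4096  /* power of two */

    struct path_rec_ent {
        uint8_t  dgid[16];
        uint16_t pkey;
        /* ... remaining path record fields ... */
        struct path_rec_ent *next;  /* hash chain */
    };

    static struct path_rec_ent *buckets[CACHE_BUCKETS];

    static unsigned hash_key(const uint8_t *dgid, uint16_t pkey)
    {
        unsigned h = pkey;
        int i;

        for (i = 0; i < 16; i++)
            h = h * 31 + dgid[i];
        return h & (CACHE_BUCKETS - 1);
    }

    /* O(1) expected-time retrieval by dgid/pkey. */
    struct path_rec_ent *cache_find(const uint8_t *dgid, uint16_t pkey)
    {
        struct path_rec_ent *e;

        for (e = buckets[hash_key(dgid, pkey)]; e; e = e->next)
            if (e->pkey == pkey && !memcmp(e->dgid, dgid, 16))
                return e;
        return NULL;
    }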


I'm still working on ideas to address the many-to-many connection model.  One 
idea is to have a centralized connection manager to coordinate the connections 
between the various endpoints.  The drawback is that this requires defining a 
proprietary protocol.  Any implementation work in this area will be deferred for 
now though.


- Sean


Re: [openib-general] SA cache design

2006-01-12 Thread Sean Hefty

Rimmer, Todd wrote:

While each process could do a GET_TABLE for all path records that
would be rather inefficient and would provide 1,000,000 path records in
the RMPP response, of which only 500 are of interest.


Each process could do a GET_TABLE for only those path records with the SGID set 
to their local port and NumPath set to 1.  This would give them only 1000 or so 
path records, most of which are of interest.
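
Concretely, such a query sets only the SGID and NumbPath bits of the 
PathRecord component mask and leaves every other field wildcarded.  A sketch 
(the bit positions are an assumption based on the IBA PathRecord ComponentMask 
layout and should be checked against ib_sa.h or the spec):

    #include <stdint.h>

    #define IB_SA_PATH_REC_SGID       (1ull << 3)
    #define IB_SA_PATH_REC_NUMB_PATH  (1ull << 11)

    /* Component mask for SubnAdmGetTable(PathRecord): all paths from my port. */
    uint64_t build_comp_mask(void)
    {
        return IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH;
    }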



Even if all 4000 processors were being used in a single run, each
process only needs 3999 path records (999 of which are unique).
In fact a given node will never need more than N of the N^2 path records,
because the remaining involve paths where this node is not involved.
So getting all 1,000,000 path records would be very inefficient.


Even a local cache wouldn't get every possible path record.  The application 
should be no different.  An application that wants to connect to every node on 
the fabric should only need to issue a single path record query, all of whose 
results are of interest.


- Sean


RE: [openib-general] SA cache design

2006-01-12 Thread Rimmer, Todd
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
> Rimmer, Todd wrote:
> > Each MPI process is independent.  However they all need to get
> > pathrecords for all the other processes/nodes in the system.
> > Hence, each process on a node will make the exact same set of queries.
> 
> That should still only be P queries per node, with P = number of
> processes on a node.  Why doesn't a single query (GET_TABLE) suffice
> for each process?

Given a cluster with 1000 nodes, 4 processors per node.
A given MPI run may choose to use a subset, for example 500 processes.
Each process needs path records for the other 500 processes, but not
for the other 3500 CPUs in the cluster.

While each process could do a GET_TABLE for all path records, that
would be rather inefficient and would provide 1,000,000 path records in
the RMPP response, of which only 500 are of interest.

Even if all 4000 processors were being used in a single run, each
process only needs 3999 path records (999 of which are unique).
In fact a given node will never need more than N of the N^2 path records,
because the remaining involve paths where this node is not involved.
So getting all 1,000,000 path records would be very inefficient.

Then multiply this by 4 processes per node making this same set of
queries.  Then multiply this by multiple partitions, SLs, etc. per node,
and it gets very inefficient to simply get the whole table.

Todd Rimmer


Re: [openib-general] SA cache design

2006-01-12 Thread Sean Hefty

Rimmer, Todd wrote:

Each MPI process is independent.  However they all need to get pathrecords
for all the other processes/nodes in the system.
Hence, each process on a node will make the exact same set of queries.


That should still only be P queries per node, with P = number of processes on a 
node.  Why doesn't a single query (GET_TABLE) suffice for each process?


- Sean


Re: [openib-general] SA cache design

2006-01-12 Thread Grant Grundler
On Thu, Jan 12, 2006 at 11:58:28AM -0800, Sean Hefty wrote:
> This is still O(N log N) operations, which made me look at indexing
> schemes to improve performance.

I strongly associate "Indexing schemes"  with "judy":
http://docs.hp.com/en/B6841-90001/ix01.html

The open source project is here:
http://judy.sourceforge.net/

> The most obvious implementation to me was to store path records in a binary 
> tree sorted by dgid/pkey.  But this isn't very flexible.

"dynamic, associative array" might be overkill too.
I'm not sure how many index's it supports but Judy is definitely
worth looking at for a "simple" implementation.

Perf data I've seen 3-4 years ago indicated that Judy scales
nicely from 0 to several million entries.

grant


RE: [openib-general] SA cache design

2006-01-12 Thread Rimmer, Todd
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
> > why ask the SA the same question multiple times in a row?
> 
> I have no idea why the application did this.  Are any of the queries
> in this case actually the same?

Each MPI process is independent.  However they all need to get pathrecords
for all the other processes/nodes in the system.
Hence, each process on a node will make the exact same set of queries.

Todd R.


Re: [openib-general] SA cache design

2006-01-12 Thread Sean Hefty

Eitan Zahavi wrote:

[EZ] MPI opens a connection from each node to every other node. Actually
even from every CPU to every other CPU. So this is why we have N^2
connections.


I was confusing myself.  I think that there are n(n-1)/2 connections, but that's 
still O(n^2).


- Sean


Re: [openib-general] SA cache design

2006-01-12 Thread Sean Hefty

Rimmer, Todd wrote:

1 million entry SA database.


This is exactly why I think that the SA needs to be backed by a real DBMS.


In contrast the replica on each node only needs to handle O(N) entries.
And its lookup time could be O(logN).


This is still O(N log N) operations, which made me look at indexing schemes to 
improve performance.


The most obvious implementation to me was to store path records in a binary tree 
sorted by dgid/pkey.  But this isn't very flexible.



why ask the SA the same question multiple times in a row?


I have no idea why the application did this.  Are any of the queries in this 
case actually the same?


- Sean


RE: [openib-general] SA cache design

2006-01-12 Thread Eitan Zahavi
> 
> On a related note, why does every instance of the application need to
> query for every other instance?  To establish all-to-all communication,
> couldn't instance X only initiate connections to instances > X?
> (I.e., 1 connects to 2 and 3, 2 connects to 3.)
[EZ] MPI opens a connection from each node to every other node. Actually
even from every CPU to every other CPU. So this is why we have N^2
connections.
> 
> > Only a very small subset of queries is used:
> > * PathRecord by SRC-GUID,DST-GUID
> > * PortInfo by capability mask
> 
> I did look at the code to see what queries were actually being used
> today.  And yes, we can implement for only those cases.  I wanted to
> allow the flexibility to support other queries efficiently.
[EZ] The scalability issues we see today are what I most worry about.
> 
> - Sean


Re: [openib-general] SA cache design

2006-01-12 Thread Sean Hefty

Eitan Zahavi wrote:
The issue is the number of queries grows as N^2.


I understand.

On a related note, why does every instance of the application need to query for 
every other instance?  To establish all-to-all communication, couldn't instance 
X only initiate connections to instances > X?  (I.e. 1 connects to 2 and 3, 2 
connects to 3.)


Only a very small subset of queries is used: 
* PathRecord by SRC-GUID,DST-GUID

* PortInfo by capability mask


I did look at the code to see what queries were actually being used today.  And 
yes, we can implement for only those cases.  I wanted to allow the flexibility 
to support other queries efficiently.


- Sean


RE: [openib-general] SA cache design

2006-01-12 Thread Eitan Zahavi
Hi Sean,

The issue is the number of queries grows as N^2.
Only a very small subset of queries is used:
* PathRecord by SRC-GUID,DST-GUID
* PortInfo by capability mask
Not to say the current implementations are perfect.
But RDBMSs are optimized for other requirements, not a simple single-key
lookup.
Also, the PathRecord implementation requires traversing the fabric.
One could store the results after enumerating the entire
N^2 * Nsl * Npkey * ... space.
But then lookup is a simple hash lookup.
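
A sketch of such a lookup key (illustrative; which extra fields QoS will 
require is exactly the open question here):

    #include <stdint.h>

    struct path_key {
        uint64_t src_guid;
        uint64_t dst_guid;
        uint16_t pkey;
        uint8_t  sl;    /* would be needed once QoS is in play */
    };

    /* A trivial mixing function; any decent hash over the key bytes works. */
    static inline uint32_t path_key_hash(const struct path_key *k)
    {
        uint64_t h = k->src_guid ^ (k->dst_guid * 0x9e3779b97f4a7c15ull);

        h ^= ((uint64_t)k->pkey << 8) | k->sl;
        return (uint32_t)(h ^ (h >> 32));
    }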

Eitan
> 
> Brian Long wrote:
> > How much overhead is going to be incurred by using a standard RDBMS
> > instead of not caching anything?  I'm not completely familiar with the
> > IB configurations that would benefit from the proposed SA cache, but it
> > seems to me that adding an RDBMS to anything as fast as IB would
> > actually slow things down considerably.  Can an RDBMS + SA cache
> > actually be faster than no cache at all?
> 
> I'm not sure what the speed-up of any cache will be.  The SA maintains a
> database of various related records - node records, path records, service
> records, etc. and responds to queries.  This need doesn't go away.  The
> SA itself is a perfect candidate to be implemented using a DBMS.  (And if
> one had been implemented over a DBMS, I'm not even sure that we'd be
> talking about scalability issues for only a few thousand nodes.  Is the
> perceived lack of scalability of the SA a result of the architecture or
> the existing implementations?)
> 
> My belief is that a DBMS will outperform anything that I could write to
> store and retrieve these records.  Consider that a 4000 node cluster will
> have about 8,000,000 path records.  Local caches can reduce this
> considerably (to about 4000), and if we greatly restrict the type of
> queries that are supported, then we can manage the retrieval of those
> records ourselves.
> 
> I do not want end-users to have to administer a database.  However, if
> the user only needs to install a library, then this approach seems worth
> pursuing.
> 
> - Sean


Re: [openib-general] SA cache design

2006-01-12 Thread Sean Hefty

Brian Long wrote:

What about SQLite (http://www.sqlite.org/)?  This is used by yum 2.4 in
Fedora Core and other distributions.

"SQLite is a small C library that implements a self-contained,
embeddable, zero-configuration SQL database engine."


Someone else sent me a link to this same site, and it looks promising.  Thanks.
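
For what it's worth, a minimal sketch of opening an SQLite-backed path record 
store (the file name and schema are invented for illustration):

    #include <sqlite3.h>

    int open_cache(sqlite3 **db)
    {
        const char *schema =
            "CREATE TABLE IF NOT EXISTS path_rec ("
            "  sgid BLOB, dgid BLOB, pkey INTEGER, slid INTEGER,"
            "  dlid INTEGER, sl INTEGER, mtu INTEGER, rate INTEGER);"
            "CREATE INDEX IF NOT EXISTS path_idx ON path_rec (dgid, pkey);";

        if (sqlite3_open("sa_cache.db", db) != SQLITE_OK)
            return -1;
        /* create the table and index on first use */
        return sqlite3_exec(*db, schema, NULL, NULL, NULL) == SQLITE_OK ? 0 : -1;
    }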

- Sean



Re: [openib-general] SA cache design

2006-01-12 Thread Brian Long
On Thu, 2006-01-12 at 10:16 -0800, Sean Hefty wrote:
> Brian Long wrote:
> > How much overhead is going to be incurred by using a standard RDBMS
> > instead of not caching anything?  I'm not completely familiar with the
> > IB configurations that would benefit from the proposed SA cache, but it
> > seems to me that adding an RDBMS to anything as fast as IB would actually
> > slow things down considerably.  Can an RDBMS + SA cache actually be
> > faster than no cache at all?
> 
> I'm not sure what the speed-up of any cache will be.  The SA maintains a 
> database of various related records - node records, path records, service 
> records, etc. and responds to queries.  This need doesn't go away.  The SA 
> itself is a perfect candidate to be implemented using a DBMS.  (And if one 
> had been implemented over a DBMS, I'm not even sure that we'd be talking 
> about scalability issues for only a few thousand nodes.  Is the perceived 
> lack of scalability of the SA a result of the architecture or the existing 
> implementations?)
> 
> My belief is that a DBMS will outperform anything that I could write to 
> store and retrieve these records.  Consider that a 4000 node cluster will 
> have about 8,000,000 path records.  Local caches can reduce this 
> considerably (to about 4000), and if we greatly restrict the type of 
> queries that are supported, then we can manage the retrieval of those 
> records ourselves.
> 
> I do not want end-users to have to administer a database.  However, if the 
> user only needs to install a library, then this approach seems worth 
> pursuing.

What about SQLite (http://www.sqlite.org/)?  This is used by yum 2.4 in
Fedora Core and other distributions.

"SQLite is a small C library that implements a self-contained,
embeddable, zero-configuration SQL database engine."

/Brian/

-- 
   Brian Long  | |   |
   IT Data Center Systems  |   .|||.   .|||.
   Cisco Linux Developer   |   ..:|||:...:|||:..
   Phone: (919) 392-7363   |   C i s c o   S y s t e m s



RE: [openib-general] SA cache design

2006-01-12 Thread Rimmer, Todd
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
> I'm not sure what the speed-up of any cache will be.  The SA maintains a
> database of various related records - node records, path records, service
> records, etc. and responds to queries.  This need doesn't go away.  The
> SA itself is a perfect candidate to be implemented using a DBMS.  (And if
> one had been implemented over a DBMS, I'm not even sure that we'd be
> talking about scalability issues for only a few thousand nodes.  Is the
> perceived lack of scalability of the SA a result of the architecture or
> the existing implementations?)

The scalability problem occurs during things like MPI job startup.
At startup, you will have N processes which each need N-1 path
records to establish connections.  Those queries require both Node Record
and Path Record queries.

This means at job startup, the SA must process O(N^2) SA queries.
If the lookup algorithm in the SA is O(log M) (M = number of SA records,
which is O(N^2)), then the SA will have
O(N^2 log(N^2)) operations to perform and O(N^2) packets to send and receive.

For a 4000 CPU cluster (1000 nodes with 2 dual core CPUs each),
that is over 16 million SA queries at job startup against a 1 million entry
SA database.
It would take quite a good SA database implementation to handle that
in a timely manner.

In contrast the replica on each node only needs to handle O(N) entries.
And its lookup time could be O(log N).

You'll note I spoke of processes, not nodes.  In multi-CPU nodes,
each process will need similar information.  This is one area where a
replica can greatly help: why ask the SA the same question multiple times
in a row?

If only a cache is considered, then startup is still O(N^2) SA queries;
it's just that we have 1/CPUs-per-node as many queries.

Todd Rimmer


Re: [openib-general] SA cache design

2006-01-12 Thread Sean Hefty

Brian Long wrote:

How much overhead is going to be incurred by using a standard RDBMS
instead of not caching anything?  I'm not completely familiar with the
IB configurations that would benefit from the proposed SA cache, but it
seems to me that adding an RDBMS to anything as fast as IB would actually
slow things down considerably.  Can an RDBMS + SA cache actually be
faster than no cache at all?


I'm not sure what the speed-up of any cache will be.  The SA maintains a 
database of various related records - node records, path records, service 
records, etc. and responds to queries.  This need doesn't go away.  The SA 
itself is a perfect candidate to be implemented using a DBMS.  (And if one had 
been implemented over a DBMS, I'm not even sure that we'd be talking about 
scalability issues for only a few thousand nodes.  Is the perceived lack of 
scalability of the SA a result of the architecture or the existing implementations?)


My belief is that a DBMS will outperform anything that I could write to store 
and retrieve these records.  Consider that a 4000 node cluster will have about 
8,000,000 path records.  Local caches can reduce this considerably (to about 
4000), and if we greatly restrict the type of queries that are supported, then 
we can manage the retrieval of those records ourselves.


I do not want end-users to have to administer a database.  However, if the user 
only needs to install a library, then this approach seems worth pursuing.


- Sean


Re: [openib-general] SA cache design

2006-01-12 Thread Brian Long
On Wed, 2006-01-11 at 14:21 -0800, Sean Hefty wrote:
> Rimmer, Todd wrote:
> > A relational database is overkill for this function.
> > It will also likely be more complex for end users to set up and debug.
> > The cache setup should be simple.  The solution should be such that
> > just an on/off switch needs to be configured (with a default of on)
> > for most users to get started.
> 
> My take is a little different.  I view the SA as a database that
> maintains related attributes.
> 
> By supporting relationships between different attributes, we can provide
> a more powerful, higher performing, and more user-friendly interface to
> the user.  For example, a single SQL query could return path records
> given only a node description or service name.  Today, we generate
> multiple SA queries, their responses, and associated RMPP MADs to obtain
> the same data.
> 
> I'm not sold on the idea of using a relational database, because of the
> additional complexity for end-users.  However, I believe it can offer
> significant advantages over what we could code ourselves.

How much overhead is going to be incurred by using a standard RDBMS
instead of not caching anything?  I'm not completely familiar with the
IB configurations that would benefit from the proposed SA cache, but it
seems to me that adding an RDBMS to anything as fast as IB would actually
slow things down considerably.  Can an RDBMS + SA cache actually be
faster than no cache at all?

/Brian/

-- 
   Brian Long  | |   |
   IT Data Center Systems  |   .|||.   .|||.
   Cisco Linux Developer   |   ..:|||:...:|||:..
   Phone: (919) 392-7363   |   C i s c o   S y s t e m s



Re: [openib-general] SA cache design

2006-01-11 Thread Sean Hefty

Rimmer, Todd wrote:

A relational database is overkill for this function.
It will also likely be more complex for end users to set up and debug.
The cache setup should be simple.  The solution should be such that
just an on/off switch needs to be configured (with a default of on)
for most users to get started.


My take is a little different.  I view the SA as a database that maintains 
related attributes.


By supporting relationships between different attributes, we can provide a more 
powerful, higher performing, and more user-friendly interface to the user.  For 
example, a single SQL query could return path records given only a node 
description or service name.  Today, we generate multiple SA queries, their 
responses, and associated RMPP MADs to obtain the same data.
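
For illustration, the kind of single query meant here -- assuming hypothetical 
path_rec and node_rec tables that share a port GID column:

    const char *query =
        "SELECT p.* FROM path_rec p "
        "JOIN node_rec n ON p.dgid = n.port_gid "
        "WHERE n.node_desc = 'compute-042';";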


I'm not sold on the idea of using a relational database, because of the 
additional complexity for end-users.  However, I believe it can offer 
significant advantages over what we could code ourselves.


- Sean


Re: [openib-general] SA cache design

2006-01-11 Thread Sean Hefty

Greg Lindahl wrote:

Since no one's really answered this yet:

Many sysadmins are not going to want to install a relational database
to run an SA cache. So I'd stick to Berkeley DB if I were you.


Thanks for the response.  To be clear, the cache would be an optional component, 
and likely only needed for larger configurations.


From what I can tell, PostgreSQL and MySQL both ship with RedHat and SuSE. 
MySQL claims that it can be built as a small library that can then be integrated 
with an application.  It may be possible to have the application do everything 
for the user except install the necessary libraries... ?


The installation and configuration of a database is what I see as the biggest 
drawback to going this route.  Unfortunately, I need to play with this idea more 
to see how much of an impact that would be to an actual user.


- Sean


Re: [openib-general] SA cache design

2006-01-11 Thread Greg Lindahl
Since no one's really answered this yet:

Many sysadmins are not going to want to install a relational database
to run an SA cache. So I'd stick to Berkeley DB if I were you.

-- greg




RE: [openib-general] SA cache design

2006-01-11 Thread Rimmer, Todd
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
> 
> Eitan Zahavi wrote:
> > Is the intention to speed up SA queries?
> > Or is it to have persistent storage of them?
> 
> I want both.  :)

I would clarify that the best bang for the effort will be to focus on
the queries which the ULPs themselves will use most often.
For example, the resolution from a node name or Node GUID to a path record.

While a general purpose replica would be nice, it could overcomplicate the
initial design.

The goal is not to optimize all the queries an end user might desire, but
rather to help avoid the O(N^2) load which things like startup of
an MPI or SDP application could cause on the SA.

> and how/when it is invalidated by the SM.
There are a variety of notices already available from the SM which should
be used for the triggering or invalidation, such as:
GID In/Out of Service
Client Reregistration

It may also be desirable, upon a failed connect to a given remote node,
to have the CM trigger the local replica to invalidate and requery for
information about that remote node.
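
As a sketch, those notices might map onto cache actions as follows (trap 
numbers 64/65 are the IBA GID in/out of service traps; the handler names are 
hypothetical):

    #include <stdint.h>

    enum {
        SM_TRAP_GID_IN_SERVICE     = 64,
        SM_TRAP_GID_OUT_OF_SERVICE = 65,
    };

    void cache_invalidate_gid(const uint8_t *gid);  /* hypothetical helpers */
    void cache_refresh_gid(const uint8_t *gid);
    void cache_flush_all(void);

    /* Called when an SA notice (InformInfo report) arrives. */
    void handle_sm_notice(uint16_t trap_num, const uint8_t *gid)
    {
        switch (trap_num) {
        case SM_TRAP_GID_OUT_OF_SERVICE:
            cache_invalidate_gid(gid);  /* drop paths to that port */
            break;
        case SM_TRAP_GID_IN_SERVICE:
            cache_refresh_gid(gid);     /* requery paths to that port */
            break;
        }
    }

    /* Client reregistration implies SA state was lost; start over. */
    void handle_client_reregister(void)
    {
        cache_flush_all();
    }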

Todd R.


Re: [openib-general] SA cache design

2006-01-11 Thread Sean Hefty

Eitan Zahavi wrote:

Is the intention to speed up SA queries?
Or is it to have persistent storage of them?


I want both.  :)


I think we should focus on the kind of data to cache,
how it is made transparently available to any OpenIB client,
and how/when it is invalidated by the SM.
We should only keep the cache data in memory, not on disk.


In order to support advanced queries efficiently, some sort of indexing scheme 
would be needed.  This is what a database system would provide, saving us from 
having to implement that part.  The fact that the database could also provide 
persistent storage and triggers is just an additional advantage.



Later if we want to make it persistent or even stored in LDAP/SQL...
I do not care. But the first implementation should be in memory.


I think that you're assuming that an initial implementation that is done just in 
memory would be quicker to complete.  I'm not really wanting to write a complete 
throw-away solution capable of supporting only one or two very simple queries 
efficiently.



BTW: most of the databases referred to by these mails do not support
distributed shadow copies of centrally controlled tables.


Personally, I'd be happy with a simple database that provided nothing more than 
indexing and query support.


- Sean


Re: [openib-general] SA cache design

2006-01-11 Thread Sean Hefty

James Lentini wrote:
Will it be possible to use the OpenIB stack without setting up the SA 
cache?


Yes.

- Sean


RE: [openib-general] SA cache design

2006-01-11 Thread Rimmer, Todd
> On Tue, 10 Jan 2006, Sean Hefty wrote:
> 
> > Grant Grundler wrote:
> > > I forgot to point out postgres:
> > >   http://www.postgresql.org/about/
> > 
> > This looks like it would work well.
> > 
> > The question that I have for users is:  Is it acceptable for the 
> > cache to make use of a relational database system?

A relational database is overkill for this function.
It will also likely be more complex for end users to set up and debug.
The cache setup should be simple.  The solution should be such that
just an on/off switch needs to be configured (with a default of on)
for most users to get started.

Todd Rimmer


Re: [openib-general] SA cache design

2006-01-11 Thread Eitan Zahavi

Hi Sean,

Now I really lost you:

Is the intention to speed up SA queries?
Or is it to have persistent storage of them?

I think we should focus on the kind of data to cache,
how it is made transparently available to any OpenIB client,
and how/when it is invalidated by the SM.
We should only keep the cache data in memory, not on disk.

Later if we want to make it persistent or even stored in LDAP/SQL...
I do not care. But the first implementation should be in memory.

BTW: most of the databases referred to by these mails do not support
distributed shadow copies of centrally controlled tables.

Eitan

Sean Hefty wrote:

Sean Hefty wrote:

To keep the design as flexible as possible, my plan is to implement 
the cache in userspace.  The interface to the cache would be via 
MADs.  Clients would send their queries to the sa_cache instead of the 
SA itself.  The format of the MADs would be essentially identical to 
those used to query the SA itself.  Response MADs would contain any 
requested information.  If the cache could not satisfy a request, the 
sa_cache would query the SA, update its cache, then return a reply.



What I think I really want is a distributed relational database 
management system with an SQL interface and triggers that maintains the 
SA data...  (select * from path_rec where sgid=x and dgid=y and pkey=z)

But without making any assumptions about the SA, a local cache could 
still use an RDBMS to store and retrieve the data records.  Would 
requiring an RDBMS on each system be acceptable?  If not, then writing a 
small, dumb pseudo-database as part of the sa_cache could provide a lot 
of flexibility.


- Sean




Re: [openib-general] SA cache design

2006-01-11 Thread James Lentini


On Tue, 10 Jan 2006, Sean Hefty wrote:

> Grant Grundler wrote:
> > I forgot to point out postgres:
> > http://www.postgresql.org/about/
> 
> This looks like it would work well.
> 
> The question that I have for users is:  Is it acceptable for the 
> cache to make use of a relational database system?

Will it be possible to use the OpenIB stack without setting up the SA 
cache?


Re: [openib-general] SA cache design

2006-01-10 Thread Sean Hefty

Grant Grundler wrote:

I forgot to point out postgres:
http://www.postgresql.org/about/


This looks like it would work well.

The question that I have for users is:  Is it acceptable for the cache to make 
use of a relational database system?


The disadvantage is that an RDBMS would need to be installed and configured on 
several, or all, systems.  (It's not clear to me yet how much of that could be 
automated.)


The advantage is that the cache would gain the benefits of having a database 
backend - notably support for more complex queries, persistent storage, and 
indexing to increase query performance.


To provide some additional context, path record queries can be fairly complex, 
involving a number of fields.  (All queries today are limited to sgid, dgid, and 
pkey.)  Trying to efficiently retrieve a path record based on a dgid and pkey is 
non-trivial, and support for queries with additional restrictions or for other 
SA records complicates this issue.


- Sean


Re: [openib-general] SA cache design

2006-01-10 Thread Grant Grundler
On Tue, Jan 10, 2006 at 03:00:46PM -0800, Sean Hefty wrote:
> I did find that libdb-4.2 was installed on SuSE and RedHat systems, and a 
> libodbc was on my SuSE system.  Libdb-4.2 would help with managing some of 
> the SA objects in a file, but is limited in its data storage and retrieval 
> capabilities.  If a true relational database couldn't be used, libdb would 
> definitely be useful.

I forgot to point out postgres:
http://www.postgresql.org/about/

Several packages (e.g. postfix, ldap) offer different backends so
the admin can decide how sophisticated the data storage and retrieval
needs to be. With roughly 150K employees, HP has a rather sophisticated
LDAP/postfix setup to manage logins. But I don't need that for the 10 boxes
I manage outside the firewall. The same is probably true for the SA cache.

grant


Re: [openib-general] SA cache design

2006-01-10 Thread Sean Hefty

Grant Grundler wrote:

We already have several databases for different things:
makedb  (primarily for NSS)
updatedb (fast lookup of local files)
mandb   (man pages)
rpmdb   (yes, even on debian boxes)
sasldbconverter2 (for SASL - linux security/login stuff)
*db4.3* (Berkeley v4.3 Database - used by apt-get/dpkg, Apache,
python, libnss-db, postfix, etc)

In fact, looks like a debian "testing" box would be dysfunctional
without Berkeley Database. Would that work?


Thanks for pointing these out.

I did find that libdb-4.2 was installed on SuSE and RedHat systems, and a 
libodbc was on my SuSE system.  Libdb-4.2 would help with managing some of the 
SA objects in a file, but is limited in its data storage and retrieval 
capabilities.  If a true relational database couldn't be used, libdb would 
definitely be useful.


- Sean


Re: [openib-general] SA cache design

2006-01-10 Thread Grant Grundler
On Tue, Jan 10, 2006 at 10:55:36AM -0800, Sean Hefty wrote:
> What I think I really want is a distributed relational database management 
> system with an SQL interface and triggers that maintains the SA data...  
> (select * from path_rec where sgid=x and dgid=y and pkey=z)
> 
> But without making any assumptions about the SA, a local cache could still 
> use an RDBMS to store and retrieve the data records.  Would requiring an 
> RDBMS on each system be acceptable?

We already have several databases for different things:
makedb  (primarily for NSS)
updatedb (fast lookup of local files)
mandb   (man pages)
rpmdb   (yes, even on debian boxes)
sasldbconverter2 (for SASL - linux security/login stuff)
*db4.3* (Berkeley v4.3 Database - used by apt-get/dpkg, Apache,
python, libnss-db, postfix, etc)

In fact, looks like a debian "testing" box would be dysfunctional
without Berkeley Database. Would that work?

sleepycat.org gives more examples of opensource use:
OpenLDAP, Kerberos, Subversion, Sendmail, Postfix,
SquidGuard, NetaTalk, Movable Type, SpamAssassin,
Mail Avenger, Bogofilter


hth,
grant

> If not, then writing a small, dumb pseudo-database as part of the
> sa_cache could provide a lot of flexibility.
> 
> - Sean


Re: [openib-general] SA cache design

2006-01-10 Thread Sean Hefty

Sean Hefty wrote:
To keep the design as flexible as possible, my plan is to implement the 
cache in userspace.  The interface to the cache would be via MADs.  
Clients would send their queries to the sa_cache instead of the SA 
itself.  The format of the MADs would be essentially identical to those 
used to query the SA itself.  Response MADs would contain any requested 
information.  If the cache could not satisfy a request, the sa_cache 
would query the SA, update its cache, then return a reply.
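
In pseudocode, that request flow might look like this (all names are 
placeholders, not actual sa_cache interfaces):

    struct sa_query;
    struct sa_records;

    struct sa_records *cache_lookup(struct sa_query *req);
    struct sa_records *query_sa(struct sa_query *req);
    void cache_insert(struct sa_query *req, struct sa_records *recs);
    void send_reply(struct sa_query *req, struct sa_records *recs);

    void sa_cache_handle_request(struct sa_query *req)
    {
        struct sa_records *recs;

        recs = cache_lookup(req);       /* try the local cache first */
        if (!recs) {
            recs = query_sa(req);       /* miss: forward to the real SA */
            if (recs)
                cache_insert(req, recs);
        }
        send_reply(req, recs);          /* MAD formatted like an SA response */
    }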


What I think I really want is a distributed relational database management 
system with an SQL interface and triggers that maintains the SA data...  (select 
* from path_rec where sgid=x and dgid=y and pkey=z)


But without making any assumptions about the SA, a local cache could still use 
an RDBMS to store and retrieve the data records.  Would requiring an RDBMS on 
each system be acceptable?  If not, then writing a small, dumb pseudo-database 
as part of the sa_cache could provide a lot of flexibility.


- Sean


RE: [openib-general] SA cache design

2006-01-06 Thread Eitan Zahavi
Hi Sean

I am still confused about the exact requirement. But the reference is:
osm/opensm/osm_sa_path_record.c

The rest of the queries are handled by the osm_sa_*.c files (but not the
code in the _ctrl.c files):
osm_sa_class_port_info.c  
osm_sa_response.c
osm_sa_node_record.c  
osm_sa_service_record.c
osm_sa_informinfo.c 
osm_sa_path_record.c  
osm_sa_slvl_record.c
osm_sa_lft_record.c   
osm_sa_lft_record_ctrl.c 
osm_sa_sminfo_record.c
osm_sa_link_record.c  
osm_sa_pkey_record.c 
osm_sa_vlarb_record.c
osm_sa_mad_ctrl.c 
osm_sa_portinfo_record.c  
osm_sa_mcmember_record.c

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -Original Message-
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 06, 2006 10:40 PM
> To: Hal Rosenstock
> Cc: Eitan Zahavi; openib
> Subject: Re: [openib-general] SA cache design
> 
> Hal Rosenstock wrote:
> > I would view that the database is an SADB with the actual pathrecords
> > as one example rather than the SMDB from which they are calculated. I
> > think Sean is interested in the SA packet query/response code here to
> > avoid recreating this, and that the backend would be stripped out.
> > Sean, is that accurate ?
> 
> Hal is correct.
> 
> - Sean


Re: [openib-general] SA cache design

2006-01-06 Thread Sean Hefty

Hal Rosenstock wrote:

I would view that the database is an SADB with the actual pathrecords as
one example rather than the SMDB from which they are calculated. I think
Sean is interested in the SA packet query/response code here to avoid
recreating this, and that the backend would be stripped out. Sean, is
that accurate ?


Hal is correct.

- Sean


Re: [openib-general] SA cache design

2006-01-06 Thread Sean Hefty

Sean Hefty wrote:
- The MAD interface will result in additional data copies and userspace 
to kernel transitions for clients residing on the local system.
- Clients require a mechanism to locate the sa_cache, or need to make 
assumptions about its location.


Based on some comments from people, I believe that we can handle the latter 
problem when the sa_cache/sa_replica/sa_whateveryouwanttocallit registers with 
the MAD layer.  Ib_mad can record an sa_lid and sa_sl as part of a device's port 
attributes.  These would initially be set the same as sm_lid and sm_sl.  When a 
client registers to receive unsolicited SA MADs, the attributes would be updated 
accordingly.  ib_sa and other clients sending MADs to the SA would use these 
values in place of the SM values.


I'm not fond of the idea of pushing an SA switch into the MAD layer, since this 
makes it more difficult for the actual cache to query the SA directly.


Another approach that may work better long term is treating the cache as a 
redirected SA request.  Something along the lines of:


http://openib.org/pipermail/openib-general/2005-September/011349.html

(but with a restricted implementation for now) might also work.

- Sean


Re: [openib-general] SA cache design

2006-01-06 Thread Hal Rosenstock
On Fri, 2006-01-06 at 14:55, Eitan Zahavi wrote:

> I guess you mean the code that is answering PathRecord queries?
> It is possible to extract the "SMDB" objects and duplicate that database.
> I am not sure it is such a good idea. What if the SM is not OpenSM?

I would view that the database is an SADB with the actual pathrecords as
one example rather than the SMDB from which they are calculated. I think
Sean is interested in the SA packet query/response code here to avoid
recreating this, and that the backend would be stripped out. Sean, is
that accurate ?

-- Hal



Re: [openib-general] SA cache design

2006-01-06 Thread Hal Rosenstock
On Fri, 2006-01-06 at 15:13, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Fri, 2006-01-06 at 15:00, Eitan Zahavi wrote:
> > 
> >>I agree with Todd: a key is to keep the client unaware of the mux existence.
> >>So the same client can be run on a system without the cache.
> > 
> > 
> > Define same client ? I would consider it the same SA client directing
> > requests differently based on how the mux is configured, based on a query
> > to the cache (if it exists) as to its capabilities.
> SA Client can be embedded in an application - any program that can send
> MADs can be an SA client.

Such (non-OpenIB) clients would not take advantage of the cache. That
seems like the tradeoff for avoiding the duplicated forwarding of the
request. Guess I'm in the minority thinking that this might be
worthwhile.

-- Hal

> 
> > 
> > -- Hal
> > 
> > 
> >>Hal Rosenstock wrote:
> >>
> >>>On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote:
> >>>
> >>>>>From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> >>>>>On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote:
> >>>>>
> >>>>>>This of course implies the "SA Mux" must analyze more than just
> >>>>>>the attribute ID to determine if the replica can handle the query.
> >>>>>>But the memory savings is well worth the extra level of filtering.
> >>>>>
> >>>>>If the SA cache does this, it seems it would be pretty simple to
> >>>>>return this info in an attribute to the client so the client would
> >>>>>know when to go to the cache/replica and when to go direct to the SA
> >>>>>in the case where only certain queries are supported. Wouldn't this
> >>>>>be advantageous when the replica doesn't support all queries ?
> >>>>
> >>>>Why put the burden on the application?  Give the query to the Mux.
> >>>
> >>>That's what I'm suggesting. Rather than a binary switch mux, a more
> >>>granular one which determines how to route the outgoing SA request.
> >>>
> >>>>With an optional flag indicating a preferred "routing" (choices of: to
> >>>>SA, to replica, let Mux decide).  Then let it decide.  As you suggest
> >>>>it may be simplest to let the Mux try the replica and on failure fall
> >>>>back to the SA transparent to the app (sort of the way SDP intercepts
> >>>>socket ops and falls back to TCP/IP when SDP isn't appropriate).
> >>>
> >>>It depends on whether the replica/cache forwards unsupported requests on
> >>>or responds with not supported back to the client as to how this is
> >>>handled. Sean was proposing the forward on model and a binary switch at
> >>>the client. I think this is more granular and can be mux'd only with the
> >>>knowledge of what a replica/cache supports (not sure about dealing with
> >>>different replica/caches supporting a different set of queries; need to
> >>>think more on how the caches are located, etc.). You are mentioning a
> >>>third model here.
> >>>
> >>>-- Hal
> >>>
> >>
> 



Re: [openib-general] SA cache design

2006-01-06 Thread Eitan Zahavi

Sean Hefty wrote:

Eitan Zahavi wrote:

Can someone familiar with the opensm code tell me how difficult it 
would be to extract out the code that tracks the subnet data and 
responds to queries?



I guess you mean the code that is answering PathRecord queries?



Yes - that along with answering other queries.


It is possible to extract the "SMDB" objects and duplicate that database.
I am not sure it is such a good idea. What if the SM is not OpenSM?



I was thinking in terms of code re-use, and not in terms of which SM was 
running.  Interfacing to the SM would be through standard queries.

The issue is that answering PathRecord queries can have an impact on further 
algorithms the SM takes.
It might not be enough to know the topology, SL2VL, LFT, and MFT to answer 
PathRecord attributes...


- Sean




Re: [openib-general] SA cache design

2006-01-06 Thread Eitan Zahavi

Hal Rosenstock wrote:

On Fri, 2006-01-06 at 15:00, Eitan Zahavi wrote:


I agree with Todd: a key is to keep the client unaware of the mux existence.
So the same client can be run on a system without the cache.



Define same client ? I would consider it the same SA client directing
requests differently based on how the mux is configured, based on a query
to the cache (if it exists) as to its capabilities.

SA Client can be embedded in an application - any program that can send MADs 
can be an SA client.



-- Hal



Hal Rosenstock wrote:


On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote:



From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote:


This of course implies the "SA Mux" must analyze more than just 
the attribute ID to determine if the replica can handle the query.  
But the memory savings is well worth the extra level of filtering.


If the SA cache does this, it seems it would be pretty simple to return
this info in an attribute to the client so the client would know when to
go to the cache/replica and when to go direct to the SA in the case
where only certain queries are supported. Wouldn't this be advantageous
when the replica doesn't support all queries ?


Why put the burden on the application?  Give the query to the Mux.



That's what I'm suggesting. Rather than a binary switch mux, a more
granular one which determines how to route the outgoing SA request.



With an optional flag indicating a preferred "routing" (choices of: to SA, 
to replica, let Mux decide).  Then let it decide.  As you suggest it may 
be simplest to let the Mux try the replica and on failure fall back 
to the SA transparent to the app (sort of the way SDP intercepts 
socket ops and falls back to TCP/IP when SDP isn't appropriate).



It depends on whether the replica/cache forwards unsupported requests on
or responds with not supported back to the client as to how this is
handled. Sean was proposing the forward on model and a binary switch at
the client. I think this is more granular and can be mux'd only with the
knowledge of what a replica/cache supports (not sure about dealing with
different replica/caches supporting a different set of queries; need to
think more on how the caches are located, etc.). You are mentioning a
third model here.

-- Hal







Re: [openib-general] SA cache design

2006-01-06 Thread Eitan Zahavi

Hi Todd,

So you agree we will need to design "replica" buildup scalability features into 
the solution (to avoid the bring-up load on the SA)?

Why would a caching system not work here, instead of replicating the data?

The caching concept allows for the SA to still be in the loop by invalidating 
the cache or through cache entry lifetime policy.

The reason I think a total replica (distribution of the SA) would eventually be 
problematic is that as we approach QoS solutions,
some need for path record use and retirement is going to show up. What if the 
SM decides to change SL2VL maps due to a new QoS requirement?
We will need a more complicated "synchronization" or invalidation technique to 
push that kind of data into the "replica" SAs.

Eitan

Rimmer, Todd wrote:

From: Eitan Zahavi [mailto:[EMAIL PROTECTED]
Hi Sean, Todd,

Although I like the "replica" idea for its "query" performance boost - I 
suspect it will actually not scale for very large networks: each node having to 
query for the entire database would cause N^2 load on the SA.
After any change (and changes happen with higher probability on large networks) 
the SA will need to send each Report to N targets.


We already have some bad experience with SA query issues on large clusters, 
like the one reported by Roland:

"searching for SRP targets using PortInfo capability mask".



Our experience has been the exact opposite.
While there is an initial load on the SA to populate the replica, we have used 
various techniques to reduce it, such as backing off when the SA reports Busy 
and having a random time offset for the start of the query.  The boost occurs when a 
new application starts, such as an MPI using the SA/CM to establish connections 
as per the IBTA spec.  A 1000 process MPI job would have each process make 999 
queries to the SA at job startup time.  This causes a burst of 999 sets of 
SA queries (most will involve both Node Record and Path Record queries, so it 
will really be 2x this amount), BEFORE the MPI job can actually start.

As OpenIB moves forward to implement QoS and other features, MPI will have to 
use the SA to get its path records.  If you study MVAPICH at present, it merely 
exchanges LIDs between nodes and hardcodes all the other QoS parameters (or, via 
environment variables, uses the same value for all processes).  In a true QoS 
and congestion management environment it will instead have to use the CM/SA.

We have been using this replica technique quite successfully for 2-3 years now. 
 Our MPI has used the SA/CM for connection establishment for just as long.

As it was pointed out, most fabrics will be quite stable.  Hence having a 
replica and paying the cost of the SA queries once will be much more efficient 
than paying that cost on every application startup.

Todd Rimmer



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Hal Rosenstock
On Fri, 2006-01-06 at 15:00, Eitan Zahavi wrote:
> I agree with Todd: a key is to keep the client unaware of the mux's existence.
> So the same client can be run on a system without the cache.

Define "same client" ? I would consider it the same SA client directing
requests differently based on how the mux is configured, based in turn on a query
to the cache (if it exists) as to its capabilities.

-- Hal

> Hal Rosenstock wrote:
> > On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote:
> > 
> >>>From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> >>>On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote:
> >>>
> This of course implies the "SA Mux" must analyze more than just 
> the attribute ID to determine if the replica can handle the query.  
> But the memory savings is well worth the extra level of filtering.
> >>>
> >>>If the SA cache does this, it seems it would be pretty simple 
> >>>to return
> >>>this info in an attribute to the client so the client would 
> >>>know when to
> >>>go to the cache/replica and when to go direct to the SA in the case
> >>>where only certain queries are supported. Wouldn't this be 
> >>>advantageous
> >>>when the replica doesn't support all queries ?
> >>
> >>Why put the burden on the application?  Give the query to the Mux.
> > 
> > 
> > That's what I'm suggesting. Rather than a binary switch mux, a more
> > granular one which determines how to route the outgoing SA request.
> > 
> > 
> >>  With an optional flag indicating a preferred "routing" (choices of: to SA, 
> >>to replica, let Mux decide).  Then let it decide.  As you suggest it may 
> >>be simplest to let the Mux try the replica and on failure fallback 
> >>to the SA transparent to the app (sort of the way SDP intercepts 
> >>socket ops and falls back to TCP/IP when SDP isn't appropriate).
> > 
> > 
> > It depends on whether the replica/cache forwards unsupported requests on
> > or responds with not supported back to the client as to how this is
> > handled. Sean was proposing the forward on model and a binary switch at
> > the client. I think this is more granular and can be mux'd only with the
> > knowledge of what a replica/cache supports (not sure about dealing with
> > different replica/caches supporting a different set of queries; need to
> > think more on how the caches are located, etc.). You are mentioning a
> > third model here.
> > 
> > -- Hal
> > 
> > ___
> > openib-general mailing list
> > openib-general@openib.org
> > http://openib.org/mailman/listinfo/openib-general
> > 
> > To unsubscribe, please visit 
> > http://openib.org/mailman/listinfo/openib-general
> 

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Hal Rosenstock
On Fri, 2006-01-06 at 14:50, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote:
> > 
> >>This of course implies the "SA Mux" must analyze more than just 
> >>the attribute ID to determine if the replica can handle the query.  
> >>But the memory savings is well worth the extra level of filtering.
> > 
> > 
> > If the SA cache does this, it seems it would be pretty simple to return
> > this info in an attribute to the client so the client would know when to
> > go to the cache/replica and when to go direct to the SA in the case
> > where only certain queries are supported. Wouldn't this be advantageous
> > when the replica doesn't support all queries ?

> I think we want to make the client totally unaware of the 
> existence of the cache.

Perhaps. I would express this differently: the client should be as unaware
as possible (muxing on a per-attribute basis to direct the request seems
reasonably straightforward).

> So the cache itself will simply forward the message (maybe changing TID).

Yes, the transformation at the cache should be as trivial as possible.

I would like to eliminate the doubling up of packets when unnecessary
(for requests that the cache does not support rather than ones it does
support but does not have the information).

-- Hal

> > -- Hal
> > 
> > ___
> > openib-general mailing list
> > openib-general@openib.org
> > http://openib.org/mailman/listinfo/openib-general
> > 
> > To unsubscribe, please visit 
> > http://openib.org/mailman/listinfo/openib-general
> 

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Sean Hefty

Eitan Zahavi wrote:
Can someone familiar with the opensm code tell me how difficult it 
would be to extract out the code that tracks the subnet data and 
responds to queries?


I guess you mean the code that is answering PathRecord queries?


Yes - that along with answering other queries.


It is possible to extract the "SMDB" objects and duplicate that database.
I am not sure it is such a good idea. What if the SM is not OpenSM?


I was thinking in terms of code re-use, and not in terms of which SM was 
running.  Interfacing to the SM would be through standard queries.


- Sean
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Eitan Zahavi

I agree with Todd: a key is to keep the client unaware of the mux's existence.
So the same client can be run on a system without the cache.

Hal Rosenstock wrote:

On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote:


From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote:

This of course implies the "SA Mux" must analyze more than just 
the attribute ID to determine if the replica can handle the query.  
But the memory savings is well worth the extra level of filtering.


If the SA cache does this, it seems it would be pretty simple 
to return
this info in an attribute to the client so the client would 
know when to

go to the cache/replica and when to go direct to the SA in the case
where only certain queries are supported. Wouldn't this be 
advantageous

when the replica doesn't support all queries ?


Why put the burden on the application?  Give the query to the Mux.



That's what I'm suggesting. Rather than a binary switch mux, a more
granular one which determines how to route the outgoing SA request.


 With an optional flag indicating a preferred "routing" (choices of: to SA, 
to replica, let Mux decide).  Then let it decide.  As you suggest it may 
be simplest to let the Mux try the replica and on failure fallback 
to the SA transparent to the app (sort of the way SDP intercepts 
socket ops and falls back to TCP/IP when SDP isn't appropriate).



It depends on whether the replica/cache forwards unsupported requests on
or responds with not supported back to the client as to how this is
handled. Sean was proposing the forward on model and a binary switch at
the client. I think this is more granular and can be mux'd only with the
knowledge of what a replica/cache supports (not sure about dealing with
different replica/caches supporting a different set of queries; need to
think more on how the caches are located, etc.). You are mentioning a
third model here.

-- Hal

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Eitan Zahavi

Sean Hefty wrote:

Eitan Zahavi wrote:

So if the cache is on another host - a new kind of MAD will have to be sent 
on behalf of the original request?



I was thinking more in terms of redirection.


Today none of the clients support redirection. It would take significant 
duplicated effort on
the client front to support that.

In IB, QoS properties are mainly the PathRecord parameters: SL, Rate, 
MTU, PathBits (LMC bits).
So where traditionally we had a PathRecord requested for each Src->Dst port 
pair, now we will need to track at least
Src->Dst * #QoS-levels (a non-optimal implementation will require 
even more: #Src->Dst * #Clients * #Servers * #Services).



I understand you now.

Can someone familiar with the opensm code tell me how difficult it would 
be to extract out the code that tracks the subnet data and responds to 
queries?

I guess you mean the code that is answering PathRecord queries?
It is possible to extract the "SMDB" objects and duplicate that database.
I am not sure it is such a good idea. What if the SM is not OpenSM?


- Sean


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Eitan Zahavi

Hal Rosenstock wrote:

On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote:

This of course implies the "SA Mux" must analyze more than just 
the attribute ID to determine if the replica can handle the query.  
But the memory savings is well worth the extra level of filtering.



If the SA cache does this, it seems it would be pretty simple to return
this info in an attribute to the client so the client would know when to
go to the cache/replica and when to go direct to the SA in the case
where only certain queries are supported. Wouldn't this be advantageous
when the replica doesn't support all queries ?

I think we want to make the client totally unaware of the existence of the 
cache.
So the cache itself will simply forward the message (maybe changing TID).
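
A rough sketch of that forwarding in C - all names and helpers here are 
hypothetical, not the actual openib core API; the point is only that the cache 
remaps the TID on the way out so the SA's response can be routed back to the 
original requester:

    /* Hypothetical sketch: forward a query the cache cannot satisfy to
     * the real SA, remapping the TID so the response can be matched back
     * to the original requester.  tid_map_*(), send_mad_*() and the
     * struct types are assumed helpers, not real openib interfaces. */
    struct tid_map_entry {
        uint64_t orig_tid;            /* TID the client used */
        uint64_t cache_tid;           /* TID we use toward the SA */
        struct client_addr from;      /* where to send the response back */
    };

    static uint64_t next_tid;

    void forward_to_sa(struct sa_mad *mad, struct client_addr *from)
    {
        struct tid_map_entry *e = tid_map_alloc();

        e->orig_tid  = mad->hdr.tid;
        e->cache_tid = ++next_tid;
        e->from      = *from;
        tid_map_insert(e);

        mad->hdr.tid = e->cache_tid;  /* the only field we rewrite */
        send_mad_to_sa(mad);
    }

    void on_sa_response(struct sa_mad *mad)
    {
        struct tid_map_entry *e = tid_map_remove(mad->hdr.tid);

        if (!e)
            return;                   /* not one of ours; drop it */
        mad->hdr.tid = e->orig_tid;   /* restore the client's TID */
        send_mad_to_client(mad, &e->from);
        tid_map_free(e);
    }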


-- Hal

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Hal Rosenstock
On Fri, 2006-01-06 at 13:59, Sean Hefty wrote:
> Eitan Zahavi wrote:
> > So if the cache  is on another host - a new kind of MAD will have to be 
> > sent on behalf of
> > the original request?
> 
> I was thinking more in terms of redirection.
> 
> > In IB, QoS properties are mainly the PathRecord parameters: SL, Rate, 
> > MTU, PathBits (LMC bits).
> > So where traditionally we had a PathRecord requested for each Src->Dst port 
> > pair, now we will need to track at least
> > Src->Dst * #QoS-levels (a non-optimal implementation will require even 
> > more: #Src->Dst * #Clients * #Servers * #Services).
> 
> I understand you now.

I'm not sure at what granularity this needs to be tracked.

> Can someone familiar with the opensm code tell me how difficult it would be 
> to 
> extract out the code that tracks the subnet data and responds to queries?

Although I don't think that is difficult, IMO it is more a matter of
whether you want to buy into the architecture with the component and
vendor libraries. I can help with this if this is the direction chosen.
I would make this another build option.

The other question is how this would be changed so that when the data is
not present the real SA is queried.

-- Hal

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Sean Hefty

Eitan Zahavi wrote:
So if the cache is on another host - a new kind of MAD will have to be sent 
on behalf of the original request?


I was thinking more in terms of redirection.

In IB, QoS properties are mainly the PathRecord parameters: SL, Rate, 
MTU, PathBits (LMC bits).
So where traditionally we had a PathRecord requested for each Src->Dst port 
pair, now we will need to track at least
Src->Dst * #QoS-levels (a non-optimal implementation will require even 
more: #Src->Dst * #Clients * #Servers * #Services).


I understand you now.

Can someone familiar with the opensm code tell me how difficult it would be to 
extract out the code that tracks the subnet data and responds to queries?


- Sean
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Hal Rosenstock
Hi Eitan,

[snip...]

> >>  So if a new client wants to connect to another node a new PathRecord
> >>  query will not need to be sent to the SA. However, recent work on QoS has
> >>pointed out
> >>  that under some QoS schemes PathRecord should not be shared by different
> >>clients
> > 
> > 
> > I'm not sure that QoS handling is the responsibility of the cache.  The 
> > module
> > requesting the path records should probably deal with this.
> In IB, QoS properties are mainly the PathRecord parameters: SL, Rate, MTU, 
> PathBits (LMC bits).
> So where traditionally we had a PathRecord requested for each Src->Dst port 
> pair, now we will need to track at least
> Src->Dst * #QoS-levels (a non-optimal implementation will require even more: 
> #Src->Dst * #Clients * #Servers * #Services).

Perhaps QoS requests (I'm referring to those with the new proposed key)
should not be cached, as I think this may end up with the cache needing to know
the path record policies. I would propose deferring this aspect until
the new QoS work is a little firmer and the cache direction in OpenIB is
also a little firmer (e.g. QoS = phase 2 or beyond of this work).

[snip...]

-- Hal

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-06 Thread Hal Rosenstock
On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote:
> > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> > On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote:
> > > This of course implies the "SA Mux" must analyze more than just 
> > > the attribute ID to determine if the replica can handle the query.  
> > > But the memory savings is well worth the extra level of filtering.
> > 
> > If the SA cache does this, it seems it would be pretty simple 
> > to return
> > this info in an attribute to the client so the client would 
> > know when to
> > go to the cache/replica and when to go direct to the SA in the case
> > where only certain queries are supported. Wouldn't this be 
> > advantageous
> > when the replica doesn't support all queries ?
> 
> Why put the burden on the application?  Give the query to the Mux.

That's what I'm suggesting. Rather than a binary switch mux, a more
granular one which determines how to route the outgoing SA request.

>   With an optional flag indicating a preferred "routing" (choices of: to SA, 
> to replica, let Mux decide).  Then let it decide.  As you suggest it may 
> be simplest to let the Mux try the replica and on failure fallback 
> to the SA transparent to the app (sort of the way SDP intercepts 
> socket ops and falls back to TCP/IP when SDP isn't appropriate).

It depends on whether the replica/cache forwards unsupported requests on
or responds with not supported back to the client as to how this is
handled. Sean was proposing the forward on model and a binary switch at
the client. I think this is more granular and can be mux'd only with the
knowledge of what a replica/cache supports (not sure about dealing with
different replica/caches supporting a different set of queries; need to
think more on how the caches are located, etc.). You are mentioning a
third model here.

-- Hal

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-06 Thread Rimmer, Todd
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote:
> > This of course implies the "SA Mux" must analyze more than just 
> > the attribute ID to determine if the replica can handle the query.  
> > But the memory savings is well worth the extra level of filtering.
> 
> If the SA cache does this, it seems it would be pretty simple 
> to return
> this info in an attribute to the client so the client would 
> know when to
> go to the cache/replica and when to go direct to the SA in the case
> where only certain queries are supported. Wouldn't this be 
> advantageous
> when the replica doesn't support all queries ?

Why put the burden on the application?  Give the query to the Mux.  With an 
optional flag indicating a preferred "routing" (choices of: to SA, to replica, 
let Mux decide).  Then let it decide.  As you suggest it may be simplest to let 
the Mux try the replica and on failure fall back to the SA transparently to the 
app (sort of the way SDP intercepts socket ops and falls back to TCP/IP when 
SDP isn't appropriate).
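
A minimal sketch of that decision, with hypothetical names (it assumes the mux 
can ask the replica which queries it supports):

    /* Hypothetical sketch of the mux routing decision described above. */
    enum sa_route { ROUTE_AUTO, ROUTE_TO_SA, ROUTE_TO_REPLICA };

    int sa_mux_send(struct sa_query *q, enum sa_route pref)
    {
        if (pref == ROUTE_TO_SA || !replica_supports(q))
            return send_to_sa(q);

        if (pref == ROUTE_TO_REPLICA)
            return send_to_replica(q);

        /* ROUTE_AUTO: try the replica first, fall back to the SA
         * transparently on failure (the SDP-style fallback). */
        if (send_to_replica(q) == 0)
            return 0;
        return send_to_sa(q);
    }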

Todd R.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-06 Thread Hal Rosenstock
On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote:
> This of course implies the "SA Mux" must analyze more than just 
> the attribute ID to determine if the replica can handle the query.  
> But the memory savings is well worth the extra level of filtering.

If the SA cache does this, it seems it would be pretty simple to return
this info in an attribute to the client so the client would know when to
go to the cache/replica and when to go direct to the SA in the case
where only certain queries are supported. Wouldn't this be advantageous
when the replica doesn't support all queries ?

-- Hal

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-06 Thread Rimmer, Todd
> From: Eitan Zahavi [mailto:[EMAIL PROTECTED]
> Hi Sean, Todd,
> 
> Although I like the "replica" idea for its "query" performance boost - I 
> suspect it will actually not scale for very large networks: each node having 
> to query for the entire database would cause N^2 load on the SA.
> After any change (and changes happen with higher probability on large 
> networks) the SA will need to send each Report to N targets.
> 
> We already have some bad experience with SA query issues on large clusters, 
> like the one reported by Roland:
> "searching for SRP targets using PortInfo capability mask".
> 
Our experience has been the exact opposite.
While there is an initial load on the SA to populate the replica, we have used 
various techniques to reduce it, such as backing off when the SA reports Busy 
and having a random time offset for the start of the query.  The boost occurs when a 
new application starts, such as an MPI using the SA/CM to establish connections 
as per the IBTA spec.  A 1000 process MPI job would have each process make 999 
queries to the SA at job startup time.  This causes a burst of 999 sets of 
SA queries (most will involve both Node Record and Path Record queries, so it 
will really be 2x this amount), BEFORE the MPI job can actually start.

As OpenIB moves forward to implement QoS and other features, MPI will have to 
use the SA to get its path records.  If you study MVAPICH at present, it merely 
exchanges LIDs between nodes and hardcodes all the other QoS parameters (or, via 
environment variables, uses the same value for all processes).  In a true QoS 
and congestion management environment it will instead have to use the CM/SA.

We have been using this replica technique quite successfully for 2-3 years now. 
 Our MPI has used the SA/CM for connection establishment for just as long.

As it was pointed out, most fabrics will be quite stable.  Hence having a 
replica and paying the cost of the SA queries once will be much more efficient 
than paying that cost on every application startup.

Todd Rimmer
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Eitan Zahavi

Hi Sean, Todd,

Although I like the "replica" idea for its "query" performance boost - I 
suspect it will actually not scale for very large networks: each node having to 
query for the entire database would cause N^2 load on the SA.
After any change (and changes happen with higher probability on large networks) 
the SA will need to send each Report to N targets.

We already have some bad experience with SA query issues on large clusters, 
like the one reported by Roland:
"searching for SRP targets using PortInfo capability mask".

Eitan

Sean Hefty wrote:

- It is implemented in kernel mode
- while user mode may help during initial debug, it will be important for
kernel mode ULPs such as SRP, IPoIB and SDP to also make use of these records
these records



Your kernel footprint is smaller than I expected, which is good.  Note that with
a MAD interface, kernel modules would still have access to any cached data.  I
also wanted to stick with usermode to allow saving the cache to disk, so that it
would be available immediately after a reboot.  (My assumption being that
changes to the network topology would be rare, so we could optimize around a
stable network design.)

As a related topic, there will be a separate SA client interface defined that
will generate SA query MADs for the user.

- Sean


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-06 Thread Eitan Zahavi

Hi Sean,

Please see below.

Sean Hefty wrote:

* Regarding the sentence: "Clients would send their queries to the sa_cache
instead of the SA"
 I would propose that a "SA MAD send switch" be implemented in the core: such
 a switch will enable plugging in the SA cache (I would prefer calling it the
 SA local agent due to its extended functionality). Once plugged in, this
 "SA local agent" should be forwarded all outgoing SA queries. Once it handles
 the MAD it should be able to inject the response through the core "SA MAD
 send switch" as if it arrived from the wire.



This was my thought as well.  I hesitated to refer to the cache as a local
agent, since that's an implementation detail.  I want to allow the possibility
for the cache to reside on another system.  For the initial implementation, the
cache would be local however.

So if the cache is on another host - a new kind of MAD will have to be sent on 
behalf of the original request?




Functional requirements:
* It is clear that the first SA query to cache is PathRecord.



This will be the first cached query in the initial check-in.



 So if a new client wants to connect to another node a new PathRecord
 query will not need to be sent to the SA. However, recent work on QoS has
pointed out
 that under some QoS schemes PathRecord should not be shared by different
clients



I'm not sure that QoS handling is the responsibility of the cache.  The module
requesting the path records should probably deal with this.

In IB, QoS properties are mainly the PathRecord parameters: SL, Rate, MTU, 
PathBits (LMC bits).
So where traditionally we had a PathRecord requested for each Src->Dst port 
pair, now we will need to track at least
Src->Dst * #QoS-levels (a non-optimal implementation will require even more: 
#Src->Dst * #Clients * #Servers * #Services).
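
In cache terms that means the lookup key grows beyond the SGID/DGID pair - a 
sketch, with illustrative field names only:

    /* Hypothetical sketch of an extended path-record cache key. */
    struct pr_cache_key {
        union ib_gid sgid;       /* source port */
        union ib_gid dgid;       /* destination port */
        uint8_t      qos_level;  /* selects the SL/Rate/MTU/PathBits class */
        /* a non-optimal scheme might also need client/server/service ids */
    };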





* Forgive me for bringing the following issue - over and over to the group:
 Multicast Join/Leave should be reference counted. The "SA local agent" could
be
 the right place for doing this kind of reference counting (actually if it
does that
 it probably needs to be located in the Kernel - to enable cleanup after
killed processes).



I agree that this is a problem, but my preference would be for a dedicated
kernel module to handle multicast join/leave requests.

Since we already sniff the SA queries, it makes sense to have the same code 
also handle other functionality that requires sniffing the SA requests.
As Hal points out, this involves ServiceRecord, Multicast Join/Leave and 
InformInfo requests.
Multicast Join/Leave actually behaves like a cache: if a "join" to the same 
MGID already took place (with no leave yet) then there is no need to send the 
new request to the SA.
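
A sketch of that reference counting (illustrative names; the mcast_* helpers 
are assumed):

    /* Hypothetical sketch: reference-counted multicast membership keyed
     * by MGID.  Only the first join and the last leave generate SA MADs. */
    struct mcast_ref {
        union ib_gid mgid;
        int          refcnt;
    };

    int agent_join(union ib_gid *mgid)
    {
        struct mcast_ref *r = mcast_lookup(mgid);

        if (r) {
            r->refcnt++;              /* already a member: no SA request */
            return 0;
        }
        r = mcast_insert(mgid);
        r->refcnt = 1;
        return send_sa_join(mgid);    /* first join goes to the SA */
    }

    void agent_leave(union ib_gid *mgid)
    {
        struct mcast_ref *r = mcast_lookup(mgid);

        if (r && --r->refcnt == 0) {
            mcast_remove(r);
            send_sa_leave(mgid);      /* last leave goes to the SA */
        }
    }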


- Sean



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-05 Thread Sean Hefty
>>  Note that with
>> a MAD interface, kernel modules would still have access to
>> any cached data.  I
>> also wanted to stick with usermode to allow saving the cache
>> to disk, so that it
>> would be available immediately after a reboot.  (My
>> assumption being that
>> changes to the network topology would be rare, so we could
>> optimize around a
>> stable network design.)
>It is risky to assume that PathRecords would stay the same across a node
>reboot.  It is very likely that the SM could assign different LIDs, or, if the
>node is down for an extended period, other things in the fabric could have
>significantly changed.

OpenSM currently maintains LIDs between system reboots, which I believe is
desirable for fast fabric bring-up and a feature any SM should have.  In any
case, a local LID change is trivial to detect and can easily be used to
invalidate the entire cache.  Likewise, the cache could automatically be
flushed if not updated for some specified time period, or if some other defined
event occurred - such as a GUID change on the local HCA.

Overall, I think that the risk here is low.
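
A sketch of those triggers (hypothetical names; the port_* helpers are 
assumed):

    /* Hypothetical sketch: flush the whole cache on any of the cheap,
     * locally detectable events mentioned above. */
    void sa_cache_validate(struct sa_cache *c)
    {
        if (c->local_lid  != port_get_lid(c->port) ||   /* local LID changed */
            c->local_guid != port_get_guid(c->port) ||  /* local GUID changed */
            now() - c->last_update > c->max_age)        /* cache went stale */
            sa_cache_flush(c);                          /* invalidate it all */
    }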

>> As a related topic, there will be a separate SA client
>> interface defined that
>> will generate SA query MADs for the user.
>Given the complexity of the RMPP protocol and the subtle bugs which everyone
>has encountered while implementing and debugging it (timeouts, retries, abort,
>window size management, class header offset, etc), it would be best to limit
>the number of copies of this protocol within the system.  Keeping the RMPP
>details hidden just in the kernel would be best.  An analogy might be the way
>sockets hides the details of the TCP/IP protocol from applications.  While I'm
>not aware of any changes in the works, we all remember the significant changes
>which occurred between IBTA 1.0 and IBTA 1.1 in the RMPP area.  If any similar
>significant revision to the protocol occurred it would be best to have it all
>implemented in just one place.

RMPP is implemented by the MAD layer, and is hidden to any clients using the MAD
services.  There will still only be a single RMPP implementation in the stack.

- Sean


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-05 Thread Rimmer, Todd
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
> Your kernel footprint is smaller than I expected, which is 
> good.
The key is that while there are O(N^2) path records in a fabric, only O(N) are 
of interest to a given node.  Hence if you only replicate entries where this 
node is the source, the size of the replica is significantly smaller.  If 
someone is curious and wants to see all path records in the system, that would 
be a query you would let go through to the SA (and it would be a very 
infrequent query, since no real-world app beyond fabric debug tools would care 
about the paths which don't involve the node making the query).

This of course implies the "SA Mux" must analyze more than just the attribute 
ID to determine if the replica can handle the query.  But the memory savings 
are well worth the extra level of filtering.
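
A sketch of that filtering, with hypothetical names (the real attribute IDs 
and component masks live in the SA headers):

    /* Hypothetical sketch: route a query to the replica only if the
     * replica can actually answer it.  For PathRecord that means the
     * query must be constrained to a local source port. */
    bool replica_can_handle(struct sa_query *q)
    {
        if (q->attr_id != ATTR_PATH_RECORD)
            return replica_has_attr(q->attr_id);

        /* the SGID component must be specified and must be one of ours */
        return (q->comp_mask & COMP_SGID) && gid_is_local(&q->path.sgid);
    }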

>  Note that with
> a MAD interface, kernel modules would still have access to 
> any cached data.  I
> also wanted to stick with usermode to allow saving the cache 
> to disk, so that it
> would be available immediately after a reboot.  (My 
> assumption being that
> changes to the network topology would be rare, so we could 
> optimize around a
> stable network design.)
It is risky to assume that PathRecords would stay the same across a node 
reboot.  It is very likely that the SM could assign different LIDs, or, if the 
node is down for an extended period, other things in the fabric could have 
significantly changed.

> 
> As a related topic, there will be a separate SA client 
> interface defined that
> will generate SA query MADs for the user.
Given the complexity of the RMPP protocol and the subtle bugs which everyone 
has encountered while implementing and debugging it (timeouts, retries, abort, 
window size management, class header offset, etc), it would be best to limit 
the number of copies of this protocol within the system.  Keeping the RMPP 
details hidden just in the kernel would be best.  An analogy might be the way 
sockets hides the details of the TCP/IP protocol from applications.  While I'm 
not aware of any changes in the works, we all remember the significant changes 
which occurred between IBTA 1.0 and IBTA 1.1 in the RMPP area.  If any similar 
significant revision to the protocol occurred it would be best to have it all 
implemented in just one place.

my $0.02

Todd Rimmer
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-05 Thread Hal Rosenstock
On Thu, 2006-01-05 at 18:24, Sean Hefty wrote:
> >For the precise language, see C15-0-1.24 p. 923 IBA 1.2:
> >
> >
> >C15-0.1.24: It shall be possible to determine the location of SA from
> >any
> >endport by sending a GMP to QP1 (the GSI) of the node identified by the
> >endport's PortInfo:MasterSMLID, using in the GMP the base LID of the
> >endport as the SLID, the endport's PortInfo:MasterSMSL as the SL, the
>well-known Q_Key (0x8001_0000), and whichever of the default P_Keys
>(0xFFFF or 0x7FFF) was placed in the endport's P_Key Table by the SM
> >(Table 183 Initialization on page 868).
> >
> >so I overstated it a bit but this needs to be obeyed.
> 
> Could each of the requests be redirected to different nodes?

Yes.

>   I can envision how
> the sa_cache could eventually build towards a distributed SA.

I think a distributed SA is more like it rather than an SA cache.

-- Hal

> >C15-0.1.25: A SubnAdmGet(ClassPortInfo) sent according to C15-
> >0.1.24: shall return all information needed to communicate with Subnet
> >Administration. Alternatively, valid GMPs for SA sent according to C15-
> >0.1.24: shall either return redirection responses providing all such
> >information, or shall be normally processed by SA.
> 
> Thanks for the references.
> 
> - Sean
> 

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-05 Thread Hal Rosenstock
On Thu, 2006-01-05 at 17:04, Sean Hefty wrote:
> >> I hadn't fully figured this out yet.  I'm not sure if another MAD class is
> >> needed or not.  My goal is to implement this as transparent to the
> >application
> >> as possible without violating the spec, perhaps appearing as an SA on a
> >> different LID.
> >
> >The LID for the (real) SA is determined from PortInfo:MasterSMLID so I
> >don't see how this could be done that way.
> 
> I didn't think that it was a requirement that the SA share the same LID as the
> SM.

For the precise language, see C15-0-1.24 p. 923 IBA 1.2:


C15-0.1.24: It shall be possible to determine the location of SA from
any
endport by sending a GMP to QP1 (the GSI) of the node identified by the
endport's PortInfo:MasterSMLID, using in the GMP the base LID of the
endport as the SLID, the endport's PortInfo:MasterSMSL as the SL, the
well-known Q_Key (0x8001_0000), and whichever of the default P_Keys
(0xFFFF or 0x7FFF) was placed in the endport's P_Key Table by the SM
(Table 183 Initialization on page 868).

so I overstated it a bit but this needs to be obeyed.

Also,

C15-0.1.25: A SubnAdmGet(ClassPortInfo) sent according to C15-
0.1.24: shall return all information needed to communicate with Subnet
Administration. Alternatively, valid GMPs for SA sent according to C15-
0.1.24: shall either return redirection responses providing all such
information, or shall be normally processed by SA.

-- Hal


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-05 Thread Sean Hefty
>- It is implemented in kernel mode
>   - while user mode may help during initial debug, it will be important for
>   kernel mode ULPs such as SRP, IPoIB and SDP to also make use of these records

Your kernel footprint is smaller than I expected, which is good.  Note that with
a MAD interface, kernel modules would still have access to any cached data.  I
also wanted to stick with usermode to allow saving the cache to disk, so that it
would be available immediately after a reboot.  (My assumption being that
changes to the network topology would be rare, so we could optimize around a
stable network design.)

As a related topic, there will be a separate SA client interface defined that
will generate SA query MADs for the user.

- Sean


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-05 Thread Hal Rosenstock
On Thu, 2006-01-05 at 16:51, Sean Hefty wrote:
> I agree that this is a problem, but my preference would be for a dedicated
> kernel module to handle multicast join/leave requests.

In addition to multicast, it's also service records and event
subscriptions.

-- Hal


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-05 Thread Sean Hefty
>Sean, This is great.  This is a feature which I find near and dear and is very
>important to large fabric scalability.  If you look in contrib in the infinicon
>area, you will see a version of a SA replica which we implemented in the
>linux_discovery tree.  The version in SVN is a little dated, but has the major
>features and capabilities.  If you find it useful I could provide a more
>updated version of that component for your reference.

Thanks - I will look at the version that is there.

- Sean


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-05 Thread Sean Hefty
>> I hadn't fully figured this out yet.  I'm not sure if another MAD class is
>> needed or not.  My goal is to implement this as transparent to the
>application
>> as possible without violating the spec, perhaps appearing as an SA on a
>> different LID.
>
>The LID for the (real) SA is determined from PortInfo:MasterSMLID so I
>don't see how this could be done that way.

I didn't think that it was a requirement that the SA share the same LID as the
SM.

- Sean

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-05 Thread Sean Hefty
>* Regarding the sentence: "Clients would send their queries to the sa_cache
>instead of the SA"
>   I would propose that a "SA MAD send switch" be implemented in the core: such
>   a switch will enable plugging in the SA cache (I would prefer calling it the
>   SA local agent due to its extended functionality). Once plugged in, this
>   "SA local agent" should be forwarded all outgoing SA queries. Once it handles
>   the MAD it should be able to inject the response through the core "SA MAD
>   send switch" as if it arrived from the wire.

This was my thought as well.  I hesitated to refer to the cache as a local
agent, since that's an implementation detail.  I want to allow the possibility
for the cache to reside on another system.  For the initial implementation, the
cache would be local however.

>Functional requirements:
>* It is clear that the first SA query to cache is PathRecord.

This will be the first cached query in the initial check-in.

>   So if a new client wants to connect to another node a new PathRecord
>   query will not need to be sent to the SA. However, recent work on QoS has
>pointed out
>   that under some QoS schemes PathRecord should not be shared by different
>clients

I'm not sure that QoS handling is the responsibility of the cache.  The module
requesting the path records should probably deal with this.

>* Forgive me for bringing the following issue - over and over to the group:
>   Multicast Join/Leave should be reference counted. The "SA local agent" could
>be
>   the right place for doing this kind of reference counting (actually if it
>does that
>   it probably needs to be located in the Kernel - to enable cleanup after
>killed processes).

I agree that this is a problem, but my preference would be for a dedicated
kernel module to handle multicast join/leave requests.

- Sean


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] SA cache design

2006-01-05 Thread Rimmer, Todd
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
> 
> I've been given the task of trying to come up with an 
> implementation for an SA 
> cache.  The intent is to increase the scalability and 
> performance of the openib 
> stack.  My current thoughts on the implementation are below.  
> Any feedback is 
> welcome.

Sean, This is great.  This is a feature which I find near and dear and is very 
important to large fabric scalability.  If you look in contrib in the infinicon 
area, you will see a version of a SA replica which we implemented in the 
linux_discovery tree.  The version in SVN is a little dated, but has the major 
features and capabilities.  If you find it useful I could provide a more 
updated version of that component for your reference.

Some features of it (which you should consider or possibly use as reference 
code):
- It maintains a full replica of:
- All Node Records
- Path Records relevant to this Node (where this node is Source)
- Device Management Agent records for IOUs, IOCs and Service Records
- even for a large cluster, the footprint of the above will be < 1MB

- It is implemented in kernel mode
- while user mode may help during initial debug, it will be important 
for
kernel mode ULPs such as SRP, IPoIB and SDP to also make use of 
these records

- It is in fact a replica, not a cache.  It maintains an up-to-date replica using 
the following techniques (sketched in code after this list):
- registers for SA GID in/out of service notices
- such notices, when received, trigger a query of information about that node 
only
- schedules a periodic full SA query
- if notices are successfully registered for, the query is at a slow pace (once 
every 10 minutes is the default, but it's configurable)
- if notices are not successfully registered for, the query is at a faster pace 
(once a minute, but it's configurable)
- since notices are unreliable, the periodic sweep is needed to cover for lost 
notices; however, the SA should resend notices which are not responded to

- In addition for CAs it performs IOU, IOC and Service record queries and 
replicates them
- this allows for very fast access to IOU/IOC/Service record info by 
drivers like SRP
- hence allowing for faster reconnection and failure recovery handling

- It can handle SA outages and still respond to queries while the SA is down, 
the SA is slow, or while the synchronization process is being performed (e.g. it 
does all its queries to a temporary replica, then updates the main replica; 
hence if the queries fail or take a long time, the main replica is still 
available and reasonably accurate).

- I like the idea of using the same API for SA queries and allowing an SA mux 
to choose to query the replica or the actual SA.  Hence if later versions 
choose to extend what is maintained in the replica, it would be transparent to 
applications
- The API could allow for a flag to force a query against the replica 
or against the actual SA, with the default being to allow the "SA mux" to 
select which to use
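
A sketch of the maintenance logic described in the list above (hypothetical 
names; the pacing values are the defaults mentioned there):

    /* Hypothetical sketch of the replica's update pacing: targeted
     * refreshes on notices, plus a periodic full sweep whose pace
     * depends on whether notice registration succeeded. */
    #define SLOW_SWEEP_SECS (10 * 60)  /* notices registered */
    #define FAST_SWEEP_SECS 60         /* notices not registered */

    void replica_sweep_loop(struct replica *r)
    {
        for (;;) {
            r->notices_ok = (register_gid_notices(r) == 0);
            full_sa_query(r->temp);        /* query into a temp replica... */
            swap_replicas(r);              /* ...so the main one stays valid */
            sleep(r->notices_ok ? SLOW_SWEEP_SECS : FAST_SWEEP_SECS);
        }
    }

    void on_gid_notice(struct replica *r, union ib_gid *gid, bool in_service)
    {
        if (in_service)
            query_node_info(r, gid);       /* refresh that node only */
        else
            drop_node_info(r, gid);
    }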


> 
> To keep the design as flexible as possible, my plan is to 
> implement the cache in 
> userspace.  The interface to the cache would be via MADs.  
> Clients would send 
> their queries to the sa_cache instead of the SA itself.  The 
> format of the MADs 
> would be essentially identical to those used to query the SA 
> itself.  Response 
> MADs would contain any requested information.  If the cache 
> could not satisfy a 
> request, the sa_cache would query the SA, update its cache, 
> then return a reply.

- in our stack we had a separate more advanced SA query API (referred to as the 
Subnet Driver API).  This has evolved significantly since the old Intel 
IbAccess days, but still has similarities.  It handled all the details of the 
query including retries (as specified by the caller), timeouts and even 
multi-level queries (get path records based on Node GUIDs, etc).  It also 
handled the RMPP aspects and hid the intermediate RMPP headers and control 
protocol.  You may want to consider defining and using such an API instead of 
MADs, lest the user of the SA replica need to also implement RMPP itself.  
Given such an API the implementation could choose to query the actual SA or the 
replica and hide the RMPP details in the SA query case.
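
To illustrate, a hypothetical signature for such an API (not the Subnet Driver 
API itself, just a sketch of the shape it might take):

    /* Hypothetical sketch: an SA query call that hides MADs and RMPP.
     * The caller gets fully reassembled records, never raw segments. */
    struct sa_query_opts {
        unsigned retries;     /* caller-specified retry count */
        unsigned timeout_ms;  /* per-attempt timeout */
        int      force_sa;    /* force the real SA instead of the replica */
    };

    int sa_get_paths_by_node_guid(uint64_t node_guid,
                                  const struct sa_query_opts *opts,
                                  struct path_record **recs,
                                  int *num_recs);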

Todd Rimmer
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-05 Thread Hal Rosenstock
Hi Eitan,

On Thu, 2006-01-05 at 07:27, Eitan Zahavi wrote:
> Hi Sean,
> 
> This is great initiative - tackling an important issue.
> I am glad you took this on.
> 
> Please see below.
> 
> Sean Hefty wrote:
> > I've been given the task of trying to come up with an implementation for 
> > an SA cache.  The intent is to increase the scalability and performance 
> > of the openib stack.  My current thoughts on the implementation are 
> > below.  Any feedback is welcome.
> > 
> > To keep the design as flexible as possible, my plan is to implement the 
> > cache in userspace.  The interface to the cache would be via MADs.  
> > Clients would send their queries to the sa_cache instead of the SA 
> > itself.  The format of the MADs would be essentially identical to those 
> > used to query the SA itself.  Response MADs would contain any requested 
> > information.  If the cache could not satisfy a request, the sa_cache 
> > would query the SA, update its cache, then return a reply.
> * I think the idea of using MADs to interface with the cache is very good.
> * User space implementation:
>This also might be a good tradeoff between coding and debugging versus the
>impact on the number of connections per second. I hope the impact on
>performance will not be too big. Maybe we can take the path of implementing
>in user space and, if the performance penalty is too high, port to kernel.
> * Regarding the sentence: "Clients would send their queries to the sa_cache 
> instead of the SA"
>   I would propose that a "SA MAD send switch" be implemented in the core: such 
>   a switch will enable plugging in the SA cache (I would prefer calling it the 
>   SA local agent due to its extended functionality). Once plugged in, this 
>   "SA local agent" should be forwarded all outgoing SA queries. Once it handles 
>   the MAD it should be able to inject the response through the core "SA MAD 
>   send switch" as if it arrived from the wire.
> > 
> > The benefits that I see with this approach are:
> > 
> > + Clients would only need to send requests to the sa_cache.
> > + The sa_cache can be implemented in stages.  Requests that it cannot 
> > handle would just be forwarded to the SA.
> > + The sa_cache could be implemented on each host, or a select number of 
> > hosts.
> > + The interface to the sa_cache is similar to that used by the SA.
> > + The cache would use virtual memory and could be saved to disk.
> > 
> > Some drawbacks specific to this method are:
> > 
> > - The MAD interface will result in additional data copies and userspace 
> > to kernel transitions for clients residing on the local system.
> > - Clients require a mechanism to locate the sa_cache, or need to make 
> > assumptions about its location.
> The proposal for "SA MAD send switch" in the core will resolve this issue.
> No client change will be required as all MADs are sent through the core which 
> will
> redirect them to the SA agent ...

I see this as more granular than a complete switch for the entire class.
More like on a per-attribute basis.

> Functional requirements:
> * It is clear that the first SA query to cache is PathRecord.
>So if a new client wants to connect to another node a new PathRecord
>query will not need to be sent to the SA. However, recent work on QoS has 
> pointed out
>that under some QoS schemes PathRecord should not be shared by different 
> clients
>or even connections. There are several ways to make such a QoS scheme scale.
>Since this is a different discussion topic - I only bring this up so that
>we take into account that caching might also need to be done by a complex key
>(not just SRC/DST ...)

Per the QoS direction, this complex key is indeed part of the enhanced
PathRecord, right ?

> * Forgive me for bringing the following issue - over and over to the group:
>Multicast Join/Leave should be reference counted. The "SA local agent" 
> could be
>the right place for doing this kind of reference counting (actually if it 
> does that
>it probably needs to be located in the Kernel - to enable cleanup after 
> killed processes).

The cache itself may need another level of reference counting (even if
invalidation is broadcast).

> * Similarly - "Client re-registration" could be made transparent to clients.
> 
> Cache Invalidation:
> Several discussions about PathRecord invalidation were spawned in the past.
> IMO, it is enough to be notified about death of local processes, remote port 
> availability (trap 64/65) and
> multicast group availability (trap 66/67) in order to invalidate SA cache 
> information.

I think that it's more complicated than this. As an example, how does
the SA cache know whether a cached path record needs to be changed based
on traps 64/65 ? It seems to me to need to be tightly tied to the SM/SA
for this.

> So each SA Agent could register to obtain this data. But that solution does 
> not nicely scale,
> as the SA needs to send notification to all nodes (but is reliable - could 
> resend until Repressed).

Re: [openib-general] SA cache design

2006-01-05 Thread Hal Rosenstock
Hi Sean,

On Tue, 2006-01-03 at 20:15, Sean Hefty wrote:
> Hal Rosenstock wrote:
> >>I've been given the task of trying to come up with an implementation for an 
> >>SA 
> >>cache.  The intent is to increase the scalability and performance of the 
> >>openib 
> >>stack.  My current thoughts on the implementation are below.  Any feedback 
> >>is 
> >>welcome.
> >>
> >>To keep the design as flexible as possible, my plan is to implement the 
> >>cache in 
> >>userspace.  The interface to the cache would be via MADs.
> > 
> > Would this be another MAD class which mimics the SA class ?
> 
> I hadn't fully figured this out yet.  I'm not sure if another MAD class is 
> needed or not.  My goal is to implement this as transparent to the 
> application 
> as possible without violating the spec, perhaps appearing as an SA on a 
> different LID.

The LID for the (real) SA is determined from PortInfo:MasterSMLID so I
don't see how this could be done that way.

> >>  Clients would send 
> >>their queries to the sa_cache instead of the SA itself.  The format of the 
> >>MADs 
> >>would be essentially identical to those used to query the SA itself.  
> >>Response 
> >>MADs would contain any requested information.  If the cache could not 
> >>satisfy a 
> >>request, the sa_cache would query the SA, update its cache, then return a 
> >>reply.
> >>
> >>The benefits that I see with this approach are:
> >>
> >>+ Clients would only need to send requests to the sa_cache.
> >>+ The sa_cache can be implemented in stages.  Requests that it cannot 
> >>handle 
> >>would just be forwarded to the SA.
> > 
> > Another option would be for the SA cache to indicate what requests its
> > handles (some MADs for this) and have the clients only go to the cache
> > for those queries (and direct to the SA for the others).
> 
> I thought about this, but this puts an additional burden on the clients. 

Sure, but how significant is this, especially if the two requests look
alike with some minor exception(s) like the class? I would think this
would make up for eliminating the extra indirection in the case where
the cache does not support the request.

> Letting the sa_cache forward the request allows it to send the requests to 
> another sa_cache, rather than directly to the SA.  There's some additional 
> flexibility that we gain in the long term design by forwarding requests.  
> (I'm 
> thinking of the possibility of having an sa_cache hierarchy.)

Sure; a hierarchial cache should scale even better.

> >>+ The sa_cache could be implemented on each host, or a select number of 
> >>hosts.
> >>+ The interface to the sa_cache is similar to that used by the SA.
> >>+ The cache would use virtual memory and could be saved to disk.
> >>
> >>Some drawbacks specific to this method are:
> >>
> >>- The MAD interface will result in additional data copies and userspace to 
> >>kernel transitions for clients residing on the local system.
> >>- Clients require a mechanism to locate the sa_cache, or need to make 
> >>assumptions about its location.
> > 
> > Would SA caching be a service ID or set of IDs ?
> 
> I'd like the sa_cache to give the appearance of being a standard SA as much 
> as 
> possible.

Yes, the closer to the real SA requests the cache requests are the
better.

>   One effect is that an sa_cache may not be able to run on the same 
> node as the actual SA,

Not sure why this would be the case.

>  but that restriction seems desirable to me.

Agreed.

> > Are there also issues around cache invalidation ?
> 
> I didn't list cache synchronization as an issue because I couldn't think of 
> any 
> problems that were specific to this design, versus being a general issue.

Yes, this is a general design issue. The whole idea of how requests are
matched to the cache (what info is kept in the cache) and the
invalidation are key. Just take PathRecords as one example.

-- Hal

> - Sean
> 

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] SA cache design

2006-01-05 Thread Eitan Zahavi

Hi Sean,

This is great initiative - tackling an important issue.
I am glad you took this on.

Please see below.

Sean Hefty wrote:
I've been given the task of trying to come up with an implementation for 
an SA cache.  The intent is to increase the scalability and performance 
of the openib stack.  My current thoughts on the implementation are 
below.  Any feedback is welcome.


To keep the design as flexible as possible, my plan is to implement the 
cache in userspace.  The interface to the cache would be via MADs.  
Clients would send their queries to the sa_cache instead of the SA 
itself.  The format of the MADs would be essentially identical to those 
used to query the SA itself.  Response MADs would contain any requested 
information.  If the cache could not satisfy a request, the sa_cache 
would query the SA, update its cache, then return a reply.

* I think the idea of using MADs to interface with the cache is very good.
* User space implementation:
  This also might be a good tradeoff between coding and debugging versus the
  impact on the number of connections per second. I hope the impact on
  performance will not be too big. Maybe we can take the path of implementing
  in user space and, if the performance penalty is too high, port to kernel.
* Regarding the sentence: "Clients would send their queries to the sa_cache 
instead of the SA"
  I would propose that a "SA MAD send switch" be implemented in the core: such a 
  switch will enable plugging in the SA cache (I would prefer calling it the SA 
  local agent due to its extended functionality). Once plugged in, this "SA 
  local agent" should be forwarded all outgoing SA queries. Once it handles the 
  MAD it should be able to inject the response through the core "SA MAD send 
  switch" as if it arrived from the wire.


The benefits that I see with this approach are:

+ Clients would only need to send requests to the sa_cache.
+ The sa_cache can be implemented in stages.  Requests that it cannot 
handle would just be forwarded to the SA.
+ The sa_cache could be implemented on each host, or a select number of 
hosts.

+ The interface to the sa_cache is similar to that used by the SA.
+ The cache would use virtual memory and could be saved to disk.

Some drawbacks specific to this method are:

- The MAD interface will result in additional data copies and userspace 
to kernel transitions for clients residing on the local system.
- Clients require a mechanism to locate the sa_cache, or need to make 
assumptions about its location.

The proposal for an "SA MAD send switch" in the core would resolve this issue.
No client changes would be required, as all MADs are sent through the core,
which would redirect them to the SA local agent.
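For illustration, a minimal sketch of such a switch in the core (all of the
names here - sa_agent_ops, sa_agent_register, sa_switch_send, mad_class,
send_to_wire - are made up for this sketch; only IB_MGMT_CLASS_SUBN_ADM is
the real SA management class value):

    #include <stdint.h>

    #define IB_MGMT_CLASS_SUBN_ADM 0x03     /* SA MAD management class */

    struct ib_mad;                          /* opaque here */
    extern uint8_t mad_class(struct ib_mad *mad);   /* hypothetical accessor */
    extern int send_to_wire(struct ib_mad *mad);    /* hypothetical normal path */

    /* Hypothetical "SA MAD send switch": if an SA local agent has
     * registered, outgoing SA-class MADs are diverted to it; otherwise
     * they take the normal path to the wire. */
    struct sa_agent_ops {
        /* Return 0 if the agent consumed the MAD (and will inject a
         * response later); nonzero to let it go to the real SA. */
        int (*handle_query)(void *context, struct ib_mad *mad);
    };

    static struct sa_agent_ops *sa_agent;
    static void *sa_agent_ctx;

    void sa_agent_register(struct sa_agent_ops *ops, void *ctx)
    {
        sa_agent = ops;
        sa_agent_ctx = ctx;
    }

    int sa_switch_send(struct ib_mad *mad)
    {
        if (mad_class(mad) == IB_MGMT_CLASS_SUBN_ADM && sa_agent &&
            sa_agent->handle_query(sa_agent_ctx, mad) == 0)
            return 0;                       /* diverted to the SA local agent */
        return send_to_wire(mad);           /* normal path to the wire */
    }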

Functional requirements:
* It is clear that the first SA query to cache is PathRecord.
  So if a new client wants to connect to another node, a new PathRecord
  query will not need to be sent to the SA. However, recent work on QoS has
  pointed out that under some QoS schemes a PathRecord should not be shared by
  different clients, or even by different connections. There are several ways
  to make such a QoS scheme scale. Since this is a different discussion topic,
  I only bring it up so that we take into account that caching might also need
  to be done by a complex key (not just SRC/DST ...); a sketch of such a key
  follows this list.
* Forgive me for bringing the following issue up over and over to the group:
  Multicast Join/Leave should be reference counted. The "SA local agent" could
  be the right place for doing this kind of reference counting (although if it
  does that, it probably needs to be located in the kernel, to enable cleanup
  after killed processes). A sketch of the reference counting also follows
  this list.
* Similarly - "Client re-registration" could be made transparent to clients.
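
To illustrate the complex-key point from the first item, a sketch of what a
PathRecord cache key might have to look like (the exact field set is just my
guess - a real design would derive it from the QoS scheme and the PathRecord
component mask):

    #include <stdint.h>

    union ib_gid { uint8_t raw[16]; };      /* 128-bit GID */

    /* Hypothetical composite cache key: a PathRecord entry is matched not
     * only by its endpoints but also by the QoS-related attributes that
     * may make it non-shareable between clients or even connections. */
    struct path_cache_key {
        union ib_gid sgid;                  /* source GID */
        union ib_gid dgid;                  /* destination GID */
        uint16_t     pkey;                  /* partition */
        uint8_t      sl;                    /* service level (QoS) */
        uint64_t     qos_tag;               /* per-client/connection QoS tag */
    };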
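
And a sketch of the reference counting from the second item (send_sa_join and
send_sa_leave stand in for real MCMemberRecord Set/Delete transactions; only
the first join and the last leave would reach the SA):

    #include <stdint.h>

    extern int send_sa_join(const uint8_t *mgid);   /* MCMemberRecord Set */
    extern int send_sa_leave(const uint8_t *mgid);  /* MCMemberRecord Delete */

    /* Hypothetical per-group reference counting in the SA local agent:
     * only the 0->1 join and the 1->0 leave generate SA traffic; all
     * other joins/leaves are handled locally. */
    struct mcast_ref {
        uint8_t mgid[16];                   /* group GID */
        int     refcount;
    };

    int agent_join(struct mcast_ref *grp)
    {
        if (grp->refcount++ == 0)
            return send_sa_join(grp->mgid);
        return 0;
    }

    int agent_leave(struct mcast_ref *grp)
    {
        if (--grp->refcount == 0)
            return send_sa_leave(grp->mgid);
        return 0;
    }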

Cache Invalidation:
Several discussions about PathRecord invalidation were spawned in the past.
IMO, it is enough to be notified about the death of local processes, remote
port availability (trap 64/65) and multicast group availability (trap 66/67)
in order to invalidate SA cache information. So each SA agent could register
to obtain this data. But that solution does not scale nicely, as the SA needs
to send notifications to all nodes (it is reliable, though - it can resend
until the report is repressed). However, the current IBTA definition for
InformInfo (the event forwarding mechanism) does not allow for multicast of
Report(Notice). The reason is that registration for event forwarding is done
with Set(InformInfo), which uses the requester's QP and LID as the address for
sending the matching report. A simple way around that limitation could be to
enable the SM to "pre-register" a well known multicast group target for event
forwarding. One issue, though, would be that UD multicast is not reliable and
some notifications could get lost. A notification sequence number could be
used to catch these missed notifications eventually.
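
For illustration, a sketch of how an SA agent might use such a sequence number
(the names here are made up, and the sequence number itself is the proposed
extension - today's Notice does not carry one):

    #include <stdint.h>

    struct notice;                          /* opaque Report(Notice) payload */
    extern void resync_cache(void);         /* hypothetical: re-query / flush */
    extern void apply_notice(struct notice *n); /* hypothetical: invalidate */

    /* Hypothetical lost-notification detection: the SM stamps each
     * multicast Report(Notice) with an increasing sequence number; a gap
     * means we missed at least one event and must resynchronize. */
    static uint32_t last_seq;

    void on_notice(uint32_t seq, struct notice *n)
    {
        if (seq != last_seq + 1)
            resync_cache();                 /* missed one or more notices */
        last_seq = seq;
        apply_notice(n);                    /* handle this notice normally */
    }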

Eitan

Re: [openib-general] SA cache design

2006-01-03 Thread Sean Hefty

Hal Rosenstock wrote:
I've been given the task of trying to come up with an implementation for an SA 
cache.  The intent is to increase the scalability and performance of the openib 
stack.  My current thoughts on the implementation are below.  Any feedback is 
welcome.


To keep the design as flexible as possible, my plan is to implement the cache in 
userspace.  The interface to the cache would be via MADs.


Would this be another MAD class which mimics the SA class ?


I hadn't fully figured this out yet.  I'm not sure if another MAD class is
needed or not.  My goal is to implement this as transparently to the
application as possible without violating the spec, perhaps appearing as an SA
on a different LID.


 Clients would send 
their queries to the sa_cache instead of the SA itself.  The format of the MADs 
would be essentially identical to those used to query the SA itself.  Response 
MADs would contain any requested information.  If the cache could not satisfy a 
request, the sa_cache would query the SA, update its cache, then return a reply.


The benefits that I see with this approach are:

+ Clients would only need to send requests to the sa_cache.
+ The sa_cache can be implemented in stages.  Requests that it cannot handle 
would just be forwarded to the SA.


Another option would be for the SA cache to indicate what requests it
handles (some MADs for this) and have the clients only go to the cache
for those queries (and directly to the SA for the others).


I thought about this, but this puts an additional burden on the clients. 
Letting the sa_cache forward the request allows it to send the requests to 
another sa_cache, rather than directly to the SA.  There's some additional 
flexibility that we gain in the long term design by forwarding requests.  (I'm 
thinking of the possibility of having an sa_cache hierarchy.)
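
As a rough sketch of that flexibility (the names are invented for this
example): on a miss, the cache forwards the query unmodified to a configured
upstream, which can be either a parent sa_cache or the real SA - the client
never needs to know which one answered.

    #include <stdint.h>

    struct ib_mad;                          /* opaque query */
    extern int forward_mad(uint16_t dlid, struct ib_mad *query); /* hypothetical */

    /* Hypothetical upstream selection: a cache miss goes to a configured
     * upstream, which may be a parent sa_cache or the real SA. */
    struct upstream {
        uint16_t dlid;                      /* parent cache's LID, or the SA's */
    };

    int handle_miss(struct upstream *up, struct ib_mad *query)
    {
        return forward_mad(up->dlid, query);
    }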



+ The sa_cache could be implemented on each host, or a select number of hosts.
+ The interface to the sa_cache is similar to that used by the SA.
+ The cache would use virtual memory and could be saved to disk.

Some drawbacks specific to this method are:

- The MAD interface will result in additional data copies and userspace to 
kernel transitions for clients residing on the local system.
- Clients require a mechanism to locate the sa_cache, or need to make 
assumptions about its location.


Would SA caching be a service ID or set of IDs ?


I'd like the sa_cache to give the appearance of being a standard SA as much as 
possible.  One effect is that an sa_cache may not be able to run on the same 
node as the actual SA, but that restriction seems desirable to me.



Are there also issues around cache invalidation ?


I didn't list cache synchronization as an issue because I couldn't think of any 
problems that were specific to this design, versus being a general issue.


- Sean



Re: [openib-general] SA cache design

2006-01-03 Thread Hal Rosenstock
Hi Sean,

On Tue, 2006-01-03 at 19:42, Sean Hefty wrote:
> I've been given the task of trying to come up with an implementation for an 
> SA 
> cache.  The intent is to increase the scalability and performance of the 
> openib 
> stack.  My current thoughts on the implementation are below.  Any feedback is 
> welcome.
> 
> To keep the design as flexible as possible, my plan is to implement the cache 
> in 
> userspace.  The interface to the cache would be via MADs.

Would this be another MAD class which mimics the SA class ?

>   Clients would send 
> their queries to the sa_cache instead of the SA itself.  The format of the 
> MADs 
> would be essentially identical to those used to query the SA itself.  
> Response 
> MADs would contain any requested information.  If the cache could not satisfy 
> a 
> request, the sa_cache would query the SA, update its cache, then return a 
> reply.
> 
> The benefits that I see with this approach are:
> 
> + Clients would only need to send requests to the sa_cache.
> + The sa_cache can be implemented in stages.  Requests that it cannot handle 
> would just be forwarded to the SA.

Another option would be for the SA cache to indicate what requests it
handles (some MADs for this) and have the clients only go to the cache
for those queries (and directly to the SA for the others).

> + The sa_cache could be implemented on each host, or a select number of hosts.
> + The interface to the sa_cache is similar to that used by the SA.
> + The cache would use virtual memory and could be saved to disk.
> 
> Some drawbacks specific to this method are:
> 
> - The MAD interface will result in additional data copies and userspace to 
> kernel transitions for clients residing on the local system.
> - Clients require a mechanism to locate the sa_cache, or need to make 
> assumptions about its location.

Would SA caching be a service ID or set of IDs ?

Are there also issues around cache invalidation ?

-- Hal



[openib-general] SA cache design

2006-01-03 Thread Sean Hefty
I've been given the task of trying to come up with an implementation for an SA 
cache.  The intent is to increase the scalability and performance of the openib 
stack.  My current thoughts on the implementation are below.  Any feedback is 
welcome.


To keep the design as flexible as possible, my plan is to implement the cache in 
userspace.  The interface to the cache would be via MADs.  Clients would send 
their queries to the sa_cache instead of the SA itself.  The format of the MADs 
would be essentially identical to those used to query the SA itself.  Response 
MADs would contain any requested information.  If the cache could not satisfy a 
request, the sa_cache would query the SA, update its cache, then return a reply.
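
To make the intended flow concrete, here is a rough sketch (all names are
placeholders, not an actual implementation):

    /* Hypothetical sa_cache request loop: answer from the cache when
     * possible; otherwise query the real SA, cache the result, reply. */
    struct ib_mad;                          /* opaque MAD */
    extern struct ib_mad *cache_lookup(struct ib_mad *query);
    extern struct ib_mad *query_real_sa(struct ib_mad *query);
    extern void cache_insert(struct ib_mad *query, struct ib_mad *resp);
    extern void send_response(struct ib_mad *query, struct ib_mad *resp);
    extern void send_error(struct ib_mad *query);

    void handle_request(struct ib_mad *query)
    {
        struct ib_mad *resp = cache_lookup(query); /* hit: answer locally */

        if (!resp) {
            resp = query_real_sa(query);    /* miss: ask the SA */
            if (resp)
                cache_insert(query, resp);  /* remember for next time */
        }
        if (resp)
            send_response(query, resp);     /* same format as an SA reply */
        else
            send_error(query);              /* SA unreachable / timed out */
    }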


The benefits that I see with this approach are:

+ Clients would only need to send requests to the sa_cache.
+ The sa_cache can be implemented in stages.  Requests that it cannot handle 
would just be forwarded to the SA.

+ The sa_cache could be implemented on each host, or a select number of hosts.
+ The interface to the sa_cache is similar to that used by the SA.
+ The cache would use virtual memory and could be saved to disk.

Some drawbacks specific to this method are:

- The MAD interface will result in additional data copies and userspace to 
kernel transitions for clients residing on the local system.
- Clients require a mechanism to locate the sa_cache, or need to make 
assumptions about its location.


I'm sure that there are other benefits and drawbacks that I'm missing.

- Sean