Re: [openib-general] SA cache design
Eitan Zahavi wrote:
> [EZ] Having N^2 messages is not a big problem if they do not all go to one
> target... CM is distributed and this is good. Only the PathRecord section
> of the connection establishment is going today to one node (SA) and you
> are about to fix it...

I expect that we'll start having issues scaling when the number of nodes starts to exceed the size of the CM's QP. Your idea below should help.

> During initial connection setup you will not have anything in the SA cache
> and thus the SA will need to answer N^2 PathRecords. Smart exponential
> back-off can resolve that DoS attack on the SA at bring-up.

I'll post the code for the cache once I complete my testing, but it issues a single query to fill the cache. The SA will only see O(n) requests. The cache also supports an update delay, or settle time, and a minimum update time to prevent spamming the SA with back-to-back requests.

> [EZ] We might need a little more in the key for QoS support (to come).

This would need to be exposed through our APIs as well. Alternate paths are also not yet supported.

> [EZ] I would try and make sure the connections are not done in a manner
> such that all nodes try to establish connections to a single node at the
> same time. This is an application issue but can be easily resolved.

I agree.

> [EZ] I think a centralized CM is only going to make things worse.

It can reduce the number of messages on the network from O(n^2) to O(n). The idea is that instead of all nodes sending connection requests to all other nodes, they send a single connection request -- containing an array of QP information -- to one node. (The array could be sent over an established connection, rather than in MADs.) The amount of traffic to that one node should be only slightly worse than the all-to-all case.

- Sean

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
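The settle time and minimum update time mentioned above can be sketched as a small throttling policy. This is an illustrative Python sketch with invented names (`ThrottledCache`, `query_sa`), not the posted cache code:

```python
import time

class ThrottledCache:
    """Sketch of the throttling described above: a settle time delays a
    refresh after a fabric change is seen, and a minimum update interval
    prevents back-to-back queries to the SA."""

    def __init__(self, query_sa, settle_time=1.0, min_update_interval=10.0,
                 clock=time.monotonic):
        self.query_sa = query_sa          # callable issuing one bulk SA query
        self.settle_time = settle_time
        self.min_update_interval = min_update_interval
        self.clock = clock
        self.records = {}
        self.last_update = None
        self.dirty_since = None

    def mark_dirty(self):
        """Called when a fabric change notice arrives."""
        if self.dirty_since is None:
            self.dirty_since = self.clock()

    def maybe_refresh(self):
        """Refresh only after the settle time has elapsed and not more often
        than min_update_interval. Returns True if the SA was queried."""
        now = self.clock()
        if self.dirty_since is None:
            return False
        if now - self.dirty_since < self.settle_time:
            return False
        if self.last_update is not None and \
           now - self.last_update < self.min_update_interval:
            return False
        self.records = self.query_sa()    # one O(n) bulk query, not n queries
        self.last_update = now
        self.dirty_since = None
        return True
```

With this shape, an SM that generates a burst of change notices still costs the SA at most one bulk query per minimum update interval.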
RE: [openib-general] SA cache design
> From: Eitan Zahavi [mailto:[EMAIL PROTECTED]
>
> What was I thinking ...
>
>     for (target = (myRank + 1) % numNodes; target != myRank;
>          target = (target + 1) % numNodes) {
>         /* establish connection to node target */
>     }

This can be even simpler for MPI. Given that some nodes must listen and others must connect, use an approach such as having higher-rank processes connect to lower-rank processes. Then it's simply:

    initiate listen on my endpoint  /* could omit this for highest rank in job */
    for (target = my_rank - 1; target >= 0; target--)
        initiate connect to target

For even greater efficiency, the "initiate connect to target" could be done in parallel batches. E.g. start 50 outbound connects, wait for some or all of them to complete, then start the next batch. Such as:

    for (target = my_rank - 1; target >= 0; target--) {
        while (num_outstanding > limit)
            wait;
        num_outstanding++;
        initiate connect to target;
    }

Then the callback for completing a connection sequence could decrement num_outstanding and wake up the waiter (or the waiter could be a sleep/poll type loop).

We have been successfully using the algorithms above for about 2-3 years now and they work very well.

Todd Rimmer
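Todd's batched variant can be rendered as a runnable Python sketch; the `initiate_connect` and `wait_for_completion` callables are stand-ins for the real CM primitives:

```python
def batched_connect(my_rank, limit, initiate_connect, wait_for_completion):
    """Connect to every lower rank, with at most `limit` connects in flight.
    `initiate_connect(target)` starts a connect; `wait_for_completion()`
    blocks until at least one outstanding connect finishes and returns the
    number that completed."""
    num_outstanding = 0
    order = []
    for target in range(my_rank - 1, -1, -1):
        while num_outstanding >= limit:
            num_outstanding -= wait_for_completion()
        initiate_connect(target)
        order.append(target)
        num_outstanding += 1
    # drain the final batch
    while num_outstanding > 0:
        num_outstanding -= wait_for_completion()
    return order
```

In a real implementation the connect-completion callback would decrement the outstanding count and wake the waiter, as described above.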
RE: [openib-general] SA cache design
What was I thinking ...

    for (target = (myRank + 1) % numNodes; target != myRank;
         target = (target + 1) % numNodes) {
        /* establish connection to node target */
    }

> [EZ] I would try and make sure the connections are not done in a manner
> such that all nodes try to establish connections to a single node at the
> same time. This is an application issue but can be easily resolve. Do
> the MPI connection in a loop like:
>
>     for (target = (myRank + 1) % numNodes; target != myRank; (target++) %
>          numNodes) {
>         /* establish connection to node target */
>     }
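A useful property of the corrected loop: node r's k-th target is (r + 1 + k) mod numNodes, so at any given step every node is targeting a different peer and no single node is flooded with simultaneous connection requests. A small Python check of the pattern:

```python
def connection_order(my_rank, num_nodes):
    """Targets in the order the corrected loop above visits them."""
    order = []
    target = (my_rank + 1) % num_nodes
    while target != my_rank:
        order.append(target)
        target = (target + 1) % num_nodes
    return order

# For 8 nodes: at every step k, the 8 nodes target 8 distinct peers.
orders = [connection_order(r, 8) for r in range(8)]
```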
RE: [openib-general] SA cache design
Hi Sean,

Sean Hefty wrote:
> Eitan Zahavi wrote:
>> [EZ] The scalability issues we see today are what I most worry about.
>
> One issue that I see is that the CMA, IB CM, and DAPL APIs support only
> point-to-point connections. Trying to layer a many-to-many connection
> model over these is leading to the inefficiencies. For example, the CMA
> generates one SA query per connection. Another issue is that even if the
> number of queries were reduced, the fabric will still see O(n^2)
> connection messages.

[EZ] Having N^2 messages is not a big problem if they do not all go to one target... CM is distributed and this is good. Only the PathRecord section of the connection establishment is going today to one node (SA) and you are about to fix it... During initial connection setup you will not have anything in the SA cache and thus the SA will need to answer N^2 PathRecords. Smart exponential back-off can resolve that DoS attack on the SA at bring-up.

> Based on the code, the only SA query of interest to most users will be a
> path record query by gids/pkey. To speed up applications written to the
> current CMA, DAPL, and Intel's MPI (hey, I gotta eat), my actual
> implementation has a very limited path record cache in the kernel. The
> cache uses an index with O(1) insertion, removal, and retrieval. (I plan
> on re-using the index to help improve the performance of the IB CM as
> well.)

[EZ] We might need a little more in the key for QoS support (to come).

> I'm still working on ideas to address the many-to-many connection model.

[EZ] I would try and make sure the connections are not done in a manner such that all nodes try to establish connections to a single node at the same time. This is an application issue but can be easily resolved. Do the MPI connection in a loop like:

    for (target = (myRank + 1) % numNodes; target != myRank; (target++) %
         numNodes) {
        /* establish connection to node target */
    }

> One idea is to have a centralized connection manager to coordinate the
> connections between the various endpoints. The drawback is that this
> requires defining a proprietary protocol. Any implementation work in this
> area will be deferred for now though.

[EZ] I think a centralized CM is only going to make things worse.

> - Sean
Re: [openib-general] SA cache design
Eitan Zahavi wrote:
> [EZ] The scalability issues we see today are what I most worry about.

I think that we have a couple of scalability issues at the core of this problem. I think that a cache can solve part of the problem, but to fully address the issues, we eventually may need to extend our APIs and underlying protocols.

One issue that I see is that the CMA, IB CM, and DAPL APIs support only point-to-point connections. Trying to layer a many-to-many connection model over these is leading to the inefficiencies. For example, the CMA generates one SA query per connection. Another issue is that even if the number of queries were reduced, the fabric will still see O(n^2) connection messages.

Based on the code, the only SA query of interest to most users will be a path record query by gids/pkey. To speed up applications written to the current CMA, DAPL, and Intel's MPI (hey, I gotta eat), my actual implementation has a very limited path record cache in the kernel. The cache uses an index with O(1) insertion, removal, and retrieval. (I plan on re-using the index to help improve the performance of the IB CM as well.)

I'm still working on ideas to address the many-to-many connection model. One idea is to have a centralized connection manager to coordinate the connections between the various endpoints. The drawback is that this requires defining a proprietary protocol. Any implementation work in this area will be deferred for now though.

- Sean
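The O(1) index described above amounts to an exact-match lookup keyed by the query fields. A minimal Python sketch of the idea (field names are illustrative; the actual cache is kernel C code):

```python
class PathRecordCache:
    """Exact-match path record cache keyed by (sgid, dgid, pkey), giving
    O(1) average-case insertion, removal, and retrieval via a hash table."""

    def __init__(self):
        self._index = {}

    def insert(self, sgid, dgid, pkey, path_rec):
        self._index[(sgid, dgid, pkey)] = path_rec

    def remove(self, sgid, dgid, pkey):
        self._index.pop((sgid, dgid, pkey), None)

    def lookup(self, sgid, dgid, pkey):
        """Returns the cached record, or None on a miss (the caller then
        falls back to a real SA query)."""
        return self._index.get((sgid, dgid, pkey))
```

The QoS point raised in the thread would show up here as extra fields in the key tuple (e.g. SL or service ID).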
Re: [openib-general] SA cache design
Rimmer, Todd wrote:
> While each process could do a GET_TABLE for all path records, that would
> be rather inefficient and would provide 1,000,000 path records in the RMPP
> response, of which only 500 are of interest.

Each process could do a GET_TABLE for only those path records with the SGID set to their local port and NumPath set to 1. This would give them only 1000 or so path records, most of which are of interest.

> Even if all 4000 processors were being used in a single run, each process
> only needs 3999 path records (999 of which are unique). In fact a given
> node will never need more than N of the N^2 path records, because the
> remaining involve paths where this node is not involved. So getting all
> 1,000,000 path records would be very inefficient.

Even a local cache wouldn't get every possible path record. The application should be no different. An application that wants to connect to every node on the fabric should only need to issue a single path record query, all of whose results are of interest.

- Sean
RE: [openib-general] SA cache design
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
>
> Rimmer, Todd wrote:
>> Each MPI process is independent. However they all need to get pathrecords
>> for all the other processes/nodes in the system. Hence, each process on a
>> node will make the exact same set of queries.
>
> That should still only be P queries per node, with P = number of processes
> on a node. Why doesn't a single query (GET_TABLE) suffice for each
> process?

Given a cluster with 1000 nodes, 4 processors per node: a given MPI run may choose to use a subset, for example 500 processes. Each process needs path records for the other 500 processes, but not for the other 3500 CPUs in the cluster.

While each process could do a GET_TABLE for all path records, that would be rather inefficient and would provide 1,000,000 path records in the RMPP response, of which only 500 are of interest. Even if all 4000 processors were being used in a single run, each process only needs 3999 path records (999 of which are unique). In fact a given node will never need more than N of the N^2 path records, because the remaining involve paths where this node is not involved. So getting all 1,000,000 path records would be very inefficient.

Then multiply this by 4 processes per node making this same set of queries. Then multiply this by multiple partitions, SLs, etc. per node, and it gets very inefficient to simply get the whole table.

Todd Rimmer
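The sizes quoted above can be checked with a little arithmetic (assuming, as in the example, one path record per node pair):

```python
nodes = 1000
procs_per_node = 4

# Unfiltered GET_TABLE: roughly one path record per (source, destination)
# node pair across the whole fabric.
all_path_records = nodes * nodes

# Filtered on the local SGID with NumPath = 1: only paths from this node.
per_node_records = nodes

# With all 4000 processors in one run:
procs = nodes * procs_per_node
records_per_process = procs - 1   # one per peer process
unique_per_node = nodes - 1       # distinct node pairs actually involved
```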
Re: [openib-general] SA cache design
Rimmer, Todd wrote:
> Each MPI process is independent. However they all need to get pathrecords
> for all the other processes/nodes in the system. Hence, each process on a
> node will make the exact same set of queries.

That should still only be P queries per node, with P = number of processes on a node. Why doesn't a single query (GET_TABLE) suffice for each process?

- Sean
Re: [openib-general] SA cache design
On Thu, Jan 12, 2006 at 11:58:28AM -0800, Sean Hefty wrote:
> This is still O(NlogN) operations, which made me look at indexing schemes
> to improve performance.

I strongly associate "indexing schemes" with "Judy":

    http://docs.hp.com/en/B6841-90001/ix01.html

The open source project is here:

    http://judy.sourceforge.net/

> The most obvious implementation to me was to store path records in a
> binary tree sorted by dgid/pkey. But this isn't very flexible.

A "dynamic, associative array" might be overkill too. I'm not sure how many indexes it supports, but Judy is definitely worth looking at for a "simple" implementation. Perf data I saw 3-4 years ago indicated that Judy scales nicely from 0 to several million entries.

grant
RE: [openib-general] SA cache design
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
>
>> why ask the SA the same question multiple times in a row?
>
> I have no idea why the application did this. Are any of the queries in
> this case actually the same?

Each MPI process is independent. However they all need to get pathrecords for all the other processes/nodes in the system. Hence, each process on a node will make the exact same set of queries.

Todd R.
Re: [openib-general] SA cache design
Eitan Zahavi wrote:
> [EZ] MPI opens a connection from each node to every other node. Actually
> even from every CPU to every other CPU. So this is why we have N^2
> connections.

I was confusing myself. I think that there are n(n-1)/2 connections, but that's still O(n^2).

- Sean
Re: [openib-general] SA cache design
Rimmer, Todd wrote:
> ... 1 million entry SA database.

This is exactly why I think that the SA needs to be backed by a real DBMS.

> In contrast the replica on each node only needs to handle O(N) entries.
> And its lookup time could be O(logN).

This is still O(NlogN) operations, which made me look at indexing schemes to improve performance. The most obvious implementation to me was to store path records in a binary tree sorted by dgid/pkey. But this isn't very flexible.

> why ask the SA the same question multiple times in a row?

I have no idea why the application did this. Are any of the queries in this case actually the same?

- Sean
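One way to see the flexibility trade-off mentioned above: a structure ordered by (dgid, pkey), whether a binary tree or the sorted list sketched below, supports range scans ("all paths to dgid X") that a pure hash index cannot, at the cost of O(log N) rather than O(1) lookups. This is an illustrative Python sketch, not the kernel code:

```python
import bisect

class SortedPathIndex:
    """Path records kept sorted by (dgid, pkey); a sorted list stands in
    for the binary tree for brevity."""

    def __init__(self):
        self._keys = []
        self._recs = {}

    def insert(self, dgid, pkey, rec):
        key = (dgid, pkey)
        if key not in self._recs:
            bisect.insort(self._keys, key)
        self._recs[key] = rec

    def get(self, dgid, pkey):
        """O(log N) exact-match lookup."""
        return self._recs.get((dgid, pkey))

    def all_pkeys_for(self, dgid):
        """Range scan over one dgid -- the query shape a hash can't serve.
        pkeys are 16-bit, so (dgid, 1 << 16) bounds the range."""
        lo = bisect.bisect_left(self._keys, (dgid, 0))
        hi = bisect.bisect_left(self._keys, (dgid, 1 << 16))
        return [self._recs[k] for k in self._keys[lo:hi]]
```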
RE: [openib-general] SA cache design
> On a related note, why does every instance of the application need to
> query for every other instance? To establish all-to-all communication,
> couldn't instance X only initiate connections to instances > X? (I.e. 1
> connects to 2 and 3, 2 connects to 3.)

[EZ] MPI opens a connection from each node to every other node. Actually even from every CPU to every other CPU. So this is why we have N^2 connections.

>> Only a very small subset of queries is used:
>> * PathRecord by SRC-GUID,DST-GUID
>> * PortInfo by capability mask
>
> I did look at the code to see what queries were actually being used today.
> And yes, we can implement for only those cases. I wanted to allow the
> flexibility to support other queries efficiently.

[EZ] The scalability issues we see today are what I most worry about.

> - Sean
Re: [openib-general] SA cache design
Eitan Zahavi wrote:
> The issue is that the number of queries grows as N^2.

I understand. On a related note, why does every instance of the application need to query for every other instance? To establish all-to-all communication, couldn't instance X only initiate connections to instances > X? (I.e. 1 connects to 2 and 3, 2 connects to 3.)

> Only a very small subset of queries is used:
> * PathRecord by SRC-GUID,DST-GUID
> * PortInfo by capability mask

I did look at the code to see what queries were actually being used today. And yes, we can implement for only those cases. I wanted to allow the flexibility to support other queries efficiently.

- Sean
RE: [openib-general] SA cache design
Hi Sean,

The issue is that the number of queries grows as N^2. Only a very small subset of queries is used:
* PathRecord by SRC-GUID,DST-GUID
* PortInfo by capability mask

Not to say the current implementations are perfect. But RDBMSs are optimized for other requirements, not a simple single-key lookup. Also, the PathRecord implementation requires traversing the fabric. One could store the result after enumerating the entire N^2 * Nsl * Npkey * ... space. But then lookup is a simple hash lookup.

Eitan

> Brian Long wrote:
>> How much overhead is going to be incurred by using a standard RDBMS
>> instead of not caching anything? I'm not completely familiar with the IB
>> configurations that would benefit from the proposed SA cache, but it
>> seems to me, adding a RDBMS to anything as fast as IB would actually slow
>> things down considerably. Can an RDBMS + SA cache actually be faster than
>> no cache at all?
>
> I'm not sure what the speed-up of any cache will be. The SA maintains a
> database of various related records - node records, path records, service
> records, etc. and responds to queries. This need doesn't go away. The SA
> itself is a perfect candidate to be implemented using a DBMS. (And if one
> had been implemented over a DBMS, I'm not even sure that we'd be talking
> about scalability issues for only a few thousand nodes. Is the perceived
> lack of scalability of the SA a result of the architecture or the existing
> implementations?)
>
> My belief is that a DBMS will outperform anything that I could write to
> store and retrieve these records. Consider that a 4000 node cluster will
> have about 8,000,000 path records. Local caches can reduce this
> considerably (to about 4000), and if we greatly restrict the type of
> queries that are supported, then we can manage the retrieval of those
> records ourselves.
>
> I do not want end-users to have to administer a database. However, if the
> user only needs to install a library, then this approach seems worth
> pursuing.
>
> - Sean
Re: [openib-general] SA cache design
Brian Long wrote:
> What about SQLite (http://www.sqlite.org/)? This is used by yum 2.4 in
> Fedora Core and other distributions. "SQLite is a small C library that
> implements a self-contained, embeddable, zero-configuration SQL database
> engine."

Someone else sent me a link to this same site, and it looks promising. Thanks.

- Sean
Re: [openib-general] SA cache design
On Thu, 2006-01-12 at 10:16 -0800, Sean Hefty wrote:
> Brian Long wrote:
>> How much overhead is going to be incurred by using a standard RDBMS
>> instead of not caching anything? I'm not completely familiar with the IB
>> configurations that would benefit from the proposed SA cache, but it
>> seems to me, adding a RDBMS to anything as fast as IB would actually slow
>> things down considerably. Can an RDBMS + SA cache actually be faster than
>> no cache at all?
>
> I'm not sure what the speed-up of any cache will be. The SA maintains a
> database of various related records - node records, path records, service
> records, etc. and responds to queries. This need doesn't go away. The SA
> itself is a perfect candidate to be implemented using a DBMS. (And if one
> had been implemented over a DBMS, I'm not even sure that we'd be talking
> about scalability issues for only a few thousand nodes. Is the perceived
> lack of scalability of the SA a result of the architecture or the existing
> implementations?)
>
> My belief is that a DBMS will outperform anything that I could write to
> store and retrieve these records. Consider that a 4000 node cluster will
> have about 8,000,000 path records. Local caches can reduce this
> considerably (to about 4000), and if we greatly restrict the type of
> queries that are supported, then we can manage the retrieval of those
> records ourselves.
>
> I do not want end-users to have to administer a database. However, if the
> user only needs to install a library, then this approach seems worth
> pursuing.

What about SQLite (http://www.sqlite.org/)? This is used by yum 2.4 in Fedora Core and other distributions. "SQLite is a small C library that implements a self-contained, embeddable, zero-configuration SQL database engine."

/Brian/

--
Brian Long               |       |    |
IT Data Center Systems   |     .|||.  .|||.
Cisco Linux Developer    |  ..:|||:...:|||:..
Phone: (919) 392-7363    |  C i s c o  S y s t e m s
RE: [openib-general] SA cache design
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
>
> I'm not sure what the speed-up of any cache will be. The SA maintains a
> database of various related records - node records, path records, service
> records, etc. and responds to queries. This need doesn't go away. The SA
> itself is a perfect candidate to be implemented using a DBMS. (And if one
> had been implemented over a DBMS, I'm not even sure that we'd be talking
> about scalability issues for only a few thousand nodes. Is the perceived
> lack of scalability of the SA a result of the architecture or the existing
> implementations?)

The scalability problem occurs during things like MPI job startup. At startup, you will have N processes which each need N-1 path records to establish connections. Those queries require both Node Record and Path Record queries. This means at job startup, the SA must process O(N^2) SA queries.

If the lookup algorithm in the SA is O(logM) (M = number of SA records, which is O(N^2)), then the SA will have O(N^2 log(N^2)) operations to perform and O(N^2) packets to send and receive. For a 4000 CPU cluster (1000 nodes with 2 dual-core CPUs each), that is over 16 million SA queries at job startup against a 1 million entry SA database. It would take quite a good SA database implementation to handle that in a timely manner.

In contrast, the replica on each node only needs to handle O(N) entries. And its lookup time could be O(logN).

You'll note I spoke of processes, not nodes. In multi-CPU nodes, each process will need similar information. This is one area where a replica can greatly help: why ask the SA the same question multiple times in a row? If only a cache is considered, then startup is still O(N^2) SA queries; it's just that we have 1/CPUs-per-node as many queries.

Todd Rimmer
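The estimate above can be checked with back-of-the-envelope arithmetic (the path record queries alone come to just under 16 million; counting the accompanying node record queries pushes the total over it):

```python
import math

nodes = 1000      # 1000 nodes with 2 dual-core CPUs each
procs = 4000

path_queries = procs * (procs - 1)   # each process resolves every peer
sa_records = nodes * nodes           # roughly one path record per node pair

# An O(log M) lookup over the ~1M-record SA database costs about 20
# comparisons per query; a per-node replica of O(N) entries costs about 10.
sa_lookup_cost = math.log2(sa_records)
replica_lookup_cost = math.log2(nodes)
```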
Re: [openib-general] SA cache design
Brian Long wrote:
> How much overhead is going to be incurred by using a standard RDBMS
> instead of not caching anything? I'm not completely familiar with the IB
> configurations that would benefit from the proposed SA cache, but it seems
> to me, adding a RDBMS to anything as fast as IB would actually slow things
> down considerably. Can an RDBMS + SA cache actually be faster than no
> cache at all?

I'm not sure what the speed-up of any cache will be. The SA maintains a database of various related records - node records, path records, service records, etc. - and responds to queries. This need doesn't go away. The SA itself is a perfect candidate to be implemented using a DBMS. (And if one had been implemented over a DBMS, I'm not even sure that we'd be talking about scalability issues for only a few thousand nodes. Is the perceived lack of scalability of the SA a result of the architecture or the existing implementations?)

My belief is that a DBMS will outperform anything that I could write to store and retrieve these records. Consider that a 4000 node cluster will have about 8,000,000 path records. Local caches can reduce this considerably (to about 4000), and if we greatly restrict the type of queries that are supported, then we can manage the retrieval of those records ourselves.

I do not want end-users to have to administer a database. However, if the user only needs to install a library, then this approach seems worth pursuing.

- Sean
Re: [openib-general] SA cache design
On Wed, 2006-01-11 at 14:21 -0800, Sean Hefty wrote:
> Rimmer, Todd wrote:
>> A relational database is overkill for this function. It will also likely
>> be more complex for end users to set up and debug. The cache setup should
>> be simple. The solution should be such that just an on/off switch needs
>> to be configured (with a default of on) for most users to get started.
>
> My take is a little different. I view the SA as a database that maintains
> related attributes.
>
> By supporting relationships between different attributes, we can provide a
> more powerful, higher-performing, and more user-friendly interface to the
> user. For example, a single SQL query could return path records given only
> a node description or service name. Today, we generate multiple SA
> queries, their responses, and associated RMPP MADs to obtain the same
> data.
>
> I'm not sold on the idea of using a relational database, because of the
> additional complexity for end-users. However, I believe it can offer
> significant advantages over what we could code ourselves.

How much overhead is going to be incurred by using a standard RDBMS instead of not caching anything? I'm not completely familiar with the IB configurations that would benefit from the proposed SA cache, but it seems to me, adding a RDBMS to anything as fast as IB would actually slow things down considerably. Can an RDBMS + SA cache actually be faster than no cache at all?

/Brian/

--
Brian Long               |       |    |
IT Data Center Systems   |     .|||.  .|||.
Cisco Linux Developer    |  ..:|||:...:|||:..
Phone: (919) 392-7363    |  C i s c o  S y s t e m s
Re: [openib-general] SA cache design
Rimmer, Todd wrote:
> A relational database is overkill for this function. It will also likely
> be more complex for end users to set up and debug. The cache setup should
> be simple. The solution should be such that just an on/off switch needs to
> be configured (with a default of on) for most users to get started.

My take is a little different. I view the SA as a database that maintains related attributes.

By supporting relationships between different attributes, we can provide a more powerful, higher-performing, and more user-friendly interface to the user. For example, a single SQL query could return path records given only a node description or service name. Today, we generate multiple SA queries, their responses, and associated RMPP MADs to obtain the same data.

I'm not sold on the idea of using a relational database, because of the additional complexity for end-users. However, I believe it can offer significant advantages over what we could code ourselves.

- Sean
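The single-SQL-query point above can be illustrated with an in-memory SQLite sketch. The table and column names here are invented for illustration; this is not an actual schema from the thread:

```python
import sqlite3

# Two related record types, as the SA holds them.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node_records (node_guid TEXT PRIMARY KEY,
                               node_desc TEXT, port_gid TEXT);
    CREATE TABLE path_records (sgid TEXT, dgid TEXT, pkey INTEGER,
                               slid INTEGER, dlid INTEGER);
""")
db.execute("INSERT INTO node_records VALUES ('guid-1', 'storage-1', 'gid-1')")
db.execute("INSERT INTO path_records VALUES ('gid-0', 'gid-1', 65535, 10, 11)")

# One join replaces the NodeRecord-then-PathRecord MAD round trips:
rows = db.execute("""
    SELECT p.sgid, p.dgid, p.slid, p.dlid
    FROM path_records p JOIN node_records n ON p.dgid = n.port_gid
    WHERE n.node_desc = ?
""", ("storage-1",)).fetchall()
```

The same lookup done over MADs would require a NodeRecord query to resolve the description to a GID, then a separate PathRecord query.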
Re: [openib-general] SA cache design
Greg Lindahl wrote:
> Since no one's really answered this yet: Many sysadmins are not going to
> want to install a relational database to run an SA cache. So I'd stick to
> Berkeley DB if I were you.

Thanks for the response. To be clear, the cache would be an optional component, and likely only needed for larger configurations.

From what I can tell, PostgreSQL and MySQL both ship with Red Hat and SuSE. MySQL claims that it can be built as a small library that can then be integrated with an application. It may be possible to have the application do everything for the user except install the necessary libraries... ?

The installation and configuration of a database is what I see as the biggest drawback to going this route. Unfortunately, I need to play with this idea more to see how much of an impact that would be on an actual user.

- Sean
Re: [openib-general] SA cache design
Since no one's really answered this yet: Many sysadmins are not going to want to install a relational database to run an SA cache. So I'd stick to Berkeley DB if I were you.

-- greg
RE: [openib-general] SA cache design
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
>
> Eitan Zahavi wrote:
>> Is the intention to speed up SA queries?
>> Or is it to have persistent storage of them?
>
> I want both. :)

I would clarify that the best bang for the effort will be to focus on the queries which the ULPs themselves will use most often -- for example, the resolution from a node name or node GUID to a path record. While a general-purpose replica would be nice, it could overcomplicate the initial design. The goal is not to optimize all the queries an end user might desire, but rather to help avoid the O(N^2) load which things like startup of an MPI or SDP application could cause on the SA.

> and how/when is it invalidated by the SM.

There are a variety of notices already available from the SM which could be used for triggering the invalidation, such as:

    GID In/Out of Service
    Client Reregistration

It may also be desirable to have the CM, upon a failed connect to a given remote node, trigger the local replica to invalidate and requery for information about the remote node.

Todd R.
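The invalidation triggers listed above could be wired up roughly like this (an illustrative Python sketch; class, method, and field names are invented):

```python
class ReplicaInvalidator:
    """Drops cached path records in response to SM notices: a GID
    out-of-service notice removes entries touching that GID, and a client
    reregistration notice flushes everything."""

    def __init__(self):
        self.paths = {}   # (sgid, dgid) -> cached path record

    def handle_gid_out_of_service(self, gid):
        for key in [k for k in self.paths if gid in k]:
            del self.paths[key]

    def handle_client_rereg(self):
        self.paths.clear()

    def handle_connect_failure(self, dgid):
        # CM failed to connect: invalidate so the next lookup requeries
        self.handle_gid_out_of_service(dgid)
```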
Re: [openib-general] SA cache design
Eitan Zahavi wrote:
> Is the intention to speed up SA queries? Or is it to have persistent
> storage of them?

I want both. :)

> I think we should focus on the kind of data to cache, how it is made
> transparently available to any OpenIB client, and how/when it is
> invalidated by the SM. We should only keep the cache data in memory, not
> on disk.

In order to support advanced queries efficiently, some sort of indexing scheme would be needed. This is what a database system would provide, saving us from having to implement that part. The fact that the database could also provide persistent storage and triggers is just an additional advantage.

> Later if we want to make it persistent or even stored in LDAP/SQL... I do
> not care. But the first implementation should be in memory.

I think that you're assuming that an initial implementation done just in memory would be quicker to complete. I'm not really wanting to write a complete throw-away solution capable of supporting only one or two very simple queries efficiently.

> BTW: most of the databases referred to in these mails do not support
> distributed shadow copies of centrally controlled tables.

Personally, I'd be happy with a simple database that provided nothing more than indexing and query support.

- Sean
Re: [openib-general] SA cache design
James Lentini wrote: Will it be possible to use the OpenIB stack without setting up the SA cache? Yes. - Sean
RE: [openib-general] SA cache design
> On Tue, 10 Jan 2006, Sean Hefty wrote: > > > Grant Grundler wrote: > > > I forgot to point out postgres: > > > http://www.postgresql.org/about/ > > > > This looks like it would work well. > > > > The question that I have for users is: Is it acceptable for the > > cache to make use of a relational database system? A relational database is overkill for this function. It will also likely be more complex for end users to set up and debug. The cache setup should be simple. The solution should be such that just an on/off switch needs to be configured (with a default of on) for most users to get started. Todd Rimmer
Re: [openib-general] SA cache design
Hi Sean, Now I really lost you: Is the intention to speed up SA queries? Or is it to have persistent storage of them? I think we should focus on the kind of data to cache, how it is made transparently available to any OpenIB client, and how/when it is invalidated by the SM. We should only keep the cache data in memory, not on disk. Later, if we want to make it persistent or even stored in LDAP/SQL... I do not care. But the first implementation should be in memory. BTW: most of the databases referred to in these mails do not support distributed shadow copies of centrally controlled tables. Eitan Sean Hefty wrote: To keep the design as flexible as possible, my plan is to implement the cache in userspace. The interface to the cache would be via MADs. Clients would send their queries to the sa_cache instead of the SA itself. The format of the MADs would be essentially identical to those used to query the SA itself. Response MADs would contain any requested information. If the cache could not satisfy a request, the sa_cache would query the SA, update its cache, then return a reply. What I think I really want is a distributed relational database management system with an SQL interface and triggers that maintains the SA data... (select * from path_rec where sgid=x and dgid=y and pkey=z) But without making any assumptions about the SA, a local cache could still use an RDBMS to store and retrieve the data records. Would requiring an RDBMS on each system be acceptable? If not, then writing a small, dumb pseudo-database as part of the sa_cache could provide a lot of flexibility. - Sean
Re: [openib-general] SA cache design
On Tue, 10 Jan 2006, Sean Hefty wrote: > Grant Grundler wrote: > > I forgot to point out postgres: > > http://www.postgresql.org/about/ > > This looks like it would work well. > > The question that I have for users is: Is it acceptable for the > cache to make use of a relational database system? Will it be possible to use the OpenIB stack without setting up the SA cache?
Re: [openib-general] SA cache design
Grant Grundler wrote: I forgot to point out postgres: http://www.postgresql.org/about/ This looks like it would work well. The question that I have for users is: Is it acceptable for the cache to make use of a relational database system? The disadvantage is that an RDBMS would need to be installed and configured on several, or all, systems. (It's not clear to me yet how much of that could be automated.) The advantage is that the cache would gain the benefits of having a database backend - notably support for more complex queries, persistent storage, and indexing to increase query performance. To provide some additional context, path record queries can be fairly complex, involving a number of fields. (All queries today are limited to sgid, dgid, and pkey.) Trying to efficiently retrieve a path record based on a dgid and pkey is non-trivial, and support for queries with additional restrictions or for other SA records complicates this issue. - Sean
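The indexing problem Sean raises can be sketched without a database: exact lookups key on (sgid, dgid, pkey), but a query giving only dgid and pkey needs a secondary index to avoid scanning the whole cache. This is an illustrative sketch; the class and record fields are invented, and a real cache would need such an index per queryable field combination.

```python
# Illustrative sketch of a secondary index for restricted path record
# queries (dgid + pkey only). Names and record contents are invented.

class IndexedPathCache:
    def __init__(self):
        self._by_key = {}        # (sgid, dgid, pkey) -> record
        self._by_dgid_pkey = {}  # (dgid, pkey) -> set of full keys

    def insert(self, sgid, dgid, pkey, record):
        full = (sgid, dgid, pkey)
        self._by_key[full] = record
        # Maintain the secondary index alongside the primary map.
        self._by_dgid_pkey.setdefault((dgid, pkey), set()).add(full)

    def query_full(self, sgid, dgid, pkey):
        return self._by_key.get((sgid, dgid, pkey))

    def query_dgid_pkey(self, dgid, pkey):
        # Answer the restricted query from the index in O(matches),
        # instead of scanning every cached record.
        keys = self._by_dgid_pkey.get((dgid, pkey), ())
        return [self._by_key[k] for k in keys]

cache = IndexedPathCache()
cache.insert("gidA", "gidZ", 0xFFFF, {"sl": 0})
cache.insert("gidB", "gidZ", 0xFFFF, {"sl": 1})
cache.insert("gidA", "gidY", 0xFFFF, {"sl": 2})
matches = cache.query_dgid_pkey("gidZ", 0xFFFF)
```

This is essentially what an RDBMS index would buy for free; the debate in the thread is whether maintaining a few such maps by hand is cheaper than carrying a database dependency.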
Re: [openib-general] SA cache design
On Tue, Jan 10, 2006 at 03:00:46PM -0800, Sean Hefty wrote: > I did find that libdb-4.2 was installed on SuSE and RedHat systems, and a > libodbc was on my SuSE system. Libdb-4.2 would help manage some of the SA > objects to a file, but is limited in its data storage and retrieval > capabilities. If a true relational database couldn't be used, libdb would > definitely be useful. I forgot to point out postgres: http://www.postgresql.org/about/ Several packages (e.g. postfix, ldap) offer different backends so the admin can decide how sophisticated the data storage and retrieval needs to be. With roughly 150K employees, HP has a rather sophisticated LDAP/postfix setup to manage logins. But I don't need that for the 10 boxes I manage outside the firewall. Same is probably true for SA cache. grant
Re: [openib-general] SA cache design
Grant Grundler wrote: We already have several databases for different things: makedb (primarily for NSS) updatedb (fast lookup of local files) mandb (man pages) rpmdb (yes, even on debian boxes) sasldbconverter2 (for SASL - linux security/login stuff) *db4.3* (Berkeley v4.3 Database - used by apt-get/dpkg, Apache, python, libns-db, postfix, etc) In fact, looks like a debian "testing" box would be dysfunctional without Berkeley Database. Would that work? Thanks for pointing these out. I did find that libdb-4.2 was installed on SuSE and RedHat systems, and a libodbc was on my SuSE system. Libdb-4.2 would help manage some of the SA objects to a file, but is limited in its data storage and retrieval capabilities. If a true relational database couldn't be used, libdb would definitely be useful. - Sean
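The "manage SA objects to a file" idea amounts to a keyed on-disk store. libdb itself is a C library, but Python's stdlib `dbm` module wraps the same family of stores, so it serves as a compact sketch of the approach; the key format (`"sgid|dgid|pkey"`) and record fields here are invented for the example.

```python
# Sketch of persisting path records in a keyed file store, in the style
# of what libdb would provide. Key format and fields are illustrative.
import dbm
import json
import os
import tempfile

def store_records(path, records):
    # records: {"sgid|dgid|pkey": {...path record fields...}}
    with dbm.open(path, "c") as db:
        for key, rec in records.items():
            # dbm stores raw bytes, so serialize each record.
            db[key.encode()] = json.dumps(rec).encode()

def load_record(path, key):
    k = key.encode()
    with dbm.open(path, "r") as db:
        return json.loads(db[k]) if k in db else None

# Demo: round-trip one record through the on-disk store.
tmpdir = tempfile.mkdtemp()
cache_file = os.path.join(tmpdir, "sa_cache")
store_records(cache_file, {"gidA|gidB|0xffff": {"sl": 0, "mtu": 2048}})
hit = load_record(cache_file, "gidA|gidB|0xffff")
miss = load_record(cache_file, "gidA|gidC|0xffff")
```

Note the limitation Sean points out: a store like this only supports exact-key lookups, so the composite key must encode every field you want to query on, or secondary indexes must be maintained by hand.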
Re: [openib-general] SA cache design
On Tue, Jan 10, 2006 at 10:55:36AM -0800, Sean Hefty wrote: > What I think I really want is a distributed relational database management > system with an SQL interface and triggers that maintains the SA data... > (select * from path_rec where sgid=x and dgid=y and pkey=z) > > But without making any assumptions about the SA, a local cache could still > use an RDBMS to store and retrieve the data records. Would requiring an > RDBMS on each system be acceptable? We already have several databases for different things: makedb (primarily for NSS) updatedb (fast lookup of local files) mandb (man pages) rpmdb (yes, even on debian boxes) sasldbconverter2 (for SASL - linux security/login stuff) *db4.3* (Berkeley v4.3 Database - used by apt-get/dpkg, Apache, python, libns-db, postfix, etc) In fact, looks like a debian "testing" box would be dysfunctional without Berkeley Database. Would that work? sleepycat.org gives more examples of opensource use: OpenLDAP, Kerberos, Subversion, Sendmail, Postfix, SquidGuard, NetaTalk, Movable Type, SpamAssassin, Mail Avenger, Bogofilter hth, grant > If not, then writing a small, dumb > pseudo-database as part of the sa_cache could provide a lot of flexibility. > > - Sean
Re: [openib-general] SA cache design
Sean Hefty wrote: To keep the design as flexible as possible, my plan is to implement the cache in userspace. The interface to the cache would be via MADs. Clients would send their queries to the sa_cache instead of the SA itself. The format of the MADs would be essentially identical to those used to query the SA itself. Response MADs would contain any requested information. If the cache could not satisfy a request, the sa_cache would query the SA, update its cache, then return a reply. What I think I really want is a distributed relational database management system with an SQL interface and triggers that maintains the SA data... (select * from path_rec where sgid=x and dgid=y and pkey=z) But without making any assumptions about the SA, a local cache could still use an RDBMS to store and retrieve the data records. Would requiring an RDBMS on each system be acceptable? If not, then writing a small, dumb pseudo-database as part of the sa_cache could provide a lot of flexibility. - Sean
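The lookup path Sean describes (answer from the cache when possible, otherwise query the SA, update the cache, then reply) can be sketched as follows. This is an illustration only: `query_sa` is a stand-in callable, not a real MAD interface, and the class name is invented.

```python
# Sketch of the sa_cache miss-fallback path. query_sa stands in for a
# real SA query over MADs; everything here is illustrative.

class SaCache:
    def __init__(self, query_sa):
        self._query_sa = query_sa  # (sgid, dgid, pkey) -> path record
        self._records = {}
        self.sa_queries = 0        # how often we fell through to the SA

    def get_path_record(self, sgid, dgid, pkey):
        key = (sgid, dgid, pkey)
        rec = self._records.get(key)
        if rec is None:
            # Cache miss: query the real SA, remember the answer, reply.
            self.sa_queries += 1
            rec = self._query_sa(sgid, dgid, pkey)
            self._records[key] = rec
        return rec

def fake_sa(sgid, dgid, pkey):
    # Stand-in for the real SA: fabricates a trivial record.
    return {"sgid": sgid, "dgid": dgid, "pkey": pkey, "sl": 0}

cache = SaCache(fake_sa)
first = cache.get_path_record("gidA", "gidB", 0xFFFF)
second = cache.get_path_record("gidA", "gidB", 0xFFFF)
```

The point of the design is visible in the demo: repeated identical queries cost the SA only one request, which is what turns the O(N^2) startup load discussed elsewhere in the thread into O(N).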
RE: [openib-general] SA cache design
Hi Sean I am still confused about the exact requirement. But the reference is: osm/opensm/osm_sa_path_record.c The rest of the queries are handled by osm_sa_*.c (but not the code in _ctrl.c). osm_sa_class_port_info.c osm_sa_response.c osm_sa_node_record.c osm_sa_service_record.c osm_sa_informinfo.c osm_sa_path_record.c osm_sa_slvl_record.c osm_sa_lft_record.c osm_sa_lft_record_ctrl.c osm_sa_sminfo_record.c osm_sa_link_record.c osm_sa_pkey_record.c osm_sa_vlarb_record.c osm_sa_mad_ctrl.c osm_sa_portinfo_record.c osm_sa_mcmember_record.c Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sean Hefty [mailto:[EMAIL PROTECTED] > Sent: Friday, January 06, 2006 10:40 PM > To: Hal Rosenstock > Cc: Eitan Zahavi; openib > Subject: Re: [openib-general] SA cache design > > Hal Rosenstock wrote: > > I would view that the database is an SADB with the actual pathrecords as > > one example rather than the SMDB from which they are calculated. I think > > Sean is interested in the SA packet query/response code here, so as to avoid > > recreating it, and that the backend would be stripped out. Sean, is > > that accurate ? > > Hal is correct. > > - Sean
Re: [openib-general] SA cache design
Hal Rosenstock wrote: I would view that the database is an SADB with the actual pathrecords as one example rather than the SMDB from which they are calculated. I think Sean is interested in the SA packet query/response code here, so as to avoid recreating it, and that the backend would be stripped out. Sean, is that accurate ? Hal is correct. - Sean
Re: [openib-general] SA cache design
Sean Hefty wrote: - The MAD interface will result in additional data copies and userspace to kernel transitions for clients residing on the local system. - Clients require a mechanism to locate the sa_cache, or need to make assumptions about its location. Based on some comments from people, I believe that we can handle the latter problem when the sa_cache/sa_replica/sa_whateveryouwanttocallit registers with the MAD layer. Ib_mad can record an sa_lid and sa_sl as part of a device's port attributes. These would initially be set the same as sm_lid and sm_sl. When a client registers to receive unsolicited SA MADs, the attributes would be updated accordingly. ib_sa and other clients sending MADs to the SA would use these values in place of the SM values. I'm not fond of the idea of pushing an SA switch into the MAD layer, since this makes it more difficult for the actual cache to query the SA directly. Another approach that may work better long term is treating the cache as a redirected SA request. Something along the lines of: http://openib.org/pipermail/openib-general/2005-September/011349.html (but with a restricted implementation for now) might also work. - Sean
Re: [openib-general] SA cache design
On Fri, 2006-01-06 at 14:55, Eitan Zahavi wrote: > I guess you mean the code that is answering to PathRecord queries? > It is possible to extract the "SMDB" objects and duplicate that database. > I am not sure it is such a good idea. What if the SM is not OpenSM? I would view that the database is an SADB with the actual pathrecords as one example rather than the SMDB from which they are calculated. I think Sean is interested in the SA packet query/response code here, so as to avoid recreating it, and that the backend would be stripped out. Sean, is that accurate ? -- Hal
Re: [openib-general] SA cache design
On Fri, 2006-01-06 at 15:13, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Fri, 2006-01-06 at 15:00, Eitan Zahavi wrote: > > > I agree with Todd: a key is to keep the client unaware of the mux existence. > > > So the same client can be run on a system without the cache. > > Define same client ? I would consider it the same SA client directing > > requests differently based on how the mux is configured, based on a query > > to the cache (if it exists) as to its capabilities. > SA Client can be embedded in an application - any program that can send mads > can be an SA client. Such (non-OpenIB) clients would not take advantage of the cache. That seems like the tradeoff for avoiding the duplicated forwarding of the request. Guess I'm in the minority thinking that this might be worthwhile. -- Hal
Re: [openib-general] SA cache design
Sean Hefty wrote: Eitan Zahavi wrote: Can someone familiar with the opensm code tell me how difficult it would be to extract out the code that tracks the subnet data and responds to queries? I guess you mean the code that is answering to PathRecord queries? Yes - that along with answering other queries. It is possible to extract the "SMDB" objects and duplicate that database. I am not sure it is such a good idea. What if the SM is not OpenSM? I was thinking in terms of code re-use, and not in terms of which SM was running. Interfacing to the SM would be through standard queries. The issue is that answering PathRecord queries can have an impact on further algorithms the SM takes. It might not be enough to know the topology, SL2VL, LFT, and MFT to answer PathRecord attributes... - Sean
Re: [openib-general] SA cache design
Hal Rosenstock wrote: On Fri, 2006-01-06 at 15:00, Eitan Zahavi wrote: I agree with Todd: a key is to keep the client unaware of the mux existence. So the same client can be run on a system without the cache. Define same client ? I would consider it the same SA client directing requests differently based on how the mux is configured, based on a query to the cache (if it exists) as to its capabilities. SA Client can be embedded in an application - any program that can send mads can be an SA client. -- Hal Hal Rosenstock wrote: On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote: From: Hal Rosenstock [mailto:[EMAIL PROTECTED] On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: This of course implies the "SA Mux" must analyze more than just the attribute ID to determine if the replica can handle the query. But the memory savings is well worth the extra level of filtering. If the SA cache does this, it seems it would be pretty simple to return this info in an attribute to the client so the client would know when to go to the cache/replica and when to go direct to the SA in the case where only certain queries are supported. Wouldn't this be advantageous when the replica doesn't support all queries ? Why put the burden on the application? Give the query to the Mux. That's what I'm suggesting. Rather than a binary switch mux, a more granular one which determines how to route the outgoing SA request. With an optional flag indicating a preferred "routing" (choices of: to SA, to replica, let Mux decide). Then let it decide. As you suggest it may be simplest to let the Mux try the replica and on failure fall back to the SA transparent to the app (sort of the way SDP intercepts socket ops and falls back to TCP/IP when SDP isn't appropriate). It depends on whether the replica/cache forwards unsupported requests on or responds with not supported back to the client as to how this is handled. Sean was proposing the forward-on model and a binary switch at the client. I think this is more granular and can be mux'd only with the knowledge of what a replica/cache supports (not sure about dealing with different replica/caches supporting a different set of queries; need to think more on how the caches are located, etc.). You are mentioning a third model here. -- Hal
Re: [openib-general] SA cache design
Hi Todd, So you agree we will need to design "replica" buildup scalability features into the solution (to avoid the bring-up load on the SA)? Why would a caching system not work here, instead of replicating the data? The caching concept allows the SA to still be in the loop, by invalidating the cache or through a cache entry lifetime policy. The reason I think a total replica (distribution of the SA) would eventually be problematic is that as we approach QoS solutions, some need for path record use and retirement is going to show up. What if the SM decides to change SL2VL maps due to a new QoS requirement? We will need a more complicated "synchronization" or invalidation technique to push that kind of data into the "replica" SAs. Eitan Rimmer, Todd wrote: From: Eitan Zahavi [mailto:[EMAIL PROTECTED] Hi Sean, Todd, Although I like the "replica" idea for its "query" performance boost - I suspect it will actually not scale for very large networks: having each node query for the entire database would cause N^2 load on the SA. After any change (which do happen with higher probability on large networks) the SA will need to send each Report to N targets. We already have some bad experience with large clusters' SA query issues, like the one reported by Roland: "searching for SRP targets using PortInfo capability mask". Our experience has been the exact opposite. While there is an initial load on the SA to populate the replica (which we have used various techniques to reduce, such as backing off when the SA reports Busy, having a random time offset of start of query, etc.). The boost occurs when a new application starts, such as an MPI using the SA/CM to establish connections as per the IBTA spec. A 1000-process MPI job would have each process make 999 queries to the SA at job startup time. This causes a burst of 999 sets of SA queries (most will involve both Node Record and Path Record queries, so it will really be 2x this amount) BEFORE the MPI job can actually start. As OpenIB moves forward to implement QoS and other features, MPI will have to use the SA to get its path records. If you study MVAPICH at present, it merely exchanges LIDs between nodes and hardcodes all the other QoS parameters (or uses the same value for all processes via environment variables). In a true QoS and congestion management environment it will instead have to use the CM/SA. We have been using this replica technique quite successfully for 2-3 years now. Our MPI has used the SA/CM for connection establishment for just as long. As it was pointed out, most fabrics will be quite stable. Hence having a replica and paying the cost of the SA queries once will be much more efficient than paying that cost on every application startup. Todd Rimmer
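The load-spreading techniques Todd mentions (backing off when the SA reports Busy, plus a random time offset) can be sketched as exponential backoff with jitter. The constants here are illustrative, not values from any shipping stack.

```python
# Sketch of exponential backoff with jitter for replica population at
# bring-up, so N nodes do not retry against the SA in lockstep.
# Base delay, cap, and jitter range are illustrative constants.
import random

def backoff_schedule(attempts, base=0.1, cap=30.0, rng=None):
    """Return per-attempt retry delays in seconds: doubling up to a cap,
    each scattered by +/-50% random jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))  # 0.1, 0.2, 0.4, ... capped
        delays.append(delay * rng.uniform(0.5, 1.5))
    return delays

# Demo with a fixed seed so the schedule is reproducible.
sched = backoff_schedule(8, rng=random.Random(42))
```

The jitter is what prevents the synchronized retry storm: even if all N nodes start at the same instant and hit Busy together, their retries diverge after the first round.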
Re: [openib-general] SA cache design
On Fri, 2006-01-06 at 15:00, Eitan Zahavi wrote: > I agree with Todd: a key is to keep the client unaware of the mux existence. > So the same client can be run on a system without the cache. Define same client ? I would consider it the same SA client directing requests differently based on how the mux is configured, based on a query to the cache (if it exists) as to its capabilities. -- Hal
Re: [openib-general] SA cache design
On Fri, 2006-01-06 at 14:50, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > > > This of course implies the "SA Mux" must analyze more than just > > > the attribute ID to determine if the replica can handle the query. > > > But the memory savings is well worth the extra level of filtering. > > If the SA cache does this, it seems it would be pretty simple to return > > this info in an attribute to the client so the client would know when to > > go to the cache/replica and when to go direct to the SA in the case > > where only certain queries are supported. Wouldn't this be advantageous > > when the replica doesn't support all queries ? > I think we want to make the client totally unaware of the > existence of the cache. Perhaps. I would express this differently: the client should be as unaware as possible (the muxing on a per-attribute basis to direct the request seems reasonably straightforward). > So the cache itself will simply forward the message (maybe changing TID). Yes, the transformation at the cache should be as trivial as possible. I would like to eliminate the doubling up of packets when unnecessary (for requests that the cache does not support, rather than ones it does support but does not have the information for). -- Hal
Re: [openib-general] SA cache design
Eitan Zahavi wrote: Can someone familiar with the opensm code tell me how difficult it would be to extract out the code that tracks the subnet data and responds to queries? I guess you mean the code that is answering to PathRecord queries? Yes - that along with answering other queries. It is possible to extract the "SMDB" objects and duplicate that database. I am not sure it is such a good idea. What if the SM is not OpenSM? I was thinking in terms of code re-use, and not in terms of which SM was running. Interfacing to the SM would be through standard queries. - Sean
Re: [openib-general] SA cache design
I agree with Todd: a key is to keep the client unaware of the mux existence. So the same client can be run on a system without the cache. Hal Rosenstock wrote: On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote: From: Hal Rosenstock [mailto:[EMAIL PROTECTED] On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: This of course implies the "SA Mux" must analyze more than just the attribute ID to determine if the replica can handle the query. But the memory savings is well worth the extra level of filtering. If the SA cache does this, it seems it would be pretty simple to return this info in an attribute to the client so the client would know when to go to the cache/replica and when to go direct to the SA in the case where only certain queries are supported. Wouldn't this be advantageous when the replica doesn't support all queries ? Why put the burden on the application? Give the query to the Mux. That's what I'm suggesting. Rather than a binary switch mux, a more granular one which determines how to route the outgoing SA request. With an optional flag indicating a preferred "routing" (choices of: to SA, to replica, let Mux decide). Then let it decide. As you suggest it may be simplest to let the Mux try the replica and on failure fall back to the SA transparent to the app (sort of the way SDP intercepts socket ops and falls back to TCP/IP when SDP isn't appropriate). It depends on whether the replica/cache forwards unsupported requests on or responds with not supported back to the client as to how this is handled. Sean was proposing the forward-on model and a binary switch at the client. I think this is more granular and can be mux'd only with the knowledge of what a replica/cache supports (not sure about dealing with different replica/caches supporting a different set of queries; need to think more on how the caches are located, etc.). You are mentioning a third model here. -- Hal
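The dispatch policy being debated in this exchange (route by attribute ID to the replica when supported, fall back to the real SA on a miss, transparently to the client) can be sketched as follows. This is an illustration: the callables stand in for real MAD plumbing, and while the attribute IDs follow the SA class (PathRecord 0x35, NodeRecord 0x11), the mux interface itself is invented.

```python
# Sketch of the "SA mux" routing model: per-attribute routing with
# transparent fallback to the real SA. Callables are stand-ins for MADs.

PATH_RECORD = 0x35  # SA class attribute IDs (per the IBTA SA class)
NODE_RECORD = 0x11

class SaMux:
    def __init__(self, replica, sa, supported):
        self._replica = replica      # (attr_id, filter) -> record or None
        self._sa = sa                # (attr_id, filter) -> record
        self._supported = set(supported)

    def query(self, attr_id, filt):
        if attr_id in self._supported:
            result = self._replica(attr_id, filt)
            if result is not None:
                return result        # replica satisfied the query
        # Unsupported attribute, or replica miss: forward to the real SA.
        return self._sa(attr_id, filt)

# Demo: a replica that only knows one path record.
replica_data = {"gidA->gidB": {"sl": 0}}
def replica(attr_id, filt):
    return replica_data.get(filt)
def real_sa(attr_id, filt):
    return {"from": "sa", "filter": filt}

mux = SaMux(replica, real_sa, supported=[PATH_RECORD])
hit = mux.query(PATH_RECORD, "gidA->gidB")     # answered by the replica
miss = mux.query(PATH_RECORD, "gidA->gidC")    # replica miss -> SA
unsupported = mux.query(NODE_RECORD, "nodeX")  # attribute not cached -> SA
```

This corresponds to the "try the replica, fall back on failure" variant; the alternative Hal raises, where the client learns the replica's capabilities up front, would move the `supported` check out of the mux and into the client.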
Re: [openib-general] SA cache design
Sean Hefty wrote: Eitan Zahavi wrote: So if the cache is on another host - a new kind of MAD will have to be sent on behalf of the original request? I was thinking more in terms of redirection. In IB, QoS properties are mainly the PathRecord parameters: SL, Rate, MTU, PathBits (LMC bits). So if traditionally we had a PathRecord requested for each Src->Dst port, now we will need to track at least: Src->Dst * #QoS-levels. (A non-optimal implementation will require even more: #Src->Dst * #Clients * #Servers * #Services.) I understand you now. Can someone familiar with the opensm code tell me how difficult it would be to extract out the code that tracks the subnet data and responds to queries? I guess you mean the code that is answering to PathRecord queries? It is possible to extract the "SMDB" objects and duplicate that database. I am not sure it is such a good idea. What if the SM is not OpenSM? - Sean
Re: [openib-general] SA cache design
Hal Rosenstock wrote: On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: This of course implies the "SA Mux" must analyze more than just the attribute ID to determine if the replica can handle the query. But the memory savings is well worth the extra level of filtering. If the SA cache does this, it seems it would be pretty simple to return this info in an attribute to the client so the client would know when to go to the cache/replica and when to go direct to the SA in the case where only certain queries are supported. Wouldn't this be advantageous when the replica doesn't support all queries? I think we want to make the client totally unaware of the existence of the cache. So the cache itself will simply forward the message (maybe changing TID). -- Hal
Re: [openib-general] SA cache design
On Fri, 2006-01-06 at 13:59, Sean Hefty wrote: > Eitan Zahavi wrote: > > So if the cache is on another host - a new kind of MAD will have to be > > sent on behalf of > > the original request? > > I was thinking more in terms of redirection. > > > In IB QoS properties are mainly the PathRecord parameters: SL, Rate, > > MTU, PathBits (LMC bits). > > So if traditionally we had PathRecord requested for each Src->Dst port > > now we will need to track at least: > > Src->Dst * #QoS-levels. (a non optimal implementation will require even > > more: #Src->Dst * #Clients * #Servers * #Services). > > I understand you now. I'm not sure about the granularity this needs tracking at. > Can someone familiar with the opensm code tell me how difficult it would be > to > extract out the code that tracks the subnet data and responds to queries? Although I don't think that is difficult, IMO it is more a matter of whether you want to buy into the architecture with the component and vendor libraries. I can help with this if this is the direction chosen. I would make this another build option. The other question is how this would be changed so that when the data is not present the real SA is queried. -- Hal
Re: [openib-general] SA cache design
Eitan Zahavi wrote: So if the cache is on another host - a new kind of MAD will have to be sent on behalf of the original request? I was thinking more in terms of redirection. In IB QoS properties are mainly the PathRecord parameters: SL, Rate, MTU, PathBits (LMC bits). So if traditionally we had PathRecord requested for each Src->Dst port now we will need to track at least: Src->Dst * #QoS-levels. (a non optimal implementation will require even more: #Src->Dst * #Clients * #Servers * #Services). I understand you now. Can someone familiar with the opensm code tell me how difficult it would be to extract out the code that tracks the subnet data and responds to queries? - Sean
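Eitan's point above — that with QoS the cache can no longer key path records on source and destination alone — can be sketched as a compound lookup key. The struct, the qos_level field, and the hash below are illustrative assumptions, not taken from any OpenIB header:

```c
#include <stdint.h>

/* Illustrative cache key: under QoS, Src->Dst alone is not enough, since
 * the same port pair may need distinct SL/Rate/MTU combinations.
 * qos_level is an assumed per-subnet QoS class index. */
struct pr_cache_key {
    uint64_t src_guid;
    uint64_t dst_guid;
    uint8_t  qos_level;
};

static int pr_key_equal(const struct pr_cache_key *a,
                        const struct pr_cache_key *b)
{
    return a->src_guid == b->src_guid &&
           a->dst_guid == b->dst_guid &&
           a->qos_level == b->qos_level;
}

/* Simple bucket hash so cache lookups stay O(1) on the fast path. */
static uint32_t pr_key_hash(const struct pr_cache_key *k, uint32_t nbuckets)
{
    uint64_t h = k->src_guid ^ (k->dst_guid * 0x9e3779b97f4a7c15ULL)
               ^ k->qos_level;
    return (uint32_t)(h % nbuckets);
}
```

The non-optimal expansion Eitan warns about (keying per client/server/service as well) would simply add more fields here, which is why the key layout deserves thought up front.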
Re: [openib-general] SA cache design
Hi Eitan, [snip...] > >> So if a new client wants to connect to another node a new PathRecord > >> query will not need to be sent to the SA. However, recent work on QoS has > >>pointed out > >> that under some QoS schemes PathRecord should not be shared by different > >>clients > > > > > > I'm not sure that QoS handling is the responsibility of the cache. The > > module > > requesting the path records should probably deal with this. > In IB QoS properties are mainly the PathRecord parameters: SL, Rate, MTU, > PathBits (LMC bits). > So if traditionally we had PathRecord requested for each Src->Dst port now we > will need to > track at least: > Src->Dst * #QoS-levels. (a non optimal implementation will require even more: > #Src->Dst * #Clients * #Servers * #Services). Perhaps QoS requests (I'm referring to those with the new proposed key) are not cached, as I think this may end up with the cache needing to know the path record policies. I would propose deferring this aspect until the new QoS work is a little firmer and the cache direction in OpenIB is also a little firmer (e.g. QoS = phase 2 or beyond of this work). [snip...] -- Hal
RE: [openib-general] SA cache design
On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote: > > From: Hal Rosenstock [mailto:[EMAIL PROTECTED] > > On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > > > This of course implies the "SA Mux" must analyze more than just > > > the attribute ID to determine if the replica can handle the query. > > > But the memory savings is well worth the extra level of filtering. > > > > If the SA cache does this, it seems it would be pretty simple > > to return > > this info in an attribute to the client so the client would > > know when to > > go to the cache/replica and when to go direct to the SA in the case > > where only certain queries are supported. Wouldn't this be > > advantageous > > when the replica doesn't support all queries? > > Why put the burden on the application? Give the query to the Mux. That's what I'm suggesting. Rather than a binary switch mux, a more granular one which determines how to route the outgoing SA request. > With an optional flag indicating a preferred "routing" (choices of: to SA, > to replica, let Mux decide). Then let it decide. As you suggest it may > be simplest to let the Mux try the replica and on failure fall back > to the SA transparent to the app (sort of the way SDP intercepts > socket ops and falls back to TCP/IP when SDP isn't appropriate). How this is handled depends on whether the replica/cache forwards unsupported requests on or responds to the client with "not supported". Sean was proposing the forward-on model and a binary switch at the client. I think this is more granular and can be mux'd only with the knowledge of what a replica/cache supports (not sure about dealing with different replica/caches supporting a different set of queries; need to think more on how the caches are located, etc.). You are mentioning a third model here. 
-- Hal
RE: [openib-general] SA cache design
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED] > On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > > This of course implies the "SA Mux" must analyze more than just > > the attribute ID to determine if the replica can handle the query. > > But the memory savings is well worth the extra level of filtering. > > If the SA cache does this, it seems it would be pretty simple > to return > this info in an attribute to the client so the client would > know when to > go to the cache/replica and when to go direct to the SA in the case > where only certain queries are supported. Wouldn't this be > advantageous > when the replica doesn't support all queries? Why put the burden on the application? Give the query to the Mux. With an optional flag indicating a preferred "routing" (choices of: to SA, to replica, let Mux decide). Then let it decide. As you suggest it may be simplest to let the Mux try the replica and on failure fall back to the SA transparent to the app (sort of the way SDP intercepts socket ops and falls back to TCP/IP when SDP isn't appropriate). Todd R.
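The routing Todd suggests — an optional per-query hint (to SA, to replica, or let the Mux decide), with the Mux consulting what the replica actually supports — might look like the sketch below. All names are illustrative; only the SA PathRecord attribute ID (0x35) is real:

```c
#include <stdint.h>

enum sa_route { SA_ROUTE_AUTO, SA_ROUTE_REPLICA, SA_ROUTE_SA };

/* Example support predicate: a replica that only answers PathRecord
 * queries (SA attribute ID 0x35). */
static int replica_supports(uint16_t attr_id)
{
    return attr_id == 0x35;
}

static enum sa_route sa_mux_route(enum sa_route hint,
                                  int (*supports)(uint16_t),
                                  uint16_t attr_id)
{
    if (hint != SA_ROUTE_AUTO)
        return hint;                 /* caller forced a destination */
    if (supports && supports(attr_id))
        return SA_ROUTE_REPLICA;     /* try replica first */
    return SA_ROUTE_SA;
}
```

On a replica failure the Mux would reissue the request to the real SA, keeping the fallback invisible to the application, much like the SDP-to-TCP fallback Todd mentions.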
RE: [openib-general] SA cache design
On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > This of course implies the "SA Mux" must analyze more than just > the attribute ID to determine if the replica can handle the query. > But the memory savings is well worth the extra level of filtering. If the SA cache does this, it seems it would be pretty simple to return this info in an attribute to the client so the client would know when to go to the cache/replica and when to go direct to the SA in the case where only certain queries are supported. Wouldn't this be advantageous when the replica doesn't support all queries? -- Hal
RE: [openib-general] SA cache design
> From: Eitan Zahavi [mailto:[EMAIL PROTECTED] > Hi Sean, Todd, > > Although I like the "replica" idea for its "query" > performance boost - I suspect it will actually not scale > for very large > networks: each node querying for the entire database > would cause N^2 load on the SA. > After any change (which happens with higher probability on > large networks) the SA will need to send each Report to N targets. > > We already have some bad experience with large clusters SA > query issues, like the one reported by Roland > "searching for SRP targets using PortInfo capability mask". > Our experience has been the exact opposite. While there is an initial load on the SA to populate the replica (which we have used various techniques to reduce, such as backing off when the SA reports Busy, having a random time offset of start of query, etc.), the boost occurs when a new application starts, such as an MPI using the SA/CM to establish connections as per the IBTA spec. A 1000-process MPI job would have each process make 999 queries to the SA at job startup time. This causes a burst of 999 sets of SA queries (most will involve both Node Record and Path Record queries, so it will really be 2x this amount) BEFORE the MPI job can actually start. As Open IB moves forward to implement QOS and other features, MPI will have to use the SA to get its path records. If you study MVAPICH at present, it merely exchanges LIDs between nodes and hardcodes all the other QOS parameters (or, via environment variables, uses the same value for all processes). In a true QOS and congestion management environment it will instead have to use the CM/SA. We have been using this replica technique quite successfully for 2-3 years now. Our MPI has used the SA/CM for connection establishment for just as long. As it was pointed out, most fabrics will be quite stable. 
Hence having a replica and paying the cost of the SA queries once will be much more efficient than paying that cost on every application startup. Todd Rimmer
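Todd's scaling argument can be put in rough numbers. The two functions below are a back-of-envelope sketch (two queries per peer: node record plus path record), not measured data:

```c
/* Per job start, every one of n processes queries its n-1 peers,
 * with a node record and a path record query each. */
static unsigned long queries_per_job_start(unsigned long n)
{
    return 2UL * n * (n - 1);
}

/* With a replica, each of the n nodes pulls its O(n) slice once. */
static unsigned long queries_per_replica_sync(unsigned long n)
{
    return 2UL * n;
}
```

For Todd's 1000-process example this is roughly two million SA queries per job start versus a few thousand for a one-time replica sync, which is the core of the amortization argument.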
Re: [openib-general] SA cache design
Hi Sean, Todd, Although I like the "replica" idea for its "query" performance boost - I suspect it will actually not scale for very large networks: each node querying for the entire database would cause N^2 load on the SA. After any change (which happens with higher probability on large networks) the SA will need to send each Report to N targets. We already have some bad experience with large clusters SA query issues, like the one reported by Roland "searching for SRP targets using PortInfo capability mask". Eitan Sean Hefty wrote: - It is implemented in kernel mode - while user mode may help during initial debug, it will be important for kernel mode ULPs such as SRP, IPoIB and SDP to also make use of these records Your kernel footprint is smaller than I expected, which is good. Note that with a MAD interface, kernel modules would still have access to any cached data. I also wanted to stick with usermode to allow saving the cache to disk, so that it would be available immediately after a reboot. (My assumption being that changes to the network topology would be rare, so we could optimize around a stable network design.) As a related topic, there will be a separate SA client interface defined that will generate SA query MADs for the user. - Sean
Re: [openib-general] SA cache design
Hi Sean, Please see below. Sean Hefty wrote: * Regarding the sentence:"Clients would send their queries to the sa_cache instead of the SA" I would propose that a "SA MAD send switch" be implemented in the core: Such a switch will enable plugging in the SA cache (I would prefer calling it SA local agent due to its extended functionality). Once plugged in, this "SA local agent" should be forwarded all outgoing SA queries. Once it handles the MAD it should be able to inject the response through the core "SA MAD send switch" as if they arrived from the wire. This was my thought as well. I hesitated to refer to the cache as a local agent, since that's an implementation detail. I want to allow the possibility for the cache to reside on another system. For the initial implementation, the cache would be local however. So if the cache is on another host - a new kind of MAD will have to be sent on behalf of the original request? Functional requirements: * It is clear that the first SA query to cache is PathRecord. This will be the first cached query in the initial check-in. So if a new client wants to connect to another node a new PathRecord query will not need to be sent to the SA. However, recent work on QoS has pointed out that under some QoS schemes PathRecord should not be shared by different clients I'm not sure that QoS handling is the responsibility of the cache. The module requesting the path records should probably deal with this. In IB QoS properties are mainly the PathRecord parameters: SL, Rate, MTU, PathBits (LMC bits). So if traditionally we had PathRecord requested for each Src->Dst port now we will need to track at least: Src->Dst * #QoS-levels. (a non optimal implementation will require even more: #Src->Dst * #Clients * #Servers * #Services). * Forgive me for bringing the following issue - over and over to the group: Multicast Join/Leave should be reference counted. 
The "SA local agent" could be the right place for doing this kind of reference counting (actually if it does that it probably needs to be located in the Kernel - to enable cleanup after killed processes). I agree that this is a problem, but my preference would be for a dedicated kernel module to handle multicast join/leave requests. Since we already sniff into the SA queries it makes sense to have the same code also handle other functionality that requires sniffing into the SA requests. As Hal points out this involves both ServiceRecord, Multicast Join/Leave and InformInfo requests. Multicast Join/Leave actually behaves like a cache: if a "join" to the same MGID already took place (no leave yet) then there is no need to send the new request to the SA. - Sean
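Eitan's reference-counting proposal might look like the sketch below: only the first join for an MGID generates a real SA join, and only the last leave generates a real SA leave. The fixed-size table and names are illustrative; a kernel version would also need locking and per-process cleanup on exit, as Eitan notes:

```c
#include <stdint.h>
#include <string.h>

#define MAX_GROUPS 64   /* fixed table size, for illustration only */

struct mcast_ref {
    uint8_t mgid[16];
    int     refcount;
};

static struct mcast_ref groups[MAX_GROUPS];

/* Returns 1 if a real SA join must be sent, 0 if the group is already
 * joined, -1 if the table is full. */
static int mcast_join(const uint8_t mgid[16])
{
    int i, free_slot = -1;

    for (i = 0; i < MAX_GROUPS; i++) {
        if (groups[i].refcount && !memcmp(groups[i].mgid, mgid, 16)) {
            groups[i].refcount++;
            return 0;
        }
        if (!groups[i].refcount && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;
    memcpy(groups[free_slot].mgid, mgid, 16);
    groups[free_slot].refcount = 1;
    return 1;
}

/* Returns 1 if a real SA leave must be sent (last reference dropped). */
static int mcast_leave(const uint8_t mgid[16])
{
    int i;

    for (i = 0; i < MAX_GROUPS; i++)
        if (groups[i].refcount && !memcmp(groups[i].mgid, mgid, 16))
            return --groups[i].refcount == 0;
    return 0;
}
```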
RE: [openib-general] SA cache design
>> Note that with >> a MAD interface, kernel modules would still have access to >> any cached data. I >> also wanted to stick with usermode to allow saving the cache >> to disk, so that it >> would be available immediately after a reboot. (My >> assumption being that >> changes to the network topology would be rare, so we could >> optimize around a >> stable network design.) >It is risky to assume that PathRecords would stay the same across a node >reboot. It is very likely that the SM could assign different LIDs or if the >node is down for an extended period other things in the fabric could have >significantly changed. OpenSM currently maintains LIDs between system reboots, which is desirable for fast fabric bring-up, and I believe this is a desirable feature for any SM to have. In any case, a local LID change is trivial to detect and can easily be used to invalidate the entire cache. Likewise, the cache could automatically be flushed if not updated for some specified time period, or if some other defined event occurred - such as a GUID change on the local HCA. Overall, I think that the risk here is low. >> As a related topic, there will be a separate SA client >> interface defined that >> will generate SA query MADs for the user. >Given the complexity of the RMPP protocol and the subtle bugs which everyone >has encountered while implementing and debugging it (timeouts, retries, abort, >window size management, class header offset, etc), it would be best to limit >the number of copies of this protocol within the system. Keeping the RMPP >details hidden just in the kernel would be best. An analogy might be the way >sockets hides the details of the TCP/IP protocol from applications. While I'm >not aware of any changes in the works, we all remember the significant changes >which occurred between IBTA 1.0 and IBTA 1.1 in the RMPP area. 
If any similar >significant revision to the protocol occurred it would be best to have it all >implemented in just one place. RMPP is implemented by the MAD layer, and is hidden from any clients using the MAD services. There will still be only a single RMPP implementation in the stack. - Sean
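Sean's invalidation rules — flush the whole cache on a local LID or GUID change, or when it has gone unrefreshed for too long — can be sketched as a single staleness check. Field and parameter names are assumptions for illustration:

```c
#include <stdint.h>
#include <time.h>

struct sa_cache_state {
    uint16_t local_lid;     /* LID recorded when the cache was built */
    uint64_t local_guid;    /* GUID of the local HCA port */
    time_t   last_update;   /* time of the last successful refresh */
};

static int sa_cache_is_stale(const struct sa_cache_state *c,
                             uint16_t cur_lid, uint64_t cur_guid,
                             time_t now, time_t max_age_sec)
{
    if (cur_lid != c->local_lid || cur_guid != c->local_guid)
        return 1;                            /* local identity changed */
    return (now - c->last_update) > max_age_sec;  /* unrefreshed too long */
}
```

A disk-loaded cache would run this check at startup before serving any query, which addresses Todd's concern about stale PathRecords across a reboot.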
RE: [openib-general] SA cache design
> From: Sean Hefty [mailto:[EMAIL PROTECTED] > Your kernel footprint is smaller than I expected, which is > good. The key is that while there are O(N^2) path records in a fabric, only O(N) are of interest to a given node. Hence if you only replicate entries where this node is the source the size of the replica is significantly smaller. If someone is curious and wants to see all path records in the system, that would be a query you would let go through to the SA (and it would be a very infrequent query since no real world app, beyond fabric debug tools, would care about the paths which don't involve the node making the query). This of course implies the "SA Mux" must analyze more than just the attribute ID to determine if the replica can handle the query. But the memory savings is well worth the extra level of filtering. > Note that with > a MAD interface, kernel modules would still have access to > any cached data. I > also wanted to stick with usermode to allow saving the cache > to disk, so that it > would be available immediately after a reboot. (My > assumption being that > changes to the network topology would be rare, so we could > optimize around a > stable network design.) It is risky to assume that PathRecords would stay the same across a node reboot. It is very likely that the SM could assign different LIDs or if the node is down for an extended period other things in the fabric could have significantly changed. > > As a related topic, there will be a separate SA client > interface defined that > will generate SA query MADs for the user. Given the complexity of the RMPP protocol and the subtle bugs which everyone has encountered while implementing and debugging it (timeouts, retries, abort, window size management, class header offset, etc), it would be best to limit the number of copies of this protocol within the system. Keeping the RMPP details hidden just in the kernel would be best. 
An analogy might be the way sockets hides the details of the TCP/IP protocol from applications. While I'm not aware of any changes in the works, we all remember the significant changes which occurred between IBTA 1.0 and IBTA 1.1 in the RMPP area. If any similar significant revision to the protocol occurred it would be best to have it all implemented in just one place. my $0.02 Todd Rimmer
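Todd's earlier point in this message — that the "SA Mux" must analyze more than the attribute ID, because the replica only holds paths where this node is the source — amounts to a predicate like the one below. The component-mask bit value is an assumption for illustration, not the IBA-defined constant:

```c
#include <stdint.h>
#include <string.h>

#define PR_COMPMASK_SGID (1ULL << 3)   /* assumed bit position */

/* A PathRecord query is only answerable from the local replica if its
 * component mask pins SGID to one of our port GIDs; a wildcard source
 * could match paths this node does not hold. */
static int replica_can_answer_pr(uint64_t comp_mask,
                                 const uint8_t query_sgid[16],
                                 const uint8_t local_gid[16])
{
    if (!(comp_mask & PR_COMPMASK_SGID))
        return 0;
    return memcmp(query_sgid, local_gid, 16) == 0;
}
```

The rare "show me all paths in the fabric" debug query fails this test and is routed through to the real SA, exactly as Todd describes.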
RE: [openib-general] SA cache design
On Thu, 2006-01-05 at 18:24, Sean Hefty wrote: > >For the precise language, see C15-0-1.24 p. 923 IBA 1.2: > > > > > >C15-0.1.24: It shall be possible to determine the location of SA from > >any > >endport by sending a GMP to QP1 (the GSI) of the node identified by the > >endport's PortInfo:MasterSMLID, using in the GMP the base LID of the > >endport as the SLID, the endport's PortInfo:MasterSMSL as the SL, the > >well-known Q_Key (0x8001_0000), and whichever of the default P_Keys > >(0xFFFF or 0x7FFF) was placed in the endport's P_Key Table by the SM > >(Table 183 Initialization on page 868). > > > >so I overstated it a bit but this needs to be obeyed. > > Could each of the requests be redirected to different nodes? Yes. > I can envision how > the sa_cache could eventually build towards a distributed SA. I think a distributed SA is more like it rather than an SA cache. -- Hal > >C15-0.1.25: A SubnAdmGet(ClassPortInfo) sent according to C15- > >0.1.24: shall return all information needed to communicate with Subnet > >Administration. Alternatively, valid GMPs for SA sent according to C15- > >0.1.24: shall either return redirection responses providing all such > >information, or shall be normally processed by SA. > > Thanks for the references. > > - Sean
RE: [openib-general] SA cache design
On Thu, 2006-01-05 at 17:04, Sean Hefty wrote: > >> I hadn't fully figured this out yet. I'm not sure if another MAD class is > >> needed or not. My goal is to implement this as transparent to the > >application > >> as possible without violating the spec, perhaps appearing as an SA on a > >> different LID. > > > >The LID for the (real) SA is determined from PortInfo:MasterSMLID so I > >don't see how this could be done that way. > > I didn't think that it was a requirement that the SA share the same LID as the > SM. For the precise language, see C15-0-1.24 p. 923 IBA 1.2: C15-0.1.24: It shall be possible to determine the location of SA from any endport by sending a GMP to QP1 (the GSI) of the node identified by the endport's PortInfo:MasterSMLID, using in the GMP the base LID of the endport as the SLID, the endport's PortInfo:MasterSMSL as the SL, the well-known Q_Key (0x8001_0000), and whichever of the default P_Keys (0xFFFF or 0x7FFF) was placed in the endport's P_Key Table by the SM (Table 183 Initialization on page 868). so I overstated it a bit but this needs to be obeyed. Also, C15-0.1.25: A SubnAdmGet(ClassPortInfo) sent according to C15- 0.1.24: shall return all information needed to communicate with Subnet Administration. Alternatively, valid GMPs for SA sent according to C15- 0.1.24: shall either return redirection responses providing all such information, or shall be normally processed by SA. -- Hal
RE: [openib-general] SA cache design
>- It is implemented in kernel mode > - while user mode may help during initial debug, it will be important for > kernel mode ULPs such as SRP, IPoIB and SDP to also make use of >these records Your kernel footprint is smaller than I expected, which is good. Note that with a MAD interface, kernel modules would still have access to any cached data. I also wanted to stick with usermode to allow saving the cache to disk, so that it would be available immediately after a reboot. (My assumption being that changes to the network topology would be rare, so we could optimize around a stable network design.) As a related topic, there will be a separate SA client interface defined that will generate SA query MADs for the user. - Sean
RE: [openib-general] SA cache design
On Thu, 2006-01-05 at 16:51, Sean Hefty wrote: > I agree that this is a problem, but my preference would be for a dedicated > kernel module to handle multicast join/leave requests. In addition to multicast, it's also service records and event subscriptions. -- Hal
RE: [openib-general] SA cache design
>Sean, This is great. This is a feature which I find near and dear and is very >important to large fabric scalability. If you look in contrib in the infinicon >area, you will see a version of a SA replica which we implemented in the >linux_discovery tree. The version in SVN is a little dated, but has the major >features and capabilities. If you find it useful I could provide a more >updated version of that component for your reference. Thanks - I will look at the version that is there. - Sean
RE: [openib-general] SA cache design
>> I hadn't fully figured this out yet. I'm not sure if another MAD class is >> needed or not. My goal is to implement this as transparent to the >application >> as possible without violating the spec, perhaps appearing as an SA on a >> different LID. > >The LID for the (real) SA is determined from PortInfo:MasterSMLID so I >don't see how this could be done that way. I didn't think that it was a requirement that the SA share the same LID as the SM. - Sean
RE: [openib-general] SA cache design
>* Regarding the sentence:"Clients would send their queries to the sa_cache >instead of the SA" > I would propose that a "SA MAD send switch" be implemented in the core: Such >a switch > will enable plugging in the SA cache (I would prefer calling it SA local >agent due to > its extended functionality). Once plugged in, this "SA local agent" should >be forwarded all > outgoing SA queries. Once it handles the MAD it should be able to inject the >response through > the core "SA MAD send switch" as if they arrived from the wire. This was my thought as well. I hesitated to refer to the cache as a local agent, since that's an implementation detail. I want to allow the possibility for the cache to reside on another system. For the initial implementation, the cache would be local however. >Functional requirements: >* It is clear that the first SA query to cache is PathRecord. This will be the first cached query in the initial check-in. > So if a new client wants to connect to another node a new PathRecord > query will not need to be sent to the SA. However, recent work on QoS has >pointed out > that under some QoS schemes PathRecord should not be shared by different >clients I'm not sure that QoS handling is the responsibility of the cache. The module requesting the path records should probably deal with this. >* Forgive me for bringing the following issue - over and over to the group: > Multicast Join/Leave should be reference counted. The "SA local agent" could >be > the right place for doing this kind of reference counting (actually if it >does that > it probably needs to be located in the Kernel - to enable cleanup after >killed processes). I agree that this is a problem, but my preference would be for a dedicated kernel module to handle multicast join/leave requests. - Sean
RE: [openib-general] SA cache design
> From: Sean Hefty [mailto:[EMAIL PROTECTED] > > I've been given the task of trying to come up with an > implementation for an SA > cache. The intent is to increase the scalability and > performance of the openib > stack. My current thoughts on the implementation are below. > Any feedback is > welcome. Sean, This is great. This is a feature which I find near and dear and is very important to large fabric scalability. If you look in contrib in the infinicon area, you will see a version of a SA replica which we implemented in the linux_discovery tree. The version in SVN is a little dated, but has the major features and capabilities. If you find it useful I could provide a more updated version of that component for your reference. Some features of it (which you should consider or possibly use as reference code): - It maintains a full replica of: - All Node Records - Path Records relevant to this Node (where this node is Source) - Device Management Agent records for IOUs, IOCs and Service Records - even for a large cluster, the footprint of the above will be < 1MB - It is implemented in kernel mode - while user mode may help during initial debug, it will be important for kernel mode ULPs such as SRP, IPoIB and SDP to also make use of these records - It is in fact a replica, not a cache. 
It maintains an up-to-date replica using the following techniques - registers for SA GID in/out of service notices - such notices when received trigger a query of information about that node only - schedules a periodic full SA query - if notices are successfully registered for, the query is at a slow pace (once every 10 minutes is default, but it's configurable) - if notices are not successfully registered for, the query is at a faster pace (once a minute, but it's configurable) - since notices are unreliable, the periodic sweep is needed to cover for lost notices, however the SA should resend notices which are not responded to - In addition for CAs it performs IOU, IOC and Service record queries and replicates them - this allows for very fast access to IOU/IOC/Service record info by drivers like SRP - hence allowing for faster reconnection and failure recovery handling - It can handle SA outages and still respond to queries while the SA is down, the SA is slow, or while the synchronization process is being performed (e.g. it does all its queries to a temporary replica then updates the main replica, hence if the queries fail or take a long time, the main replica is still available and reasonably accurate). - I like the idea of using the same API for SA queries and allowing an SA mux to choose to query the replica or the actual SA. Hence if later versions choose to extend what is maintained in the replica, it would be transparent to applications - The API could allow for a flag to force a query against the replica or against the actual SA, with the default being to allow the "SA mux" to select which to use > > To keep the design as flexible as possible, my plan is to > implement the cache in > userspace. The interface to the cache would be via MADs. > Clients would send > their queries to the sa_cache instead of the SA itself. The > format of the MADs > would be essentially identical to those used to query the SA > itself. 
> Response MADs would contain any requested information. If the cache
> could not satisfy a request, the sa_cache would query the SA, update
> its cache, then return a reply.

In our stack we had a separate, more advanced SA query API (referred to as the Subnet Driver API). This has evolved significantly since the old Intel IbAccess days, but still has similarities. It handled all the details of the query, including retries (as specified by the caller), timeouts and even multi-level queries (get path records based on Node GUIDs, etc). It also handled the RMPP aspects and hid the intermediate RMPP headers and control protocol.

You may want to consider defining and using such an API instead of MADs, lest the user of the SA replica need to also implement RMPP itself. Given such an API, the implementation could choose to query the actual SA or the replica, and hide the RMPP details in the SA query case.

Todd Rimmer

___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
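The refresh scheme Todd describes (notice-driven slow sweep, fast fallback sweep when notice registration fails, and a temporary replica so queries never take the main copy offline) can be sketched roughly as follows. All names, structures and the query callback are illustrative stand-ins, not the actual contrib code:

```c
#include <stdbool.h>

#define SLOW_SWEEP_SECS (10 * 60)  /* notices registered: sweep every 10 minutes */
#define FAST_SWEEP_SECS 60         /* no notices: sweep every minute */

struct sa_replica {
    int num_node_records;   /* stand-in for the real record tables */
    /* ... path records, IOU/IOC/Service Records ... */
};

struct replica_state {
    bool notices_registered;   /* did Set(InformInfo) succeed? */
    struct sa_replica main;    /* the copy readers always see */
};

/* Pick the next sweep interval from the notice-registration state. */
int next_sweep_interval(const struct replica_state *st)
{
    return st->notices_registered ? SLOW_SWEEP_SECS : FAST_SWEEP_SECS;
}

/*
 * One full sweep: query into a temporary replica and replace the main
 * replica only on success, so a slow or dead SA never takes the
 * existing (stale but usable) replica offline.
 */
bool do_sweep(struct replica_state *st,
              bool (*query_sa)(struct sa_replica *out))
{
    struct sa_replica tmp = { 0 };

    if (!query_sa(&tmp))
        return false;   /* keep serving the old copy */
    st->main = tmp;     /* would be an atomic/locked swap for real */
    return true;
}

/* Illustrative stand-ins for the "query everything from the SA" path. */
bool demo_query_ok(struct sa_replica *out)
{
    out->num_node_records = 42;
    return true;
}

bool demo_query_fail(struct sa_replica *out)
{
    (void)out;
    return false;
}
```

The point of the temporary replica shows up in do_sweep(): a failed full query leaves st->main untouched, so readers keep getting answers from the last good copy.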
Re: [openib-general] SA cache design
Hi Eitan,

On Thu, 2006-01-05 at 07:27, Eitan Zahavi wrote:
> Hi Sean,
>
> This is a great initiative - tackling an important issue. I am glad
> you took this on. Please see below.
>
> Sean Hefty wrote:
> > I've been given the task of trying to come up with an implementation
> > for an SA cache. The intent is to increase the scalability and
> > performance of the openib stack. My current thoughts on the
> > implementation are below. Any feedback is welcome.
> >
> > To keep the design as flexible as possible, my plan is to implement
> > the cache in userspace. The interface to the cache would be via
> > MADs. Clients would send their queries to the sa_cache instead of
> > the SA itself. The format of the MADs would be essentially identical
> > to those used to query the SA itself. Response MADs would contain
> > any requested information. If the cache could not satisfy a request,
> > the sa_cache would query the SA, update its cache, then return a
> > reply.
>
> * I think the idea of using MADs to interface with the cache is very good.
> * User space implementation: this also might be a good tradeoff between
>   coding and debugging versus the impact on the number of connections
>   per second. I hope the impact on performance will not be too big.
>   Maybe we can take the path of implementing in user space, and if the
>   performance penalty is too high we can port to kernel.
> * Regarding the sentence "Clients would send their queries to the
>   sa_cache instead of the SA": I would propose that an "SA MAD send
>   switch" be implemented in the core. Such a switch will enable
>   plugging in the SA cache (I would prefer calling it "SA local agent"
>   due to its extended functionality). Once plugged in, this "SA local
>   agent" should be forwarded all outgoing SA queries. Once it handles
>   a MAD, it should be able to inject the response through the core
>   "SA MAD send switch" as if it arrived from the wire.
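A minimal sketch of the per-attribute routing such an "SA MAD send switch" could do. The class and attribute constants are the standard IBA values, but the structure and function here are hypothetical, not the OpenIB core API:

```c
#include <stdint.h>

#define IB_MGMT_CLASS_SUBN_ADM 0x03   /* SA management class */
#define IB_SA_ATTR_PATH_REC    0x0035 /* PathRecord attribute */

struct sa_mad_hdr {
    uint8_t  mgmt_class;
    uint16_t attr_id;   /* network byte order in a real MAD; host order here */
};

enum sa_route { ROUTE_LOCAL_AGENT, ROUTE_WIRE };

/*
 * Decide where an outgoing MAD goes. Only SA-class MADs whose
 * attribute the local agent handles are intercepted; everything else
 * goes out on the wire unchanged. Unhandled SA attributes also go to
 * the wire, which is what makes the switch per-attribute rather than
 * per-class.
 */
enum sa_route sa_switch_route(const struct sa_mad_hdr *hdr)
{
    if (hdr->mgmt_class != IB_MGMT_CLASS_SUBN_ADM)
        return ROUTE_WIRE;

    switch (hdr->attr_id) {
    case IB_SA_ATTR_PATH_REC:   /* grow this list as the cache grows */
        return ROUTE_LOCAL_AGENT;
    default:
        return ROUTE_WIRE;
    }
}
```

When the local agent cannot answer an intercepted MAD, it would re-send it through the same switch toward the real SA, matching the forwarding behavior described in the thread.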
> > The benefits that I see with this approach are:
> >
> > + Clients would only need to send requests to the sa_cache.
> > + The sa_cache can be implemented in stages. Requests that it cannot
> >   handle would just be forwarded to the SA.
> > + The sa_cache could be implemented on each host, or a select number
> >   of hosts.
> > + The interface to the sa_cache is similar to that used by the SA.
> > + The cache would use virtual memory and could be saved to disk.
> >
> > Some drawbacks specific to this method are:
> >
> > - The MAD interface will result in additional data copies and
> >   userspace to kernel transitions for clients residing on the local
> >   system.
> > - Clients require a mechanism to locate the sa_cache, or need to make
> >   assumptions about its location.
>
> The proposal for an "SA MAD send switch" in the core will resolve this
> issue. No client change will be required, as all MADs are sent through
> the core, which will redirect them to the SA agent ...

I see this as more granular than a complete switch for the entire class - more like on a per-attribute basis.

> Functional requirements:
> * It is clear that the first SA query to cache is PathRecord. So if a
>   new client wants to connect to another node, a new PathRecord query
>   will not need to be sent to the SA. However, recent work on QoS has
>   pointed out that under some QoS schemes a PathRecord should not be
>   shared by different clients or even connections. There are several
>   ways to make such a QoS scheme scale. Since this is a different
>   discussion topic, I only bring this up so that we take into account
>   that caching might also need to be done by a complex key (not just
>   SRC/DST ...).

Per the QoS direction, this complex key is indeed part of the enhanced PathRecord, right?

> * Forgive me for bringing the following issue - over and over - to the
>   group: Multicast Join/Leave should be reference counted.
>   The "SA local agent" could be the right place for doing this kind of
>   reference counting (actually, if it does that, it probably needs to
>   be located in the kernel - to enable cleanup after killed processes).

The cache itself may need another level of reference counting (even if invalidation is broadcast).

> * Similarly - "Client re-registration" could be made transparent to
>   clients.
>
> Cache Invalidation:
> Several discussions about PathRecord invalidation were spawned in the
> past. IMO, it is enough to be notified about death of local processes,
> remote port availability (traps 64/65) and multicast group
> availability (traps 66/67) in order to invalidate SA cache
> information.

I think that it's more complicated than this. As an example, how does the SA cache know whether a cached path record needs to be changed based on traps 64/65? It seems to me to need to be tightly tied to the SM/SA for this.

> So each SA Agent could register to obtain this data. But that solution
> does not scale nicely, as the SA needs to send notifications to all
> nodes.
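The reference-counted multicast join/leave Eitan asks for boils down to a small piece of bookkeeping: only the edge transitions reach the SA. This is an illustrative sketch (and, as noted above, the real thing would want to live in the kernel so it can clean up after killed processes):

```c
#include <stdbool.h>

/* Hypothetical per-MGID state; one of these per multicast group the
 * node participates in. */
struct mcast_group {
    unsigned refcount;   /* local consumers of this group */
};

/*
 * Returns true when a real SA join (Set of a MCMemberRecord) must be
 * sent - i.e. only for the first local joiner.
 */
bool mcast_join(struct mcast_group *g)
{
    return g->refcount++ == 0;
}

/*
 * Returns true when a real leave must be sent to the SA - i.e. only
 * for the last local leaver. Callers must balance joins and leaves.
 */
bool mcast_leave(struct mcast_group *g)
{
    return --g->refcount == 0;
}
```

Everything between the first join and the last leave is purely local, which is exactly what keeps redundant join/leave traffic off the SA.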
Re: [openib-general] SA cache design
Hi Sean,

On Tue, 2006-01-03 at 20:15, Sean Hefty wrote:
> Hal Rosenstock wrote:
> > > I've been given the task of trying to come up with an
> > > implementation for an SA cache. The intent is to increase the
> > > scalability and performance of the openib stack. My current
> > > thoughts on the implementation are below. Any feedback is welcome.
> > >
> > > To keep the design as flexible as possible, my plan is to implement
> > > the cache in userspace. The interface to the cache would be via
> > > MADs.
> >
> > Would this be another MAD class which mimics the SA class?
>
> I hadn't fully figured this out yet. I'm not sure if another MAD class
> is needed or not. My goal is to implement this as transparently to the
> application as possible without violating the spec, perhaps appearing
> as an SA on a different LID.

The LID for the (real) SA is determined from PortInfo:MasterSMLID, so I don't see how this could be done that way.

> > > Clients would send their queries to the sa_cache instead of the SA
> > > itself. The format of the MADs would be essentially identical to
> > > those used to query the SA itself. Response MADs would contain any
> > > requested information. If the cache could not satisfy a request,
> > > the sa_cache would query the SA, update its cache, then return a
> > > reply.
> > >
> > > The benefits that I see with this approach are:
> > >
> > > + Clients would only need to send requests to the sa_cache.
> > > + The sa_cache can be implemented in stages. Requests that it
> > >   cannot handle would just be forwarded to the SA.
> >
> > Another option would be for the SA cache to indicate what requests it
> > handles (some MADs for this) and have the clients only go to the
> > cache for those queries (and direct to the SA for the others).
>
> I thought about this, but this puts an additional burden on the
> clients.

Sure, but how significant is this, especially if the two requests look alike with some minor exception(s) like the class?
I would think this would make up for eliminating the extra indirection in the case where the cache does not support the request.

> Letting the sa_cache forward the request allows it to send the
> requests to another sa_cache, rather than directly to the SA. There's
> some additional flexibility that we gain in the long-term design by
> forwarding requests. (I'm thinking of the possibility of having an
> sa_cache hierarchy.)

Sure; a hierarchical cache should scale even better.

> > > + The sa_cache could be implemented on each host, or a select
> > >   number of hosts.
> > > + The interface to the sa_cache is similar to that used by the SA.
> > > + The cache would use virtual memory and could be saved to disk.
> > >
> > > Some drawbacks specific to this method are:
> > >
> > > - The MAD interface will result in additional data copies and
> > >   userspace to kernel transitions for clients residing on the local
> > >   system.
> > > - Clients require a mechanism to locate the sa_cache, or need to
> > >   make assumptions about its location.
> >
> > Would SA caching be a service ID or set of IDs?
>
> I'd like the sa_cache to give the appearance of being a standard SA as
> much as possible.

Yes, the closer the cache requests are to the real SA requests, the better.

> One effect is that an sa_cache may not be able to run on the same node
> as the actual SA,

Not sure why this would be the case.

> but that restriction seems desirable to me.

Agreed.

> > Are there also issues around cache invalidation?
>
> I didn't list cache synchronization as an issue because I couldn't
> think of any problems that were specific to this design, versus being
> a general issue.

Yes, this is a general design issue. The whole idea of how requests are matched to the cache (what info is kept in the cache) and the invalidation are key. Just take PathRecords as one example.
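On the "how requests are matched to the cache" question: even for PathRecords alone, the match key may need to be wider than SRC/DST once QoS enters the picture, as raised elsewhere in this thread. The extra fields below are guesses for illustration, not a defined key:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical PathRecord cache key; everything beyond sgid/dgid is a
 * guess at what a QoS-aware key might need. */
struct path_key {
    uint8_t  sgid[16];
    uint8_t  dgid[16];
    uint16_t pkey;        /* partition */
    uint8_t  sl;          /* service level: the obvious QoS component */
    uint64_t service_id;  /* could separate clients or even connections */
};

/* Field-by-field compare; memcmp over the whole struct would also
 * compare padding bytes. */
bool path_key_eq(const struct path_key *a, const struct path_key *b)
{
    return memcmp(a->sgid, b->sgid, sizeof(a->sgid)) == 0 &&
           memcmp(a->dgid, b->dgid, sizeof(a->dgid)) == 0 &&
           a->pkey == b->pkey &&
           a->sl == b->sl &&
           a->service_id == b->service_id;
}
```

With a key like this, two requests for the same SRC/DST but different service levels would land on different cache entries, which is the behavior the QoS discussion seems to require.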
-- Hal
Re: [openib-general] SA cache design
Hi Sean,

This is a great initiative - tackling an important issue. I am glad you took this on. Please see below.

Sean Hefty wrote:

I've been given the task of trying to come up with an implementation for an SA cache. The intent is to increase the scalability and performance of the openib stack. My current thoughts on the implementation are below. Any feedback is welcome.

To keep the design as flexible as possible, my plan is to implement the cache in userspace. The interface to the cache would be via MADs. Clients would send their queries to the sa_cache instead of the SA itself. The format of the MADs would be essentially identical to those used to query the SA itself. Response MADs would contain any requested information. If the cache could not satisfy a request, the sa_cache would query the SA, update its cache, then return a reply.

* I think the idea of using MADs to interface with the cache is very good.
* User space implementation: this also might be a good tradeoff between coding and debugging versus the impact on the number of connections per second. I hope the impact on performance will not be too big. Maybe we can take the path of implementing in user space, and if the performance penalty is too high we can port to kernel.
* Regarding the sentence "Clients would send their queries to the sa_cache instead of the SA": I would propose that an "SA MAD send switch" be implemented in the core. Such a switch will enable plugging in the SA cache (I would prefer calling it "SA local agent" due to its extended functionality). Once plugged in, this "SA local agent" should be forwarded all outgoing SA queries. Once it handles a MAD, it should be able to inject the response through the core "SA MAD send switch" as if it arrived from the wire.

The benefits that I see with this approach are:

+ Clients would only need to send requests to the sa_cache.
+ The sa_cache can be implemented in stages. Requests that it cannot handle would just be forwarded to the SA.
+ The sa_cache could be implemented on each host, or a select number of hosts.
+ The interface to the sa_cache is similar to that used by the SA.
+ The cache would use virtual memory and could be saved to disk.

Some drawbacks specific to this method are:

- The MAD interface will result in additional data copies and userspace to kernel transitions for clients residing on the local system.
- Clients require a mechanism to locate the sa_cache, or need to make assumptions about its location.

The proposal for an "SA MAD send switch" in the core will resolve this issue. No client change will be required, as all MADs are sent through the core, which will redirect them to the SA agent ...

Functional requirements:

* It is clear that the first SA query to cache is PathRecord. So if a new client wants to connect to another node, a new PathRecord query will not need to be sent to the SA. However, recent work on QoS has pointed out that under some QoS schemes a PathRecord should not be shared by different clients or even connections. There are several ways to make such a QoS scheme scale. Since this is a different discussion topic, I only bring this up so that we take into account that caching might also need to be done by a complex key (not just SRC/DST ...).
* Forgive me for bringing the following issue - over and over - to the group: Multicast Join/Leave should be reference counted. The "SA local agent" could be the right place for doing this kind of reference counting (actually, if it does that, it probably needs to be located in the Kernel - to enable cleanup after killed processes).
* Similarly - "Client re-registration" could be made transparent to clients.

Cache Invalidation:

Several discussions about PathRecord invalidation were spawned in the past. IMO, it is enough to be notified about death of local processes, remote port availability (traps 64/65) and multicast group availability (traps 66/67) in order to invalidate SA cache information.
So each SA Agent could register to obtain this data. But that solution does not scale nicely, as the SA needs to send notification to all nodes (but it is reliable - it could resend until repressed). However, the current IBTA definition for InformInfo (the event-forwarding mechanism) does not allow for multicast of Report(Notice). The reason is that registration for event forwarding is done with Set(InformInfo), which uses the requester's QP and LID as the address for sending the matching report.

A simple way around that limitation could be to enable the SM to "pre-register" a well-known multicast group target for event forwarding. One issue, though, would be that UD multicast is not reliable, and some notifications could get lost. A notification sequence number could be used to catch these missed notifications eventually.

Eitan
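The trap-driven invalidation plus the notification sequence number Eitan describes could look roughly like this; the tracker, its fields and the return convention are all hypothetical, while the trap numbers are the standard IBA values:

```c
#include <stdbool.h>
#include <stdint.h>

#define TRAP_GID_IN_SERVICE     64  /* port became reachable */
#define TRAP_GID_OUT_OF_SERVICE 65  /* port went away */

struct notice_tracker {
    uint16_t expect_seq;   /* next notice sequence number we expect */
    bool     need_resync;  /* set when a gap reveals a lost notice */
};

/*
 * Feed one received notice. Returns true when path records involving
 * the affected GID should be invalidated (traps 64/65); multicast
 * group traps 66/67 would invalidate group info instead. A sequence
 * gap means UD multicast dropped a notice, so a full resync against
 * the SA is flagged to cover whatever was missed.
 */
bool handle_notice(struct notice_tracker *t, uint16_t seq, uint16_t trap)
{
    if (seq != t->expect_seq)
        t->need_resync = true;
    t->expect_seq = (uint16_t)(seq + 1);

    return trap == TRAP_GID_IN_SERVICE || trap == TRAP_GID_OUT_OF_SERVICE;
}
```

This is the same "notices are unreliable, so back them with a sweep" pattern Todd describes earlier in the thread, with the sequence number letting the agent resync on demand instead of only on a timer.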
Re: [openib-general] SA cache design
Hal Rosenstock wrote:

I've been given the task of trying to come up with an implementation for an SA cache. The intent is to increase the scalability and performance of the openib stack. My current thoughts on the implementation are below. Any feedback is welcome.

To keep the design as flexible as possible, my plan is to implement the cache in userspace. The interface to the cache would be via MADs.

Would this be another MAD class which mimics the SA class?

I hadn't fully figured this out yet. I'm not sure if another MAD class is needed or not. My goal is to implement this as transparently to the application as possible without violating the spec, perhaps appearing as an SA on a different LID.

Clients would send their queries to the sa_cache instead of the SA itself. The format of the MADs would be essentially identical to those used to query the SA itself. Response MADs would contain any requested information. If the cache could not satisfy a request, the sa_cache would query the SA, update its cache, then return a reply.

The benefits that I see with this approach are:

+ Clients would only need to send requests to the sa_cache.
+ The sa_cache can be implemented in stages. Requests that it cannot handle would just be forwarded to the SA.

Another option would be for the SA cache to indicate what requests it handles (some MADs for this) and have the clients only go to the cache for those queries (and direct to the SA for the others).

I thought about this, but this puts an additional burden on the clients. Letting the sa_cache forward the request allows it to send the requests to another sa_cache, rather than directly to the SA. There's some additional flexibility that we gain in the long-term design by forwarding requests. (I'm thinking of the possibility of having an sa_cache hierarchy.)

+ The sa_cache could be implemented on each host, or a select number of hosts.
+ The interface to the sa_cache is similar to that used by the SA.
+ The cache would use virtual memory and could be saved to disk.

Some drawbacks specific to this method are:

- The MAD interface will result in additional data copies and userspace to kernel transitions for clients residing on the local system.
- Clients require a mechanism to locate the sa_cache, or need to make assumptions about its location.

Would SA caching be a service ID or set of IDs?

I'd like the sa_cache to give the appearance of being a standard SA as much as possible. One effect is that an sa_cache may not be able to run on the same node as the actual SA, but that restriction seems desirable to me.

Are there also issues around cache invalidation?

I didn't list cache synchronization as an issue because I couldn't think of any problems that were specific to this design, versus being a general issue.

- Sean
Re: [openib-general] SA cache design
Hi Sean,

On Tue, 2006-01-03 at 19:42, Sean Hefty wrote:
> I've been given the task of trying to come up with an implementation
> for an SA cache. The intent is to increase the scalability and
> performance of the openib stack. My current thoughts on the
> implementation are below. Any feedback is welcome.
>
> To keep the design as flexible as possible, my plan is to implement
> the cache in userspace. The interface to the cache would be via MADs.

Would this be another MAD class which mimics the SA class?

> Clients would send their queries to the sa_cache instead of the SA
> itself. The format of the MADs would be essentially identical to
> those used to query the SA itself. Response MADs would contain any
> requested information. If the cache could not satisfy a request, the
> sa_cache would query the SA, update its cache, then return a reply.
>
> The benefits that I see with this approach are:
>
> + Clients would only need to send requests to the sa_cache.
> + The sa_cache can be implemented in stages. Requests that it cannot
>   handle would just be forwarded to the SA.

Another option would be for the SA cache to indicate what requests it handles (some MADs for this) and have the clients only go to the cache for those queries (and direct to the SA for the others).

> + The sa_cache could be implemented on each host, or a select number
>   of hosts.
> + The interface to the sa_cache is similar to that used by the SA.
> + The cache would use virtual memory and could be saved to disk.
>
> Some drawbacks specific to this method are:
>
> - The MAD interface will result in additional data copies and
>   userspace to kernel transitions for clients residing on the local
>   system.
> - Clients require a mechanism to locate the sa_cache, or need to make
>   assumptions about its location.

Would SA caching be a service ID or set of IDs?

Are there also issues around cache invalidation?
-- Hal
[openib-general] SA cache design
I've been given the task of trying to come up with an implementation for an SA cache. The intent is to increase the scalability and performance of the openib stack. My current thoughts on the implementation are below. Any feedback is welcome.

To keep the design as flexible as possible, my plan is to implement the cache in userspace. The interface to the cache would be via MADs. Clients would send their queries to the sa_cache instead of the SA itself. The format of the MADs would be essentially identical to those used to query the SA itself. Response MADs would contain any requested information. If the cache could not satisfy a request, the sa_cache would query the SA, update its cache, then return a reply.

The benefits that I see with this approach are:

+ Clients would only need to send requests to the sa_cache.
+ The sa_cache can be implemented in stages. Requests that it cannot handle would just be forwarded to the SA.
+ The sa_cache could be implemented on each host, or a select number of hosts.
+ The interface to the sa_cache is similar to that used by the SA.
+ The cache would use virtual memory and could be saved to disk.

Some drawbacks specific to this method are:

- The MAD interface will result in additional data copies and userspace to kernel transitions for clients residing on the local system.
- Clients require a mechanism to locate the sa_cache, or need to make assumptions about its location.

I'm sure that there are other benefits and drawbacks that I'm missing.

- Sean
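The request flow proposed above (answer from the cache when possible; otherwise query the SA, update the cache, then reply) can be sketched as follows. The ops table and the tiny demo store are illustrative stand-ins for the real record tables and MAD path, not any OpenIB API:

```c
#include <stdbool.h>

struct sa_request  { int key; };
struct sa_response { int value; bool from_cache; };

/* Callbacks standing in for the cache's record store and the path to
 * the real SA (which might itself be another sa_cache in a hierarchy). */
struct sa_cache_ops {
    bool (*lookup)(int key, int *value);    /* local cache table */
    bool (*query_sa)(int key, int *value);  /* forward to the SA */
    void (*update)(int key, int value);     /* fill the cache on a miss */
};

bool sa_cache_handle(const struct sa_cache_ops *ops,
                     const struct sa_request *req,
                     struct sa_response *rsp)
{
    int value;

    if (ops->lookup(req->key, &value)) {
        rsp->value = value;
        rsp->from_cache = true;
        return true;
    }
    if (!ops->query_sa(req->key, &value))
        return false;              /* SA unreachable: report failure */
    ops->update(req->key, value);  /* so the next request hits */
    rsp->value = value;
    rsp->from_cache = false;
    return true;
}

/* One-entry demo store so the flow can be exercised. */
int demo_key = -1, demo_val;

bool demo_lookup(int key, int *value)
{
    if (key != demo_key)
        return false;
    *value = demo_val;
    return true;
}

bool demo_query(int key, int *value)
{
    *value = key * 10;   /* pretend this is the SA's answer */
    return true;
}

void demo_update(int key, int value)
{
    demo_key = key;
    demo_val = value;
}
```

The first request for a key misses and goes to the SA; the second is served locally. That difference is the whole scalability argument: the SA sees each record once per cache, not once per client request.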