On Thu, May 28, 2009 at 02:06:38PM -0500, Hal Rosenstock wrote:
> On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti <[email protected]> wrote:
> >
> > Sorry to bounce this off the list - should it be too remedial. I promise
> > that I've been consuming a lot of the spec and OFA code. Maybe you consider
> > that a promise or a warning that we will be more active :|
> >
> > Our configuration is >6000 CA in a mix of InfiniHost III/ConnectX and
> > Longbow extenders and >800 24-port switches on a single subnet (SGI ICE
> > with lots of other stuff plugged in). It's DDR everywhere except across the
> > Longbows. Hosts range from a few different generations of x86 Xeon, x86
> > Opteron and Itanium. We use Lustre but have the SRP traffic on a separate
> > subnet.
> >
> > A few weeks ago connection setup times were mentioned on this list along
> > with ARP and path record lookups not being scalable. We experience these
> > problems as well and need to address these scalability issues. I have
> > quite a bit of test data and a few different ideas to bounce off the list
> > RE path records, once I am a little more versed in the spec. There has
> > already been some work done to limit ARP traffic.
> >
> > Today's question has to do with SM errors.
> > We have been seeing lots of these - sometimes more than others. Digging
> > around some, it appears that the 6777 represents the number of duplicates?
> > This value fluctuates somewhat, but not a lot. Comments in the code
> > indicate that any value >1 is a problem. Question is, is it OK for this
> > to be happening, and how does it occur?
>
> It's an error (and an error status of too many records is returned to the
> SA client in the end node).
>
> Gets are only allowed to return 1 record (GetTable requests can deal
> with more than 1 record in the response), yet many were found by the SA
> that satisfied the request in responding to the Get. Any idea what
> the specific Get is that causes this to occur?

That's the problem. At the debug level we are running at, I can't pin down
the source. Is there a state I can go look for on the clients to see what
it's trying to do?

bob

> -- Hal
>
> > We will probably do an update to the 1.4 or 1.4.1 SM in the next few days.
> > We are currently running a pre-1.4 top-of-tree pull from back in Dec. bob
> >
> >
> > May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> >
> > ....
> >
> > -------------------------------------------------------------------------
> > Robert B. Ciotti                              Supercomputing Systems Lead
> > NASA Advanced Supercomputing (NAS) Division   TEL (650) 604-4408
> > NASA Ames Research Center                     FAX (650) 604-4377
> > Moffett Field, CA 94035-1000                  [email protected]
> > -------------------------------------------------------------------------
> >
> > _______________________________________________
> > general mailing list
> > [email protected]
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
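
For readers following the thread, the Get versus GetTable rule Hal describes can be sketched with a toy model. This is not OpenSM code: the `Record` type, the `sa_get` and `sa_get_table` functions, the query format, and the status strings are all hypothetical names chosen for illustration. The only behaviour taken from the thread is that a Get matching more than one record is answered with a "too many records" error, while a GetTable may return several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for an SA record with two components."""
    lid: int
    port: int


def matches(record, query):
    # A query maps component names to required values; components that
    # are left out of the query act as wildcards.
    return all(getattr(record, name) == value for name, value in query.items())


def sa_get(db, query):
    """Toy Get: succeeds only when exactly one record matches."""
    found = [r for r in db if matches(r, query)]
    if not found:
        return ("NO_RECORDS", None)          # hypothetical status label
    if len(found) > 1:
        # The case logged as "Got more than one record for SubnAdmGet (N)",
        # where N would be len(found).
        return ("TOO_MANY_RECORDS", None)    # hypothetical status label
    return ("OK", found[0])


def sa_get_table(db, query):
    """Toy GetTable: returns every matching record."""
    return ("OK", [r for r in db if matches(r, query)])


db = [Record(lid=1, port=1), Record(lid=1, port=2), Record(lid=2, port=1)]

under_specified = sa_get(db, {"lid": 1})             # two records match
fully_specified = sa_get(db, {"lid": 1, "port": 2})  # exactly one matches
table = sa_get_table(db, {"lid": 1})                 # both are returned
```

In this model, a count as large as 6777 would suggest a Get whose query leaves most components unspecified, so that it matches a large share of the records. Whether that is what the clients here are actually sending is an open question in the thread.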
