Sorry to bounce this off the list if it is too remedial. I promise that I've been consuming a lot of the spec and OFA code. Consider that either a promise or a warning that we will be more active :|
Our configuration is >6000 CAs in a mix of InfiniHost III/ConnectX, plus Longbow extenders and >800 24-port switches, all on a single subnet (SGI ICE with lots of other stuff plugged in). It's DDR everywhere except across the Longbows. Hosts range across a few different generations of x86 Xeon, x86 Opteron, and Itanium. We use Lustre but keep the SRP traffic on a separate subnet.

A few weeks ago connection setup times were mentioned on this list, along with ARP and path record lookups not being scalable. We experience these problems as well and need to address these scalability issues. I have quite a bit of test data and a few different ideas to bounce off the list regarding path records, once I am a little more versed in the spec. There has already been some work done to limit ARP traffic.

Today's question has to do with SM errors. We have been seeing lots of these (log excerpt below), sometimes more than others. Digging around, it appears that the 6777 represents the number of duplicates? This value fluctuates some, but not a lot. Comments in the code indicate that any value >1 is a problem. The question is: is it OK for this to be happening, and how does it occur?

We will probably update to the 1.4 or 1.4.1 SM in the next few days. We are currently running a pre-1.4 top-of-tree pull from back in December.
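In case it helps anyone compare notes, the fluctuation in the parenthesized value could be tracked with something like the following. This is only a rough sketch: the regex assumes the exact line layout of the ERR 4C05 entries in the log excerpt further down in this message, the function name `tally_4c05` is made up for illustration, and I am assuming (not certain) that the bracketed hex field identifies the logging thread.

```python
import re
from collections import Counter

# Assumed line layout, copied from the ERR 4C05 entries in this message:
# May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: ... (6777)
ERR_4C05 = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\d+\s+"
    r"\[(?P<tag>[0-9A-Fa-f]+)\].*?"
    r"ERR 4C05: Got more than one record for SubnAdmGet \((?P<count>\d+)\)"
)


def tally_4c05(lines):
    """Tally ERR 4C05 lines by bracketed hex tag and by reported record count.

    `lines` is any iterable of log lines (a list, or an open file object).
    Lines that do not match the assumed layout are skipped.
    """
    per_tag = Counter()    # bracketed hex field (assumed to be a thread ID)
    per_count = Counter()  # the number in parentheses, e.g. 6777
    for line in lines:
        match = ERR_4C05.search(line)
        if match is None:
            continue
        per_tag[match.group("tag")] += 1
        per_count[int(match.group("count"))] += 1
    return per_tag, per_count
```

Passing an open log file to `tally_4c05` and printing `per_count.most_common()` would show which record counts are being reported and how often, which is enough to see whether the value really stays near 6777.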
bob

May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
....

-------------------------------------------------------------------------
Robert B. Ciotti
Supercomputing Systems Lead
NASA Advanced Supercomputing (NAS) Division
TEL (650) 604-4408
NASA Ames Research Center
FAX (650) 604-4377
Moffett Field, CA 94035-1000
[email protected]
-------------------------------------------------------------------------
