Hello all,

There was some discussion at yesterday's tutorial about ORTE scalability and where bottlenecks might be occurring. I spent some time last night identifying key information required to answer those questions. I'll be presenting a slide today showing the key timing points that we would need first.

I have also begun (this morning) to instrument the trunk to measure those times. Some really quick results, all done on a Mac G5:

1. It takes about 3 milliseconds to set up a job (i.e., go through the RDS, RAS, and RMAPS frameworks, set up the stage gate triggers, prep I/O forwarding, etc. - everything before we actually launch). This bounces around a lot (I'm just using gettimeofday), but seems to have at most a slight dependence on the number of processes being launched.

2. It takes roughly 1-3 milliseconds to execute the compound command that registers all of the data from an MPI process (i.e., the data sent at the STG1 stage gate). This is the time required on the HNP to process the command - it doesn't include any time spent actually communicating. It does, however, include time spent packing/unpacking buffers. My tests were all done on a local node for now, so the OOB just passes the buffer across from send to receive. As you would expect, since the info being stored is only from one process, there is no observable scaling dependence here.

3. The time from the start of MPI_Init until we do the registry command is about 12-20 milliseconds - again, as expected, no observable scaling dependence.

There will have to be quite a few tests, of course, but I don't expect the first two values to change very much (obviously, they will depend on the hardware on the head node).

I'll keep you posted as we learn more.

Ralph
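
For anyone who wants to take similar measurements, the gettimeofday approach mentioned above can be sketched roughly as follows. This is only an illustration and not the instrumentation that went into the trunk: elapsed_usec, time_phase, and phase_under_test are hypothetical names, and the busy loop merely stands in for whatever phase (job setup, registry command, etc.) is being timed.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical helper: microseconds elapsed between two gettimeofday() samples. */
static long elapsed_usec(const struct timeval *start, const struct timeval *stop)
{
    return (long)(stop->tv_sec - start->tv_sec) * 1000000L
         + (long)(stop->tv_usec - start->tv_usec);
}

/* Hypothetical stand-in for the code being measured (e.g. job setup). */
static void phase_under_test(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 1000000UL; i++) {
        sink += i;
    }
}

/* Sample the wall clock before and after one call to phase(). */
static long time_phase(void (*phase)(void))
{
    struct timeval start, stop;
    int rc;

    rc = gettimeofday(&start, NULL);
    assert(rc == 0);
    phase();
    rc = gettimeofday(&stop, NULL);
    assert(rc == 0);

    return elapsed_usec(&start, &stop);
}

/* Print one measurement in milliseconds. */
static void report_phase(const char *label, long usec)
{
    printf("%s: %.3f ms\n", label, (double)usec / 1000.0);
}
```

A call such as report_phase("setup", time_phase(phase_under_test)) prints one sample. Because gettimeofday reads the wall clock, individual samples at the millisecond scale can vary noticeably from run to run, so repeating the measurement and looking at the spread is advisable.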