Xiangdong <[email protected]> writes:

> For these functions, the flop ratios are all 1.1, while the time ratios
> are 1.5-2.2, so the amount of work is roughly balanced across processes.
> Runs on both Stampede and my group's cluster show similar behavior. Given
> that I only use 256 cores, do you think it is likely that my job was
> assigned cores with different speeds? How can I test/measure this, since
> the job is assigned to different nodes each time?
>
> Are there any other factors I should look into for this behavior (flop
> ratio 1.1 but time ratio 1.5-2.1 for non-communicating functions)?
Memory bandwidth can be an issue: some nodes may simply have slower memory installed. Or, as happened to Dave and me at ETH, a stale, lopsided ramdisk partition left behind by a previous job can cause all of your memory to be faulted onto a single memory channel, crippling bandwidth. You can investigate such issues with numastat and third-party profilers. I would start by seeing whether you can reproduce the imbalance with simpler PETSc examples, then try to distinguish the performance of a flops-limited local operation from a bandwidth-limited one. It might be simple to figure out, but it might also take a lot of work.
