Hi, I'm running the IOR benchmark on a large shared-memory machine with a Lustre file system. I configured IOR to use one independent file per process so that aggregate bandwidth is maximized. I ran N MPI processes, where N < the number of cores in a socket. When I place all N MPI processes on a single socket, write performance scales well. However, when I spread those N MPI processes across N sockets (so, 1 MPI process per socket), performance does not scale and stays flat beyond 4 MPI processes. I expected it to scale as well as the single-socket case, but it does not.
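For reference, here is roughly how I run the two placements (a sketch, assuming Open MPI's `mpirun` and IOR's standard `-F` file-per-process mode; the transfer/block sizes and output path are just placeholders, not my exact settings):

```shell
# Case 1: all N processes packed onto one socket
mpirun -np 8 --map-by core --bind-to core \
    ior -F -w -t 1m -b 1g -o /lustre/scratch/ior_testfile

# Case 2: one process per socket, spread across N sockets
mpirun -np 8 --map-by ppr:1:socket --bind-to socket \
    ior -F -w -t 1m -b 1g -o /lustre/scratch/ior_testfile
```

With `-F`, each rank writes to its own file (`ior_testfile.00000000`, etc.), so I expected no inter-process lock contention in either case.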
I assumed that if each MPI process writes to its own independent file, there should be no file locking among the MPI processes. However, there seems to be some. Is there any way to avoid that locking or overhead? It may not be a file-locking issue at all; I don't know the exact reason for the poor performance. Any help will be appreciated. David