Eureka!
Operationally the 3-argument variants are ALMOST identical. The older
version required len == sizeof(long), while the later version allows
len to vary (so that an Altix can have more than 64 cpus). However, in
the kernel both effectively treat the 3rd argument as an array of
unsigned longs. It appears that with the later kernel interface, glibc
has declared that argument as both "cpu_set_t *" and "unsigned long *".
So, as long as the kernel isn't enforcing len == sizeof(long), a
cpu_set_t can be used with any 3-argument kernel regardless of what the
library headers say.
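One can actually check that layout claim from userland: glibc's cpu_set_t
is just a bit array packed into unsigned longs, so viewing the same bytes
through an "unsigned long *" finds each CPU's bit at word n/BITS, bit
n%BITS. A minimal sketch (the choice of CPUs 0 and 65 is arbitrary, just
picked to land in two different words on a 64-bit machine):

    #define _GNU_SOURCE
    #include <sched.h>   /* cpu_set_t, CPU_ZERO, CPU_SET */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        cpu_set_t set;
        unsigned long words[sizeof(cpu_set_t) / sizeof(unsigned long)];
        const unsigned bits = 8 * sizeof(unsigned long);

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        CPU_SET(65, &set);

        /* View the same bytes as an array of unsigned longs. */
        memcpy(words, &set, sizeof(set));

        /* Bit n of the set lives at word n/bits, bit n%bits. */
        printf("%lu %lu\n",
               (words[0 / bits] >> (0 % bits)) & 1UL,
               (words[65 / bits] >> (65 % bits)) & 1UL);
        return 0;
    }

This prints "1 1", confirming that passing a cpu_set_t where the kernel
expects unsigned longs is byte-for-byte equivalent.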
Looking at the kernel code for various implementations of the 3-arg
version shows that it can be tough to distinguish them with a static
test. On the Altix (using cpu_set_t) one gets errno=EFAULT if len is too
short to cover all the online cpus, while on other kernels I find that a
too-short mask is padded with zeros and no error results. So, we want a
big value for len. However, since the 2-arg version treats the 2nd
argument as an address rather than a length, we can use a len < 4096 to
ensure that, on such a kernel, the len is an invalid (sub-page) address
and the call fails with errno=EFAULT.
The result is the following, which I've tried in limited testing:
    #include <errno.h>   /* errno, EINVAL, EFAULT */
    #include <string.h>  /* memset */
    /* Deliberately NOT including <sched.h>, so our prototype below stands. */

    enum {
        SCHED_SETAFFINITY_TAKES_2_ARGS,
        SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_LONG,
        SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_CPU_SET,
        SCHED_SETAFFINITY_UNKNOWN
    };

    /* We want to call by this prototype, even if it is not the real one */
    extern int sched_setaffinity(int pid, unsigned int len, void *mask);

    int probe_setaffinity(void)
    {
        /* 511 longs = 4088 bytes on 64-bit: bigger than any expected
           kernel mask, yet below 4096 so it cannot be a valid address. */
        unsigned long mask[511];
        int rc;

        memset(mask, 0, sizeof(mask));
        mask[0] = 1;
        rc = sched_setaffinity(0, sizeof(mask), mask);
        if (rc >= 0) {
            /* Kernel truncates over-length masks -> successful call */
            return SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_CPU_SET;
        } else if (errno == EINVAL) {
            /* Kernel returns EINVAL when len != sizeof(long) */
            return SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_LONG;
        } else if (errno == EFAULT) {
            /* Kernel returns EFAULT having rejected len as an address */
            return SCHED_SETAFFINITY_TAKES_2_ARGS;
        }
        return SCHED_SETAFFINITY_UNKNOWN;
    }
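A side note: on kernels with the 3-arg interface one can dodge the
header-prototype mess entirely by invoking the raw system calls through
syscall(2), since the kernel entry points take (pid, len, ptr) no matter
what libc declares. This is Linux-only and does NOT replace the probe
above (a 2-arg kernel would still misread the arguments); a minimal
sketch that just round-trips the current mask, which should always be a
legal setting:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        /* 1024 bits: enough cpus for any machine we are likely to meet. */
        unsigned long mask[1024 / (8 * sizeof(unsigned long))];
        long rc;

        memset(mask, 0, sizeof(mask));

        /* Ask the kernel for our current mask... */
        rc = syscall(SYS_sched_getaffinity, 0, sizeof(mask), mask);
        if (rc < 0) { perror("sched_getaffinity"); return 1; }

        /* ...and hand the very same mask straight back. */
        rc = syscall(SYS_sched_setaffinity, 0, sizeof(mask), mask);
        if (rc < 0) { perror("sched_setaffinity"); return 1; }

        puts("affinity round-trip ok");
        return 0;
    }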
Jeff Squyres wrote:
Greetings all. I'm writing this to ask for help from the general
development community. We've run into a problem with Linux processor
affinity, and although I've individually talked to a lot of people
about this, no one has been able to come up with a solution. So I
thought I'd open this to a wider audience.
This is a long-ish e-mail; bear with me.
As you may or may not know, Open MPI includes support for processor and
memory affinity. There are a number of benefits, but I'll skip that
discussion for now. For more information, see the following:
http://www.open-mpi.org/faq/?category=building#build-paffinity
http://www.open-mpi.org/faq/?category=building#build-maffinity
http://www.open-mpi.org/faq/?category=tuning#paffinity-defs
http://www.open-mpi.org/faq/?category=tuning#maffinity-defs
http://www.open-mpi.org/faq/?category=tuning#using-paffinity
Here's the problem: there are 3 different APIs for processor affinity
in Linux. I have not done exhaustive research on this, but which API
you have seems to depend on your version of kernel, glibc, and/or Linux
vendor (i.e., some vendors appear to port different versions of the API
to their particular kernel/glibc). The issue is that all 3 versions of
the API use the same function names (sched_setaffinity() and
sched_getaffinity()), but they change the number and types of the
parameters to these functions.
This is not a big problem for source distributions of Open MPI -- our
configure script figures out which one you have and uses preprocessor
directives to select the Right stuff in our code base for your
platform.
What *is* a big problem, however, is that ISVs can therefore not ship a
binary Open MPI installation and reasonably expect the processor
affinity aspects of it to work on multiple Linux platforms. That is,
if the ISV compiles for API #X and ships a binary to a system that has
API #Y, there are two options:
1. Processor affinity is disabled. This means that the benefits of
processor affinity won't be visible (not hugely important on 2-way
SMPs, but as the number of processors/cores increases, this is going to
become more important), and Open MPI's NUMA-aware collectives won't be
able to be used (because memory affinity may not be useful without
processor affinity guarantees).
2. Processor affinity is enabled, but the code invokes API #X on a
system with API #Y. This will have unpredictable results, the best
case of which will be that processor affinity is simply [effectively]
ignored; the worst case of which will be that the application will fail
(e.g., seg fault).
Clearly, neither of these options is attractive.
My question to the developer crowd out there -- can you think of a way
around this? More specifically, is there a way to know -- at run time
-- which API to use? We can do some compiler trickery to compile all
three APIs into a single Open MPI installation and then run-time
dispatch to the Right one, but this is contingent upon being able to
determine which API to dispatch to. A bunch of us have poked around on
the system (e.g., in /proc and /sys) looking for anything that indicates
which API you have, but have found nothing.
Does anyone have any suggestions here?
Many thanks for your time.
--
Paul H. Hargrove phhargr...@lbl.gov
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900