I wouldn’t expect this technique to work the way you want, even if it did start the jobs in the right places.
The L3 cache will have <some> associativity, but I doubt it is 15-way, so there may be random collisions in the cache between the datasets of the different copies of your program. There is enough aggregate space, but no assurance the system will figure out how to tile it correctly.

It would probably work more reliably to write a single program with a 30 MB array, then start 15 threads, giving each thread a contiguous 2 MB chunk of that array. That way you are assured of virtually distinct addresses, and if you can, for example, mmap a 30 MB file using hugetlbfs, you may be able to get physically contiguous addresses as well. Any reasonable cache design will run that with no collisions. You can use the library functions to set CPU affinity for each thread. You can run two copies of this thing, one per socket, using cpuset or numactl.

> On 2015, Jul 23, at 3:03 PM, mathog <[email protected]> wrote:
>
> Dell with 2 CPU x 12 core x 2 threads, shows up in procinfo as 48 CPUs.
>
> Trying to run 30 processes, 1 each on different "CPU"s, by starting them one
> at a time with
>
>    numactl -C 1-30 /$PATH/program #args...
>
> When 30 have started, the script spins waiting for one to exit, then another
> is started. "top" is showing some of these are running at 50% CPU, so they
> are being started on a CPU which already has a job going. I can see where
> that would happen, since there doesn't seem to be anything in numactl about
> load balancing. The thing is, these processes are _staying_ on the same CPU,
> never migrating to another. That I don't understand. I would have thought
> numactl sets some mask on the process restricting the CPUs it can move to,
> but would not otherwise affect it, so the OS should migrate it when it sees
> this situation. In practice it seems to leave it running on whichever CPU it
> starts on. Or does Linux not migrate processes when they are heavily loading
> a single CPU, only when they run out of memory???
> Also "perf top" shows 81% for the program and 13% for numactl.
>
> The goal here is to carefully divvy up the load so that exactly 15 jobs run
> on each NUMA zone, since then the data in all the inner loops will fit
> within the 30 MB of L3 cache on each CPU. If it puts 17 on one and 13 on the
> other, the inner-loop data won't fit and performance slows down
> dramatically. Looks like I need to keep track of which job is running where
> and numactl-lock it to that node. (I don't think there is a queue system on
> this machine at present.)
>
> Thanks,
>
> David Mathog
> [email protected]
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
