Gilles,

I managed to get snapshots of the /proc/<pid>/status entries for all liggghts jobs, but Cpus_allowed is similar no matter whether the system was cold or warm booted.
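A comparison like this can be scripted. The following is only a minimal Python sketch under stated assumptions: the helper names parse_status and snapshot are made up for illustration, and it assumes the processes show up with the name "liggghts" in the Name field of /proc/<pid>/status. It groups the matching PIDs by their Cpus_allowed_list value.

```python
import os
from collections import defaultdict


def parse_status(text):
    """Return the 'Key: value' fields of a /proc/<pid>/status file as a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def snapshot(name="liggghts", proc_root="/proc"):
    """Group the PIDs of processes called `name` by their Cpus_allowed_list.

    `name` is an assumption: adjust it to whatever the Name field shows
    for your binary.
    """
    groups = defaultdict(list)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                fields = parse_status(fh.read())
        except OSError:
            continue  # process exited (or is unreadable) while scanning
        if fields.get("Name") == name:
            groups[fields.get("Cpus_allowed_list", "?")].append(int(entry))
    return dict(groups)


if __name__ == "__main__":
    for cpus, pids in sorted(snapshot().items()):
        print(f"Cpus_allowed_list={cpus}: {len(pids)} processes")
```

Running this once after a cold boot and once after a warm boot, and comparing the two outputs, would show whether the affinity masks differ between the two cases.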
Then I looked around in /proc/ and found sched_debug. This at least shows that the liggghts processes are not spread over all cores: some cores have just one of them, some have none, and some have many.

I agree that the processes not being spread over all cores is a consequence and not the root cause. This means I now need to find out how the kernel scheduler decides which core a process should run on, and why it can spread 48 tasks over 48 cores when I cold boot the machine but can't when I warm boot it. So I guess I have to take this issue to the Linux kernel mailing list.

Another thing that points towards the kernel: yesterday I installed a newer 4.4.0 kernel on the machine, and the problem is still there, but it is not as bad as with the 4.2 kernel.

I also tried mpirun -mca... but that didn't change anything.

Thanks for your input anyway. At least I now have a sched_debug snapshot, which may be helpful in the further investigation.

Regards
Rainer

On 22.03.2016 at 14:38, Gilles Gouaillardet wrote:
> Rainer,
>
> a first step could be to gather /proc/pid/status for your 48 tasks.
> then you can
> grep Cpus_allowed_list
> and see if you find something suspicious.
>
> if your processes are idling, then the scheduler might assign them to
> the same core.
> in this case, your processes not being spread is a consequence and not a
> root cause.
>
> just to make sure there are no strange side effects, could you
> mpirun --mca btl sm,self ...
>
> Cheers,
>
> Gilles
>
> On Tuesday, March 22, 2016, Rainer Koenig <rainer.koe...@ts.fujitsu.com> wrote:
>
> > On 17.03.2016 at 10:40, Ralph Castain wrote:
> > > Just some thoughts offhand:
> > >
> > > * what version of OMPI are you using?
> >
> > dpkg -l openmpi-bin says 1.6.5-8 from Ubuntu 14.04.
> >
> > > * are you saying that after the warm reboot, all 48 procs are
> > > running on a subset of cores?
> >
> > Yes.
> > After a cold boot all 48 processes are spread over all 48 cores
> > and all cores show up as almost 100% in the htop CPU meter.
> >
> > After a warm boot, the 48 processes are spread over just a few cores and
> > the rest of the system is idling.
> >
> > > * it sounds like some of the cores have been marked as “offline”
> > > for some reason. Make sure you have hwloc installed on the machine,
> > > and run “lstopo” and see if that is the case
> >
> > I tried lstopo, but the graphics that I got look almost identical.
> > The visible difference is in the part of the topology for the graphics
> > adapter and the LAN cards. The path to the graphics adapter shows the
> > numbers 4,0 twice above the lines, and the path to eth0 shows the
> > numbers 0,2 twice above the lines. lstopo for the warm boot looks
> > otherwise identical, but those small numbers are missing.
> >
> > I also tried hwloc-gather-topology and diffed the two results. There
> > is nothing special to see: there are differences in /proc/stats/ and
> > /proc/cpuinfo, but they are just other values.
> >
> > Something is obviously wrong at a low level, but I'm still struggling to
> > find it. :-/
> >
> > Rainer

-- 
Dipl.-Inf. (FH) Rainer Koenig
Project Manager Linux Clients
Dept. PDG WPS R&D SW OSE

Fujitsu Technology Solutions
Bürgermeister-Ullrich-Str. 100
86199 Augsburg
Germany

Telephone: +49-821-804-3321
Telefax: +49-821-804-2131
Mail: mailto:rainer.koe...@ts.fujitsu.com

Internet ts.fujtsu.com
Company Details ts.fujitsu.com/imprint.html