On Aug 3, 2012, at 6:24 PM, Paul Kapinos wrote:

> testing our well-known example of the registered memory problem (see
> http://www.open-mpi.org/community/lists/users/2012/02/18565.php) on
> freshly-installed 1.6.1rc2, found out that "Fall back to send/receive
> semantics" did not work always it. However the behaviour has changed:
>
> 1.5.3. and older: MPI processes hang and block the IB interface(s) forever
>
> 1.6.1rc2: a) MPI processes run through (if the chunk size is less than 8Gb)
>              with or without a warning; or
>           b) MPI processes die (if the chunk size is more than 8Gb)

We talked about this mail today on our weekly teleconference.

That's odd. Looking at your output files, I see that the processes failed
when trying to create a queue pair. Let me explain...

Our newest stop-gap scheme on the 1.6 branch is as follows:

- figure out how much physical RAM is on the machine
- take 85% of that number
- M = (85% of physical_memory / num_mpi_procs_on_machine)
- don't let any individual MPI process register more than M bytes of memory

This is a heuristic. The idea is that we leave 15% of memory available to
the OS and other OpenFabrics services running on the machine (IPoIB, subnet
management, etc.).

However, there is a variable OMPI doesn't count -- the amount of registered
memory consumed by the metadata of a queue pair. OMPI creates queue pairs
lazily (in an attempt to reduce registered memory consumption, which is
fairly ironic here ;-) ), so we could still run out of registered memory and
only then try to create a new QP (e.g., the first time MPI process A sends
to B). That QP creation can fail if there is no more registered memory.

That's the type of error that I see in your log files (QP creation failure).
But with 15% of RAM left, we're greatly surprised to see this kind of error.
Perhaps registering 8+GB buffers does something in the OpenFabrics
registration system that we're unaware of (making the overall available
registered memory deplete faster). Huh.

More below.

> Note that the same program which die in (b) run fine over IPoIB (-mca btl
> ^openib). However, the performance is very bad in this case... some 1100 sec.
> instead of about a minute.

Yep, that makes sense. IPoIB is quite inefficient.

> Reproducing: compile attached file and let it run on nodes with >=24GB with
> log_num_mtt     : 20
> log_mtts_per_seg: 3
> (=32Gb, our default values):
> $ mpiexec ....<one proc per node> .... a.out 1080000000 1080000001

So I'm not 100% clear on what you mean here: when you set the OFED params to
allow registration of more memory than you have physically, does the problem
go away?

From your log messages, the warning messages were from machines with nearly
100GB RAM but only 32GB register-able. But only one of those was the same as
one that showed QP creation failures. So I'm not clear which was which.

Regardless: can you pump the MTT params up to allow registering all of
physical memory on those machines, and see if you get any failures?

To be clear: we're trying to determine if we should spend more effort on
making OMPI work properly in low-registered-memory-availability scenarios,
or whether we should just emphasize "go reset your MTT parameters to allow
registering all of physical memory."

> Well, we know about the need to raise the values of one of these parameters,
> but I wanted to let you to know that your workaround for the problem is still
> not 100% perfect but only 99%.

Ok, good. Many thanks for your patience with all of this.

> P.S: A note about the informative warning:
> --------------------------------------------------------------------------
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory.
> ....
> Registerable memory: 32768 MiB
> Total memory: 98293 MiB
> --------------------------------------------------------------------------
> On node with 24 GB this warning did not came around, although the max. size
> of registered memory (32GB) is only 1.5x of RAM, but in
> http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
> at least the 2x RAM size is recommended.
>
> Should this warning not came out in all cases when registered memory < 2x RAM?

You're correct -- perhaps this is bad wording in the FAQ. As far as I
understand it, it's only necessary to be able to register all of physical
memory.
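
To make the numbers in this thread concrete, here is a small Python sketch.
It is an illustration only, not OMPI code. It assumes the registerable-memory
formula given in the Open MPI FAQ (2^log_num_mtt * 2^log_mtts_per_seg *
page_size) and a 4 KiB page size; under those assumptions, log_num_mtt=20 and
log_mtts_per_seg=3 give the 32 GB quoted above. The function names and the
figure of 12 processes per node are made up for the example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=PAGE_SIZE):
    """Maximum registerable memory, per the formula cited in the Open MPI FAQ."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


def per_process_limit(physical_bytes, num_local_procs):
    """Sketch of the stop-gap heuristic: 85% of RAM split across local procs."""
    return physical_bytes * 85 // 100 // num_local_procs


def min_log_num_mtt(physical_bytes, log_mtts_per_seg, factor=1,
                    page_size=PAGE_SIZE):
    """Smallest log_num_mtt that allows registering factor * physical memory."""
    target = factor * physical_bytes
    log_num_mtt = 0
    while registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size) < target:
        log_num_mtt += 1
    return log_num_mtt


if __name__ == "__main__":
    MiB = 1 << 20
    total = 98293 * MiB  # "Total memory" from the warning quoted above

    # Default parameters from the thread: 32768 MiB registerable.
    print(registerable_bytes(20, 3) // MiB, "MiB registerable")

    # Hypothetical node running 12 MPI processes.
    print(per_process_limit(total, 12) // MiB, "MiB per process (M)")

    # log_num_mtt needed to cover 1x and 2x physical memory.
    print("1x RAM:", min_log_num_mtt(total, 3, factor=1))
    print("2x RAM:", min_log_num_mtt(total, 3, factor=2))
```

Because log_num_mtt is an exponent, the registerable size can only move in
powers of two, so the smallest setting that covers all of physical memory
usually overshoots it.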
Mellanox did, however, recommend being able to register at least 2x physical
memory.

...that being said, I have never gotten a clear explanation of what exactly
those two parameters are, i.e., why you would adjust one and not the other.
Mellanox advised us (the OMPI developers) to adjust log_num_mtt, but I never
found out why.

We'll continue to pester Mellanox to try to get a good answer as to why 2x
is recommended. :-) If we get a good answer, we'll update the FAQ wording
and/or the limits at which that warning is displayed.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/