On Aug 3, 2012, at 6:24 PM, Paul Kapinos wrote:

> testing our well-known example of the registered memory problem (see 
> http://www.open-mpi.org/community/lists/users/2012/02/18565.php) on 
> freshly-installed 1.6.1rc2, found out that "Fall back to send/receive 
> semantics" did not work always it. However the behaviour has changed:
> 
> 1.5.3. and older: MPI processes hang and block the IB interface(s) forever
> 
> 1.6.1rc2: a) MPI processes run through (if the chunk size is less than 8Gb) 
> with or without a warning; or
>          b) MPI processes die  (if the chunk size is more than 8Gb)

We talked about this mail today on our weekly teleconference.

That's odd.  Looking at your output files, I see that the processes fail when 
trying to create a queue pair.  Let me explain...

Our newest stop-gap scheme on the 1.6 branch is as follows:

- figure out how much physical RAM is on the machine
- take 85% of that number
- M = (85% of physical_memory / num_mpi_procs_on_machine)
- don't let any individual MPI process register more than M bytes of memory

This is a heuristic.  The idea is that we leave 15% of memory available to the 
OS and other OpenFabrics services running on the machine (IPoIB, subnet 
management, ...etc.).  However, there is one variable that OMPI doesn't count: 
the registered memory consumed by the metadata of each queue pair. 
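
To make the arithmetic of the heuristic above concrete, here is a minimal 
sketch.  It is not the actual OMPI code; the function name and the integer 
rounding are assumptions made for illustration, and it ignores queue pair 
metadata just as the heuristic does.

```python
def per_process_registration_cap(physical_bytes, procs_on_node, percent=85):
    """Return M, the per-process registered-memory cap in bytes.

    Hypothetical helper: M = (percent% of physical memory) / num local procs.
    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    if procs_on_node < 1:
        raise ValueError("need at least one MPI process on the node")
    usable = physical_bytes * percent // 100  # leave the rest to the OS etc.
    return usable // procs_on_node


if __name__ == "__main__":
    node_ram = 24 * 2**30  # a 24 GiB node, as in the reproduction case
    for nprocs in (1, 12):
        cap = per_process_registration_cap(node_ram, nprocs)
        print(f"{nprocs:2d} proc(s): M = {cap / 2**30:.2f} GiB")
```

With one process per node, nearly all of the 85% goes to that single process, 
so whatever queue pairs need has to come out of the remaining 15%.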
 

When you take into account the fact that OMPI creates queue pairs lazily (in an 
attempt to reduce registered memory consumption, which is fairly ironic here 
;-) ), we could still run out of registered memory and then try to create a new 
QP later (e.g., the first time MPI process A sends to B).  This QP could fail 
to be created if there is no more registered memory.

That's the type of error that I see in your log files (QP creation fail).

But with 15% of RAM left, we're greatly surprised to see this kind of error.  
Perhaps registering 8+GB buffers does something in the OpenFabrics registration 
system that we're unaware of (to make overall available registered memory 
deplete faster).  Huh.

More below.

> Note that the same program which die in (b) run fine over IPoIB (-mca btl 
> ^openib). However, the performance is very bad in this case... some 1100 sec. 
> instead of about a minute.

Yep, that makes sense. IPoIB is quite inefficient.  

> Reproducing: compile attached file and let it run on nodes with >=24GB with
>    log_num_mtt     : 20
>    log_mtts_per_seg: 3
> (=32Gb, our default values):
> $ mpiexec ....<one proc per node> .... a.out 1080000000 1080000001

So I'm not 100% clear on what you mean here: when you set the OFED params to 
allow registration of more memory than you have physically, does the problem go 
away?

From your log messages, the warning messages were from machines with nearly 
100GB RAM but only 32GB register-able.  But only one of those was the same as 
one that showed QP creation failures.  So I'm not clear which was which.

Regardless: can you pump the MTT params up to allow registering all of physical 
memory on those machines, and see if you get any failures?
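
For reference, a rough sketch of how the two MTT parameters relate to the 
register-able limit.  It uses the formula from the OMPI FAQ, max_reg_mem = 
(2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE, and assumes a 4 KiB page 
size; the function names are made up for this example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=PAGE_SIZE):
    """Register-able memory implied by the two MTT module parameters."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


def min_log_num_mtt(target_bytes, log_mtts_per_seg, page_size=PAGE_SIZE):
    """Smallest log_num_mtt that allows registering at least target_bytes."""
    bytes_per_segment = (1 << log_mtts_per_seg) * page_size
    segments = -(-target_bytes // bytes_per_segment)  # ceiling division
    return max(0, (segments - 1).bit_length())        # ceil(log2(segments))


if __name__ == "__main__":
    MiB = 2**20
    # Values from the report: log_num_mtt=20, log_mtts_per_seg=3
    print(max_registerable_bytes(20, 3) // MiB, "MiB register-able")
    # Node from the warning text: 98293 MiB of physical memory
    total = 98293 * MiB
    print("1x RAM needs log_num_mtt >=", min_log_num_mtt(total, 3))
    print("2x RAM needs log_num_mtt >=", min_log_num_mtt(2 * total, 3))
```

Under these assumptions the reported defaults give 32768 MiB, which matches the 
"Registerable memory" line in the warning.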

To be clear: we're trying to determine if we should spend more effort on making 
OMPI work properly in low-registered-memory-availability scenarios, or whether we 
should just emphasize "go reset your MTT parameters to allow registering all of 
physical memory."

> Well, we know about the need to raise the values of one of these parameters, 
> but I wanted to let you to know that your workaround for the problem is still 
> not 100% perfect but only 99%.

Ok, good.  Many thanks for your patience with all of this.

> P.S: A note about the informative warning:
> --------------------------------------------------------------------------
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory.
> ....
>  Registerable memory:     32768 MiB
>  Total memory:            98293 MiB
> --------------------------------------------------------------------------
> On node with 24 GB this warning did not came around, although the max. size 
> of registered memory (32GB) is only 1.5x of RAM, but in
> http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
> at least the 2x RAM size is recommended.
> 
> Should this warning not came out in all cases when registered memory < 2x RAM?


You're correct -- perhaps this is bad wording in the FAQ.  As far as I 
understand it, it's only necessary to be able to register all of physical 
memory.  Although Mellanox did recommend being able to register at least 2x 
physical memory.
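
To illustrate the difference between the two thresholds being discussed, here 
is a small sketch.  It is not the check OMPI actually performs; the function 
and its factor argument are hypothetical, and the numbers are the ones from 
this thread.

```python
def would_warn(registerable_mib, total_mib, factor=1):
    """True if register-able memory is below factor * physical memory.

    Hypothetical predicate: factor=1 means "can register all of RAM",
    factor=2 corresponds to the 2x recommendation mentioned in the FAQ.
    """
    return registerable_mib < factor * total_mib


if __name__ == "__main__":
    registerable = 32768  # MiB, from log_num_mtt=20 / log_mtts_per_seg=3
    for total in (24 * 1024, 98293):  # the 24 GB node and the ~96 GB node
        for factor in (1, 2):
            print(f"total={total} MiB, factor={factor}x:",
                  "warn" if would_warn(registerable, total, factor) else "ok")
```

A 1x threshold would be consistent with what you observed: no warning on the 
24 GB node, a warning on the 98293 MiB node.  A 2x threshold would warn on both.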

...that being said, I have never gotten a clear explanation of what exactly 
those two parameters are.  I.e., why you would adjust one and not the other, 
etc.  Mellanox advised us (the OMPI developers) to adjust log_num_mtt, but I 
never found out why.

We'll continue to pester Mellanox to try to get a good answer as to why 2x is 
recommended.  :-)  If we get a good answer, we'll update the FAQ wording and/or 
the limits at which that warning is displayed.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

