On 19 January 2017 at 13:04, Christophe Milard
<christophe.mil...@linaro.org> wrote:
> Hi Steve,

Hey Christophe,

>
> Maybe you remember me, as we have been in contact before: Christophe,
> from the LNG ODP team (Mike Holmes' team).
>
> I have written the ODP memory allocator and I am having an issue with
> it: it must allow Linux processes (we call them ODP threads) to share
> memory with each other, just as normal pthreads do. (An "ODP thread"
> can be either a Linux process or a pthread.)
> The memory should be shareable (at the same virtual address) even if
> it is allocated by ODP after the processes have fork()'d.
>
> I did that the following way: as all our ODP processes are descendants
> of a single root process (we call it the ODP instantiation process), I
> pre-reserve a large virtual address space area in that process. This
> is done as follows:
>
>  pre_reserved_zone = mmap(NULL, len, PROT_NONE, MAP_SHARED |
>                           MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
>
> The PROT_NONE makes sure the memory is inaccessible, hence no
> physical memory gets used.
>
> Later, when one of the Linux processes does an odp_reserve(), the
> related mmap() maps the real memory onto some part of that
> pre-reserved area, using MAP_FIXED:
>
> mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE,
>                    MAP_SHARED | MAP_FIXED | mmap_flags, fd, 0);
>
> If "start" is in the pre_reserved_zone, we know it is available in
> all processes, as the pre-reserved zone is inherited by all of them
> (because they are all descendants of the instantiation process which
> did the pre-reservation).
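> 
> For reference, a stripped-down sketch of the whole pattern (error
> handling and the real ODP bookkeeping omitted; the sizes and names
> here are just illustrative):
> 
> #include <sys/mman.h>
> 
> #define PRE_RESERVE_LEN (64UL * 1024 * 1024 * 1024) /* example: 64 GB */
> 
> static void *pre_reserved_zone;
> 
> /* in the ODP instantiation process, before any fork(): */
> static int pre_reserve(void)
> {
>         pre_reserved_zone = mmap(NULL, PRE_RESERVE_LEN, PROT_NONE,
>                                  MAP_SHARED | MAP_ANONYMOUS |
>                                  MAP_NORESERVE, -1, 0);
>         return (pre_reserved_zone == MAP_FAILED) ? -1 : 0;
> }
> 
> /* later, in any descendant process, at odp_reserve() time:
>  * "start" lies within pre_reserved_zone, so the address is known to
>  * be free in every ODP process */
> static void *map_at(void *start, size_t size, int fd, int mmap_flags)
> {
>         return mmap(start, size, PROT_READ | PROT_WRITE,
>                     MAP_SHARED | MAP_FIXED | mmap_flags, fd, 0);
> }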
>
> However, I noticed that, for huge pages at least, if this call fails
> due to a lack of huge pages, the virtual space (from start to
> start+size) seems to be returned to the kernel as available! I
> expected a failed call to leave the system unchanged, not to do half
> of the job...

Unfortunately this appears to make sense due to the pluggable mmap
logic in the kernel. If one mmaps over a location with MAP_FIXED,
anything in the way is first munmap()'ed. We need to do that unmap up
front, as the previous mapping may have come from special driver logic
(remember one can supply an mmap handler for a driver). Likewise, since
that munmap may itself have been special, we can't roll it back. The
only safe thing we can do is leave the space empty if the later mmap
logic fails.
(Also it took me a while and a very strong coffee to understand this,
so it certainly isn't obvious :-).)
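
To make it concrete, this is roughly the sequence your allocator ends
up running (sketch only; "hfd" stands for your hugetlbfs fd, and the
assumption is that the huge page pool is exhausted so the second mmap
fails):

    void *zone = mmap(NULL, len, PROT_NONE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    void *p = mmap(zone, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, hfd, 0);
    if (p == MAP_FAILED) {
            /* by this point the kernel has already munmap'ed
             * [zone, zone+len): the PROT_NONE reservation is gone from
             * /proc/self/maps, and a later mmap(NULL, ...) from any
             * thread is free to be placed there */
    }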


> This is of course a problem, since I want my pre-reserved area to
> remain pre-reserved on failure!
> What I did (until now) was simply to redo the pre-reservation (with
> PROT_NONE) on the specific area behind the failed call.
> This was OK, I thought, as concurrent accesses (from different
> threads) to my odp_reserve() function are mutexed.
> What I forgot is that the different threads can also use malloc() or
> mmap() directly:
> if thread 1 does an odp_reserve(), fails due to a lack of huge pages
> (point A in the code) and re-pre-reserves the area (point B), another
> thread 2 could be unlucky enough to do an mmap(NULL, ...) between
> thread 1's A and B, and be handed a part of my so-called pre-reserved
> address space :-(.
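> 
> In other words the window looks like this (simplified; the ODP mutex
> does not help here because thread 2 is not going through ODP at all):
> 
>         /* thread 1, inside odp_reserve(), holding the ODP lock */
>         addr = mmap(start, size, PROT_READ | PROT_WRITE,
>                     MAP_SHARED | MAP_FIXED | mmap_flags, fd, 0);
>         if (addr == MAP_FAILED) {
>                 /* point A: [start, start+size) is now unmapped.
>                  * If thread 2 calls malloc() or mmap(NULL, ...) right
>                  * here, the kernel may hand it part of that range. */
> 
>                 /* point B: re-pre-reserve, possibly too late */
>                 mmap(start, size, PROT_NONE,
>                      MAP_SHARED | MAP_FIXED | MAP_ANONYMOUS |
>                      MAP_NORESERVE, -1, 0);
>         }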
>
> So I am working on another strategy: doing a first mapping outside
> the pre-reserved space and, on success only, moving the resulting
> area (using mremap) into the pre-reserved space.
>
> The patch (from the old strategy to the new one) looks as follows:
> -               mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE,
> -                                  MAP_SHARED | MAP_FIXED | mmap_flags, fd, 0);
> -               /* if mapping fails, re-block the space we tried to take
> -                * as it seems a mapping failure still affects what was there?? */
> -               if (mapped_addr == MAP_FAILED) {
> -                       mmap_flags = MAP_SHARED | MAP_FIXED |
> -                                    MAP_ANONYMOUS | MAP_NORESERVE;
> -                       mmap(start, size, PROT_NONE, mmap_flags, -1, 0);
> -                       mprotect(start, size, PROT_NONE);
> +               /* First, try a normal map and, if that works, move it
> +                * where it should be:
> +                * it turned out that if a mapping fails on the
> +                * pre-reserved virtual address space, the part of the
> +                * pre-reserved space which we tried to map onto becomes
> +                * available to the kernel again! This was not according
> +                * to expectations: the assumption was that if a mapping
> +                * fails, the system should remain unchanged, but this is
> +                * obviously not true (at least for huge pages when
> +                * exhausted).
> +                * So the strategy is to first map at a non-reserved place
> +                * (which can then be freed and returned to the kernel on
> +                * failure) and move it to the pre-reserved space on
> +                * success only.
> +                */
> +               mapped_addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
> +                                  MAP_SHARED | mmap_flags, fd, 0);
> +               if (mapped_addr != MAP_FAILED) {
> +                       /* if OK, remap at the right (fixed) location */
> +                       mapped_addr = mremap(mapped_addr, size, size,
> +                                            MREMAP_FIXED | MREMAP_MAYMOVE,
> +                                            start);
> +                       if (mapped_addr == MAP_FAILED) {
> +                               ODP_ERR("FIXED mremap failed!\n");
> +                       }
>
> Sadly, the call to mremap() seems to fail for huge pages! (No clue
> why.) So I now don't know what to do!
> My first approach is not thread safe when ODP allocations are mixed
> with direct Linux system calls. The second approach does not seem to
> work for huge pages either.
> All this trying to work around something that really looks like a
> kernel bug: if a mapping fails - whatever the reason - why would Linux
> change the system state (i.e. the unmapping of what was there remains
> in effect even though the mapping that should have replaced it
> failed)?

Looking at the code it is apparent that mremap logic is actually
missing for HugeTLB, so mremap will fail in this case (one can see the
failure path in vma_to_resize()). A chap from Google tried to address
this in 2011 but it didn't get picked up:
https://lkml.org/lkml/2011/11/3/358

I'm not sure why.
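
For what it's worth, the limitation is easy to reproduce from
userspace with something like the following (rough, untested sketch;
it assumes at least one free 2 MB huge page and that the kernel
supports MAP_HUGETLB for anonymous mappings):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t sz = 2UL * 1024 * 1024;  /* one 2 MB huge page */

            void *src = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB,
                             -1, 0);
            if (src == MAP_FAILED)
                    return 1;

            /* plain reservation to move the huge page mapping into,
             * mimicking the ODP pre-reserved zone */
            void *dst = mmap(NULL, sz, PROT_NONE,
                             MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE,
                             -1, 0);
            if (dst == MAP_FAILED)
                    return 1;

            /* on kernels lacking HugeTLB support in the mremap path,
             * this fails (vma_to_resize() rejects the hugetlb VMA) */
            void *moved = mremap(src, sz, sz,
                                 MREMAP_FIXED | MREMAP_MAYMOVE, dst);
            if (moved == MAP_FAILED)
                    printf("mremap: %s\n", strerror(errno));
            return 0;
    }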

>
> Any ideas?

Good question :-).

One thing I *did* see is that THP *is* supported by mremap. Also, THP
can work on pagecache pages hosted on shmem on recent kernels, 4.8
upwards (newer being better).

Is THP something that is workable for you? (One can limit THP to just
controlled areas via madvise).
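
(Roughly what I have in mind, just as a sketch; whether THP actually
backs the range also depends on the transparent_hugepage settings in
sysfs and, for MAP_SHARED/shmem memory, on running 4.8 or newer:)

    /* reserve the shared zone as before, but with normal pages,
     * then opt just this range in to THP */
    void *zone = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (zone != MAP_FAILED)
            madvise(zone, len, MADV_HUGEPAGE);

Since the zone would then already live at its final address, no mremap
would be needed at all in that scheme.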

If you want, we can have a quick chat? If so, can you please grab me
on Google HO tomorrow after 13:30 UTC? (Otherwise I'll be on holiday
for two weeks, 21st Jan - 5th Feb.)

Cheers,
--
Steve
