Thanks, Steve.

THP is not really what we want: it could create jitter among the ODP
threads.
Your answer is quite clear. I could definitely talk to you to pick up
some of your knowledge, but I think your answer already shows that you
understood my problem and related it to the kernel code.

At this point I think our only hope would be to get mremap support for
huge pages into the kernel.

Christophe

On 19 January 2017 at 17:31, Mike Holmes <mike.hol...@linaro.org> wrote:
> Maybe the LNG Kernel team need to pick this up as a topic ?
>
> On 19 January 2017 at 11:18, Steve Capper <steve.cap...@linaro.org> wrote:
>> On 19 January 2017 at 13:04, Christophe Milard
>> <christophe.mil...@linaro.org> wrote:
>>> Hi Steve,
>>
>> Hey Christophe,
>>
>>>
>>> Maybe you remember me, as we have been in contact before: Christophe,
>>> from the LNG ODP team (Mike Holmes' team).
>>>
>>> I have written the ODP memory allocator and I am having an issue with
>>> it: it requires that Linux processes (we call them ODP threads) be able
>>> to share memory with each other, just as normal pthreads do (an "ODP
>>> thread" can be either a Linux process or a pthread).
>>> The memory should be shareable (at the same virtual address) even if it
>>> is allocated by ODP after the processes have fork()'d.
>>>
>>> I did that the following way: as all our ODP processes are descendants
>>> of a single root process (we call it the ODP instantiation process), I
>>> pre-reserve a large virtual address space area in this process. This is
>>> done as follows:
>>>
>>> pre_reserved_zone = mmap(NULL, len, PROT_NONE, MAP_SHARED |
>>>                          MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
>>>
>>> The PROT_NONE makes sure that the physical memory is inaccessible, and
>>> hence not used.
>>>
>>> Later, when one of the Linux processes does an odp_reserve(), I want
>>> the related mmap() to map the real memory onto some part of that
>>> pre-reserved area, using MAP_FIXED:
>>>
>>> mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE, MAP_SHARED |
>>>                    MAP_FIXED | mmap_flags, fd, 0);
>>>
>>> If "start" is in the pre_reserved_zone, we know it is available in all
>>> processes, as the pre-reserved zone is inherited by all of them (they
>>> are all descendants of the instantiation process which did the
>>> pre-reservation).
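>>>
>>> Stripped of the ODP specifics, the whole scheme boils down to something
>>> like the sketch below (plain anonymous memory stands in here for the
>>> fd-backed/hugetlbfs mapping we really use, and the sizes are arbitrary):
>>>
>>> #include <stdio.h>
>>> #include <sys/mman.h>
>>>
>>> int main(void)
>>> {
>>>         size_t len = 64UL << 20;        /* pre-reserved VA window */
>>>
>>>         /* instantiation process, before any fork(): reserve virtual
>>>          * address space only, no physical memory behind it */
>>>         void *zone = mmap(NULL, len, PROT_NONE,
>>>                           MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE,
>>>                           -1, 0);
>>>         if (zone == MAP_FAILED)
>>>                 return 1;
>>>
>>>         /* later, in any descendant process: overlay real memory at a
>>>          * chosen (page-aligned) offset inside the reserved window */
>>>         void *start = (char *)zone + (1UL << 20);
>>>         void *m = mmap(start, 4096, PROT_READ | PROT_WRITE,
>>>                        MAP_SHARED | MAP_FIXED | MAP_ANONYMOUS, -1, 0);
>>>         if (m == MAP_FAILED)
>>>                 perror("overlay mmap");
>>>         else
>>>                 printf("overlaid %p inside zone %p\n", m, zone);
>>>         return 0;
>>> }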
>>>
>>> However, I noticed that, for huge pages at least, if this call fails
>>> due to a lack of huge pages, the virtual space (from start to
>>> start+size) seems to be returned to the kernel as available! I expected
>>> a failed call to leave the system unchanged, not to do half of the
>>> job...
>>
>> Unfortunately this appears to make sense due to the pluggable logic in
>> the kernel. If one mmaps a location, anything in the way is first
>> munmapped. We need to call munmap because the previous mmap may have
>> come from special driver logic (remember, one can supply an mmap handler
>> for a driver). Likewise, because the munmap may also be special, we
>> can't roll it back. The only safe thing we can do is leave the space
>> empty if the later mmap logic fails.
>> (Also, it took me a while and a very strong coffee to understand this,
>> so it certainly isn't obvious :-)).
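>>
>> The effect is easy to reproduce with a small standalone test, by the
>> way. This sketch assumes 2 MB default huge pages with too few of them
>> configured, so the MAP_FIXED mapping fails after the munmap has already
>> happened; msync() is only used as a cheap probe here, since it returns
>> ENOMEM when part of the range is unmapped:
>>
>> #include <errno.h>
>> #include <stdio.h>
>> #include <sys/mman.h>
>>
>> int main(void)
>> {
>>         size_t len = 64UL << 20;
>>
>>         /* the pre-reserved window */
>>         void *zone = mmap(NULL, len, PROT_NONE,
>>                           MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE,
>>                           -1, 0);
>>         if (zone == MAP_FAILED)
>>                 return 1;
>>
>>         printf("before: msync = %d\n", msync(zone, len, MS_ASYNC));
>>
>>         /* overlay huge pages on the window; expected to fail here if
>>          * not enough huge pages are available */
>>         if (mmap(zone, len, PROT_READ | PROT_WRITE,
>>                  MAP_SHARED | MAP_FIXED | MAP_ANONYMOUS | MAP_HUGETLB,
>>                  -1, 0) == MAP_FAILED)
>>                 perror("MAP_FIXED huge page mmap");
>>
>>         /* on the kernels discussed above, the failed mmap leaves a
>>          * hole, so this now reports -1 with errno == ENOMEM */
>>         printf("after:  msync = %d (errno %d)\n",
>>                msync(zone, len, MS_ASYNC), errno);
>>         return 0;
>> }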
>>
>>
>>> This is of course a problem, since I want my pre-reserved area to
>>> remain pre-reserved on failure!
>>> What I did (until now) is simply remake the pre-reservation (with
>>> PROT_NONE) over the specific area behind the failed call.
>>> This was OK, I thought, as concurrent accesses (from different threads)
>>> to my odp_reserve() function are mutexed.
>>> What I forgot is that the different threads can actually use malloc()
>>> or mmap() directly:
>>> if thread 1 does an odp_reserve(), fails due to a lack of huge pages
>>> (point A in the code) and re-reserves the area (point B), another
>>> thread 2 could be unlucky enough to do an mmap(NULL, ...) between
>>> thread 1's A and B, and be returned a part of my so-called pre-reserved
>>> address space :-(.
>>>
>>> So I am working on another strategy: doing a first mapping outside the
>>> pre-reserved space and, on success only, moving the resulting area
>>> (using mremap) into the pre-reserved space.
>>>
>>> The patch (from the old strategy to the new one) looks as follows:
>>> -               mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE,
>>> -                                  MAP_SHARED | MAP_FIXED | mmap_flags, fd, 0);
>>> -               /* if mapping fails, re-block the space we tried to take,
>>> -                * as it seems a mapping failure still affects what was there */
>>> -               if (mapped_addr == MAP_FAILED) {
>>> -                       mmap_flags = MAP_SHARED | MAP_FIXED |
>>> -                                    MAP_ANONYMOUS | MAP_NORESERVE;
>>> -                       mmap(start, size, PROT_NONE, mmap_flags, -1, 0);
>>> -                       mprotect(start, size, PROT_NONE);
>>> +               /* First, try a normal map. If that works, we move it
>>> +                * where it should be.
>>> +                * This is because it turned out that if a mapping fails
>>> +                * on the pre-reserved virtual address space, then the
>>> +                * pre-reserved address space that we tried to map onto
>>> +                * becomes available to the kernel again! This was not
>>> +                * according to expectations: the assumption was that if
>>> +                * a mapping fails, the system should remain unchanged,
>>> +                * but this is obviously not true (at least for huge
>>> +                * pages when exhausted).
>>> +                * So the strategy is to first map at a non-reserved
>>> +                * place (which can then be freed and returned to the
>>> +                * kernel on failure) and move it to the pre-reserved
>>> +                * space on success only.
>>> +                */
>>> +               mapped_addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
>>> +                                  MAP_SHARED | mmap_flags, fd, 0);
>>> +               if (mapped_addr != MAP_FAILED) {
>>> +                       /* If OK, remap at the right fixed location */
>>> +                       mapped_addr = mremap(mapped_addr, size, size,
>>> +                                            MREMAP_FIXED | MREMAP_MAYMOVE,
>>> +                                            start);
>>> +                       if (mapped_addr == MAP_FAILED) {
>>> +                               ODP_ERR("FIXED mremap failed!\n");
>>> +                       }
>>>
>>> Sadly, the call to mremap() seems to fail for huge pages! (no clue why)
>>> So I now don't know what to do! My first approach is not thread-safe
>>> when ODP allocations are mixed with direct Linux system calls, and the
>>> second approach does not seem to work either for huge pages.
>>> All this trying to work around something that really looks like a
>>> kernel bug: if a mapping fails, whatever the reason, why would Linux
>>> change the system state? (i.e. the unmapping of what was there remains
>>> even though the mapping that should have replaced it failed)...
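>>>
>>> For what it is worth, the very same map-then-mremap sequence seems fine
>>> with normal pages, which is what made me expect it to work for huge
>>> pages too. A standalone sketch of strategy two (again with plain shared
>>> anonymous memory standing in for the fd-backed mapping we really use):
>>>
>>> #define _GNU_SOURCE             /* for MREMAP_MAYMOVE / MREMAP_FIXED */
>>> #include <stdio.h>
>>> #include <sys/mman.h>
>>>
>>> int main(void)
>>> {
>>>         size_t zone_len = 64UL << 20, size = 4096;
>>>
>>>         /* the pre-reserved window, as set up by the instantiation
>>>          * process */
>>>         void *zone = mmap(NULL, zone_len, PROT_NONE,
>>>                           MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE,
>>>                           -1, 0);
>>>         if (zone == MAP_FAILED)
>>>                 return 1;
>>>
>>>         /* step 1: map wherever the kernel likes */
>>>         void *tmp = mmap(NULL, size, PROT_READ | PROT_WRITE,
>>>                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>>         if (tmp == MAP_FAILED)
>>>                 return 1;
>>>
>>>         /* step 2: on success only, move it into the reserved window.
>>>          * MREMAP_FIXED replaces whatever is at the target address, so
>>>          * the PROT_NONE reservation is consumed by the real mapping. */
>>>         void *fixed = mremap(tmp, size, size,
>>>                              MREMAP_MAYMOVE | MREMAP_FIXED, zone);
>>>         if (fixed == MAP_FAILED)
>>>                 perror("mremap");
>>>         else
>>>                 printf("moved %p -> %p (inside zone)\n", tmp, fixed);
>>>         return 0;
>>> }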
>>
>> Looking at the code, it is apparent that the mremap logic is actually
>> missing for HugeTLB, thus mremap will fail in this case (one can see
>> the failure path in vma_to_resize). A chap from Google tried to address
>> this in 2011 but it didn't get picked up:
>> https://lkml.org/lkml/2011/11/3/358
>>
>> I'm not sure why.
>>
>>>
>>> Any ideas?
>>
>> Good question :-).
>>
>> One thing I *did* see is that THP *is* supported for mremap. Also, on
>> recent kernels (4.8 upwards, newer being better) THP can work on
>> pagecache pages hosted on shmem.
>>
>> Is THP something that is workable for you? (One can limit THP to just
>> controlled areas via madvise).
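>>
>> Limiting THP to a controlled area looks roughly like the sketch below
>> (assuming a kernel with THP for shmem, and with the sysfs tunables set
>> so that madvise-driven THP is allowed):
>>
>> #include <stdio.h>
>> #include <sys/mman.h>
>>
>> int main(void)
>> {
>>         size_t len = 16UL << 20;    /* multiple of the 2 MB THP size */
>>
>>         /* shared anonymous memory is shmem-backed, which is what the
>>          * THP-for-shmem work in 4.8+ applies to */
>>         void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
>>                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>         if (p == MAP_FAILED)
>>                 return 1;
>>
>>         /* opt just this region in to THP; everything else can stay on
>>          * the "madvise" policy and keep using normal pages */
>>         if (madvise(p, len, MADV_HUGEPAGE))
>>                 perror("madvise(MADV_HUGEPAGE)");
>>
>>         *(char *)p = 1;             /* fault the memory in */
>>         printf("region at %p opted in to THP\n", p);
>>         return 0;
>> }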
>>
>> If you want, we can have a quick chat? If so, can you please grab me on
>> Google HO tomorrow after 13:30 UTC+0? (Otherwise I'll be on holiday for
>> two weeks, from 21st Jan to 5th Feb.)
>>
>> Cheers,
>> --
>> Steve
>
>
>
> --
> Mike Holmes
> Program Manager - Linaro Networking Group
> Linaro.org │ Open source software for ARM SoCs
> "Work should be fun and collaborative, the rest follows"
