Thanks, Steve. THP is not really what we want: it could create jitter among the ODP threads, which we don't want. Your answer is quite clear. I would definitely like to talk to you to pick up some of your knowledge, but I actually think your answer already shows that you understood my problem and could relate it to the kernel code.
At this point I think our only hope would be to get mremap support for huge pages in the kernel.

Christophe

On 19 January 2017 at 17:31, Mike Holmes <mike.hol...@linaro.org> wrote:
> Maybe the LNG kernel team need to pick this up as a topic?
>
> On 19 January 2017 at 11:18, Steve Capper <steve.cap...@linaro.org> wrote:
>> On 19 January 2017 at 13:04, Christophe Milard
>> <christophe.mil...@linaro.org> wrote:
>>> Hi Steve,
>>
>> Hey Christophe,
>>
>>> Maybe you remember me, as we have been in contact before. Christophe,
>>> from the LNG ODP team (Mike Holmes' team).
>>>
>>> I have written the ODP memory allocator and I am having an issue with
>>> it: it has a requirement that Linux processes (we call them ODP
>>> threads) have to be able to share memory with each other, as normal
>>> pthreads do. (An "ODP thread" can be either a Linux process or a
>>> pthread.)
>>> The memory should be shareable (at the same virtual address) even if
>>> it is allocated by ODP after the processes have fork()'d.
>>>
>>> I did that the following way: as all our ODP processes are descendants
>>> of a single root process (we call it the ODP instantiation process), I
>>> pre-reserve a large virtual address space area in this process. This
>>> is done as follows:
>>>
>>> pre_reserved_zone = mmap(NULL, len, PROT_NONE, MAP_SHARED |
>>>                          MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
>>>
>>> The PROT_NONE makes sure that the physical memory is inaccessible,
>>> hence not used.
>>>
>>> Later, when one of the Linux processes does an odp_reserve(), in the
>>> related mmap() I want to map the real memory onto some part of that
>>> pre-reserved area, using MAP_FIXED:
>>>
>>> mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE, MAP_SHARED |
>>>                    MAP_FIXED | mmap_flags, fd, 0);
>>>
>>> If "start" is in the pre_reserved_zone, we know it is available in all
>>> processes, as the pre-reserved zone is inherited by all of them
>>> (because they are all descendants of the instantiation process which
>>> did this pre-reservation).
>>>
>>> However, I noticed that, for huge pages at least, if this call fails
>>> due to a lack of huge pages, the virtual space (from start to
>>> start+size) seems to be returned as available to the kernel! I
>>> expected a failed call to leave the system unchanged, not to do half
>>> of the job...
>>
>> Unfortunately this appears to make sense due to the pluggable logic in
>> the kernel. If one mmaps a location, anything in the way is first
>> munmap'ed. We need to call munmap as the previous mmap may have come
>> from special driver logic (remember one can supply an mmap handler for
>> a driver). Likewise, because the munmap may also be special, we can't
>> roll it back. The only safe thing we can do is leave the space empty
>> if the later mmap logic fails.
>> (Also, it took me a while and a very strong coffee to understand this,
>> so it certainly isn't obvious :-).)
>>
>>> This is of course a problem, since I want my pre-reserved area to
>>> remain pre-reserved on failure!
>>> What I did (until now) is simply to remake the pre-reservation
>>> (with PROT_NONE) on the specific area behind the failed call.
>>> This was OK, I thought, as concurrent accesses (from different
>>> threads) to my odp_reserve() function are mutexed.
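The "remake the pre-reservation on failure" workaround described above reads roughly like this (a sketch; `map_fixed_or_rereserve` is a hypothetical name, and as the thread goes on to explain, it is not safe against other threads calling mmap(NULL, ...) in the window after the failure):

```c
#define _GNU_SOURCE		/* Linux-specific mmap flags */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Try to map 'fd' at a fixed address inside the pre-reserved zone.
 * If the MAP_FIXED mmap fails it may have already unmapped what was
 * at 'start', so immediately re-cover the hole with an inaccessible
 * dummy mapping.  NOTE: between the failed mmap and the re-reservation
 * another thread's mmap(NULL, ...) can grab the range. */
static void *map_fixed_or_rereserve(void *start, size_t size, int fd)
{
	void *addr = mmap(start, size, PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_FIXED, fd, 0);
	if (addr != MAP_FAILED)
		return addr;
	/* re-block the range so it stays reserved (racy, see above) */
	mmap(start, size, PROT_NONE,
	     MAP_SHARED | MAP_FIXED | MAP_ANONYMOUS | MAP_NORESERVE,
	     -1, 0);
	return NULL;
}
```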
>>> What I forgot is that the different threads can actually use malloc()
>>> or mmap() directly:
>>> If thread 1 does an odp_reserve(), fails on lack of huge pages
>>> (point A in the code) and re-pre-reserves the area (point B), another
>>> thread 2 could be unlucky enough to do an mmap(NULL, ...) between
>>> thread 1's A and B, and be returned a part of my so-called
>>> pre-allocated address space :-(.
>>>
>>> So I am working on another strategy: doing a first mapping outside
>>> the pre-allocated space and, on success only, moving the resulting
>>> area (using mremap) into the pre-allocated space.
>>>
>>> The patch (from the old strategy to the new one) looks as follows:
>>>
>>> -       mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE,
>>> -                          MAP_SHARED | MAP_FIXED | mmap_flags, fd, 0);
>>> -       /* if mapping fails, re-block the space we tried to take
>>> -        * as it seems a mapping failure still affects what was there?? */
>>> -       if (mapped_addr == MAP_FAILED) {
>>> -               mmap_flags = MAP_SHARED | MAP_FIXED |
>>> -                            MAP_ANONYMOUS | MAP_NORESERVE;
>>> -               mmap(start, size, PROT_NONE, mmap_flags, -1, 0);
>>> -               mprotect(start, size, PROT_NONE);
>>> +       /* first, try a normal map. If that works, we move it
>>> +        * where it should be:
>>> +        * This is because it turned out that if a mapping fails
>>> +        * on the pre-reserved virtual address space, then
>>> +        * the pre-reserved address space which we tried to map
>>> +        * onto becomes available to the kernel again! This was not
>>> +        * according to expectations: the assumption was that if a
>>> +        * mapping fails, the system should remain unchanged, but
>>> +        * this is obviously not true (at least for huge pages when
>>> +        * exhausted).
>>> +        * So the strategy is to first map at a non-reserved place
>>> +        * (which can then be freed and returned to the kernel on
>>> +        * failure) and move it to the pre-reserved space on
>>> +        * success only.
>>> +        */
>>> +       mapped_addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
>>> +                          MAP_SHARED | mmap_flags, fd, 0);
>>> +       if (mapped_addr != MAP_FAILED) {
>>> +               /* If OK, remap at the right fixed location */
>>> +               mapped_addr = mremap(mapped_addr, size, size,
>>> +                                    MREMAP_FIXED | MREMAP_MAYMOVE,
>>> +                                    start);
>>> +               if (mapped_addr == MAP_FAILED) {
>>> +                       ODP_ERR("FIXED mremap failed!\n");
>>> +               }
>>>
>>> Sadly, the call to mremap() seems to fail for huge pages! (No clue
>>> why.) So I now don't know what to do! My first approach is not
>>> thread-safe when ODP allocations are mixed with direct Linux system
>>> calls. The second approach does not seem to work for huge pages
>>> either.
>>> All this trying to work around something that really looks like a
>>> kernel bug: if a mapping fails, whatever the reason, why would Linux
>>> change the system state (i.e. the unmapping of what was there seems
>>> to persist even when the mapping of what should have replaced it
>>> fails)?
>>
>> Looking at the code, it is apparent that the mremap logic is actually
>> missing for HugeTLB, so mremap will fail in this case (one can see
>> the failure path in vma_to_resize()). A chap from Google tried to
>> address this in 2011 but it didn't get picked up:
>> https://lkml.org/lkml/2011/11/3/358
>>
>> I'm not sure why.
>>
>>> Any ideas?
>>
>> Good question :-).
>>
>> One thing I *did* see was that THP *is* supported by mremap. Also,
>> THP can work on pagecache pages hosted on shmem on recent kernels,
>> 4.8 upwards (newer being better).
>>
>> Is THP something that is workable for you? (One can limit THP to just
>> controlled areas via madvise.)
>>
>> If you want, we can have a quick chat? If so, can you please grab me
>> on Google HO tomorrow after 13:30 UTC+0? (Otherwise I'll be on
>> holiday for two weeks: 21st Jan - 5th Feb.)
>>
>> Cheers,
>> --
>> Steve
>
>
> --
> Mike Holmes
> Program Manager - Linaro Networking Group
> Linaro.org │ Open source software for ARM SoCs
> "Work should be fun and collaborative, the rest follows"
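For normal (non-huge) pages, the map-then-mremap strategy from Christophe's patch does work; a stand-alone sketch (`map_then_move` is a hypothetical name, and `fd` stands in for the allocator's file descriptor):

```c
#define _GNU_SOURCE		/* for mremap(2) and the MREMAP_* flags */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map 'fd' anywhere first and, on success only, move the mapping into
 * the pre-reserved zone at 'start' with MREMAP_FIXED (which, like
 * MAP_FIXED, replaces whatever is at the destination).  If the initial
 * mmap fails, the reserved zone is left untouched, avoiding the hole
 * problem discussed in the thread.  For hugetlbfs descriptors the
 * mremap step fails, since HugeTLB lacks mremap support here. */
static void *map_then_move(void *start, size_t size, int fd)
{
	void *tmp = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (tmp == MAP_FAILED)
		return NULL;		/* reservation untouched */
	void *fixed = mremap(tmp, size, size,
			     MREMAP_FIXED | MREMAP_MAYMOVE, start);
	if (fixed == MAP_FAILED) {
		munmap(tmp, size);	/* give back the temporary map */
		return NULL;
	}
	return fixed;
}
```

The key property is that the only operation that can fail for lack of memory (the first mmap) happens outside the reserved zone, so its failure cannot open a window where the reserved addresses are handed out to another thread.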