Re: [lng-odp] memory allocation issues

2017-01-20 Thread Maxim Uvarov
On 01/19/17 19:18, Steve Capper wrote:
> On 19 January 2017 at 13:04, Christophe Milard wrote:
>> Hi Steve,
> 
> Hey Christophe,
> 
>>
>> Maybe you remember me, as we have been in contact before: Christophe, from
>> the LNG ODP team (Mike Holmes' team).
>>
>> I have written the ODP memory allocator and I am having an issue with
>> it: It has a requirement that linux processes (we call them ODP
>> threads) have to be able to share memory between each other, as normal
>> pthreads do. (an "ODP thread" can be either a linux process or a
>> pthread)
>> The memory should be shareable (at same virtual address) even if it is
>> ODP allocated after processes have fork()'d.
>>
>> I did that the following way: as all our ODP processes are descendants
>> of a single root process (we call it the ODP instantiation process), I
>> pre-reserve a large virtual address space area in this process. This
>> is done as follows:
>>
>>  pre_reserved_zone = mmap(NULL, len, PROT_NONE,  MAP_SHARED |
>> MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
>>
>> The PROT_NONE makes sure that the memory is inaccessible,
>> hence not used.
>>
>> Later, when one of the linux processes does an odp_reserve(), in the
>> related mmap(), I want to map the real memory on some part of that
>> preallocated area, using MAP_FIXED:
>>
>> mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE, MAP_SHARED |
>> MAP_FIXED | mmap_flags, fd, 0);
>>
>> If "start" is in the pre_reserved_zone, we know it is available in all
>> processes, as the prereserved zone is inherited by all (because they
>> are all descendants of the instantiation process which did this
>> pre-reservation).
>>
>> However, I noticed that, for huge pages at least, if this call fails
>> due to a lack of huge pages, the virtual space (from start to
>> start+size), seems to be returned as available to the kernel! I
>> expected a failed call to leave the system unchanged, not to do half
>> of the job...
> 
> Unfortunately this appears to make sense due to the pluggable logic in
> the kernel. If one mmaps a location, anything in the way is first
> munmap'ed. We need to call munmap as the previous mmap may have been
> from special driver logic (remember one can supply an mmap handler for
> a driver). Likewise due to the munmap also being potentially special,
> we can't roll this back. The only safe thing we can do is leave the
> space empty if the later mmap logic fails.
> (Also it took me a while and a very strong coffee to understand this,
> so it certainly isn't obvious :-)).
> 
> 
>> This is of course a problem, since I want my pre-reserved area to
>> remain pre-reserved on failure!
>> What I did (until now) was simply redo the pre-reservation
>> (with PROT_NONE) on the specific area covered by the failed call.
>> This was OK, I thought, as concurrent accesses (from different threads) to
>> my odp_reserve() function are mutexed.
>> What I forgot is that the different threads can actually use malloc()
>> or mmap() directly:
>> If thread 1 does an odp_reserve(), fails for lack of huge pages (point A
>> in the code) and re-pre-reserves the area (point B), another thread 2
>> could be unlucky enough to do a mmap(NULL,...) between thread 1's A
>> and B, and be returned a part of my so-called preallocated address
>> space :-(.
>>
>> So I am working on another strategy: doing a first mapping outside the
>> preallocated space and, on success only, moving the resulting area
>> (using mremap) into the preallocated space.
>>
>> The patch (from the old strategy to the new one) looks as follows:
>> -   mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE,
>> -                      MAP_SHARED | MAP_FIXED | mmap_flags, fd, 0);
>> -   /* if mapping fails, re-block the space we tried to take
>> -    * as it seems a mapping failure still affects what was there?? */
>> -   if (mapped_addr == MAP_FAILED) {
>> -           mmap_flags = MAP_SHARED | MAP_FIXED |
>> -                        MAP_ANONYMOUS | MAP_NORESERVE;
>> -           mmap(start, size, PROT_NONE, mmap_flags, -1, 0);
>> -           mprotect(start, size, PROT_NONE);
>> +   /* first, try a normal map. If that works, we move it
>> +    * where it should be:
>> +    * This is because it turned out that if a mapping fails
>> +    * on the prereserved virtual address space, then
>> +    * the prereserved address space which was tried to be mapped
>> +    * on becomes available to the kernel again! This was not
>> +    * according to expectations: the assumption was that if a
>> +    * mapping fails, the system should remain unchanged, but this
>> +    * is obviously not true (at least for huge pages when
>> +    * exhausted).
>> +    * So the strategy is to first map at a non reserved place
>> +   

Re: [lng-odp] memory allocation issues

2017-01-19 Thread Christophe Milard
Thanks, Steve.

THP is not really what we want: it could create jitter among the ODP
threads, which we want to avoid.
Your answer is quite clear. I could definitely talk to you to
pick up some of your knowledge, but I actually think your answer
shows both that you understood my problem and that you could relate it
to the kernel code.

At this point I think our only hope would be to get mremap support for
huge pages in the kernel.

Christophe

On 19 January 2017 at 17:31, Mike Holmes wrote:
> Maybe the LNG Kernel team needs to pick this up as a topic?

Re: [lng-odp] memory allocation issues

2017-01-19 Thread Mike Holmes
Maybe the LNG Kernel team needs to pick this up as a topic?


Re: [lng-odp] memory allocation issues

2017-01-19 Thread Steve Capper
On 19 January 2017 at 13:04, Christophe Milard wrote:
> Hi Steve,

Hey Christophe,

>
> Maybe you remember me, as we have been in contact before: Christophe, from
> the LNG ODP team (Mike Holmes' team).
>
> I have written the ODP memory allocator and I am having an issue with
> it: It has a requirement that linux processes (we call them ODP
> threads) have to be able to share memory between each other, as normal
> pthreads do. (an "ODP thread" can be either a linux process or a
> pthread)
> The memory should be shareable (at same virtual address) even if it is
> ODP allocated after processes have fork()'d.
>
> I did that the following way: as all our ODP processes are descendants
> of a single root process (we call it the ODP instantiation process), I
> pre-reserve a large virtual address space area in this process. This
> is done as follows:
>
>  pre_reserved_zone = mmap(NULL, len, PROT_NONE,  MAP_SHARED |
> MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
>
> The PROT_NONE makes sure that the memory is inaccessible,
> hence not used.
>
> Later, when one of the linux processes does an odp_reserve(), in the
> related mmap(), I want to map the real memory on some part of that
> preallocated area, using MAP_FIXED:
>
> mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE, MAP_SHARED |
> MAP_FIXED | mmap_flags, fd, 0);
>
> If "start" is in the pre_reserved_zone, we know it is available in all
> processes, as the prereserved zone is inherited by all (because they
> are all descendants of the instantiation process which did this
> pre-reservation).
>
> However, I noticed that, for huge pages at least, if this call fails
> due to a lack of huge pages, the virtual space (from start to
> start+size), seems to be returned as available to the kernel! I
> expected a failed call to leave the system unchanged, not to do half
> of the job...

Unfortunately this appears to make sense due to the pluggable logic in
the kernel. If one mmaps a location, anything in the way is first
munmap'ed. We need to call munmap as the previous mmap may have been
from special driver logic (remember one can supply an mmap handler for
a driver). Likewise due to the munmap also being potentially special,
we can't roll this back. The only safe thing we can do is leave the
space empty if the later mmap logic fails.
(Also it took me a while and a very strong coffee to understand this,
so it certainly isn't obvious :-)).
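
One way to observe this from user space is sketched below. It is only an illustration, not ODP code: range_is_mapped() and demo_hole() are hypothetical names, the hole is simulated with munmap() rather than provoked by a real failed huge-page mmap(), and the check relies on Linux msync() returning ENOMEM for unmapped ranges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if every page of [addr, addr + len) is
 * mapped in this process, 0 if some page is not, -1 on any other error.
 * Relies on msync() failing with ENOMEM on unmapped ranges (Linux). */
static int range_is_mapped(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
	if (msync(addr, len, MS_ASYNC) == 0)
		return 1;
	return errno == ENOMEM ? 0 : -1;
}

/* Demo: reserve four pages, punch a hole in the reservation, then
 * re-block the hole. Returns 0 if every step behaved as expected. */
static int demo_hole(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	int flags = MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE;
	char *zone = mmap(NULL, 4 * page, PROT_NONE, flags, -1, 0);
	if (zone == MAP_FAILED)
		return -1;
	int before = range_is_mapped(zone, 4 * page);

	/* Simulate the side effect of a failed MAP_FIXED mmap():
	 * the old mapping at that address is gone. */
	munmap(zone + page, page);
	int hole = range_is_mapped(zone, 4 * page);

	/* Re-block the hole with a fresh PROT_NONE mapping. */
	void *again = mmap(zone + page, page, PROT_NONE,
			   flags | MAP_FIXED, -1, 0);
	int after = again != MAP_FAILED &&
		    range_is_mapped(zone, 4 * page) == 1;

	munmap(zone, 4 * page);
	return (before == 1 && hole == 0 && after) ? 0 : -1;
}
```

Calling demo_hole() from a test program should return 0. Between the munmap() and the second mmap() the address range is free, so nothing stops another thread from being handed that range in the meantime.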


> This is of course a problem, since I want my pre-reserved area to
> remain pre-reserved on failure!
> What I did (until now) was simply redo the pre-reservation
> (with PROT_NONE) on the specific area covered by the failed call.
> This was OK, I thought, as concurrent accesses (from different threads) to
> my odp_reserve() function are mutexed.
> What I forgot is that the different threads can actually use malloc()
> or mmap() directly:
> If thread 1 does an odp_reserve(), fails for lack of huge pages (point A
> in the code) and re-pre-reserves the area (point B), another thread 2
> could be unlucky enough to do a mmap(NULL,...) between thread 1's A
> and B, and be returned a part of my so-called preallocated address
> space :-(.
>
> So I am working on another strategy: doing a first mapping outside the
> preallocated space and, on success only, moving the resulting area
> (using mremap) into the preallocated space.
>
> The patch (from the old strategy to the new one) looks as follows:
> -   mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE,
> -                      MAP_SHARED | MAP_FIXED | mmap_flags, fd, 0);
> -   /* if mapping fails, re-block the space we tried to take
> -    * as it seems a mapping failure still affects what was there?? */
> -   if (mapped_addr == MAP_FAILED) {
> -           mmap_flags = MAP_SHARED | MAP_FIXED |
> -                        MAP_ANONYMOUS | MAP_NORESERVE;
> -           mmap(start, size, PROT_NONE, mmap_flags, -1, 0);
> -           mprotect(start, size, PROT_NONE);
> +   /* first, try a normal map. If that works, we move it
> +    * where it should be:
> +    * This is because it turned out that if a mapping fails
> +    * on the prereserved virtual address space, then
> +    * the prereserved address space which was tried to be mapped
> +    * on becomes available to the kernel again! This was not
> +    * according to expectations: the assumption was that if a
> +    * mapping fails, the system should remain unchanged, but this
> +    * is obviously not true (at least for huge pages when
> +    * exhausted).
> +    * So the strategy is to first map at a non reserved place
> +    * (which can then be freed and returned to the kernel on
> +    * failure) and move it to the prereserved space on
> +    * success only.
> + 

[lng-odp] memory allocation issues

2017-01-19 Thread Christophe Milard
Hi Steve,

Maybe you remember me, as we have been in contact before: Christophe, from
the LNG ODP team (Mike Holmes' team).

I have written the ODP memory allocator and I am having an issue with
it: It has a requirement that linux processes (we call them ODP
threads) have to be able to share memory between each other, as normal
pthreads do. (an "ODP thread" can be either a linux process or a
pthread)
The memory should be shareable (at same virtual address) even if it is
ODP allocated after processes have fork()'d.

I did that the following way: as all our ODP processes are descendants
of a single root process (we call it the ODP instantiation process), I
pre-reserve a large virtual address space area in this process. This
is done as follows:

 pre_reserved_zone = mmap(NULL, len, PROT_NONE,  MAP_SHARED |
MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

The PROT_NONE makes sure that the memory is inaccessible,
hence not used.
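
As a minimal, self-contained sketch of that reservation step (reserve_zone() and release_zone() are hypothetical names used only for illustration, not the actual ODP functions):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper (not an ODP function): reserve `len` bytes of
 * virtual address space without making any of it accessible.
 * Returns NULL on failure. */
static void *reserve_zone(size_t len)
{
	assert(len > 0);
	void *zone = mmap(NULL, len, PROT_NONE,
			  MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	return zone == MAP_FAILED ? NULL : zone;
}

/* Hypothetical helper: give the whole zone back to the kernel. */
static int release_zone(void *zone, size_t len)
{
	return munmap(zone, len);
}
```

Any touch of the zone before it is remapped raises SIGSEGV, and processes forked after the call inherit the same address range.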

Later, when one of the linux processes does an odp_reserve(), in the
related mmap(), I want to map the real memory on some part of that
preallocated area, using MAP_FIXED:

mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE, MAP_SHARED |
MAP_FIXED | mmap_flags, fd, 0);

If "start" is in the pre_reserved_zone, we know it is available in all
processes, as the prereserved zone is inherited by all (because they
are all descendants of the instantiation process which did this
pre-reservation).
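
The fixed mapping can be sketched as follows, under assumptions: map_in_zone() and demo_fixed_mapping() are hypothetical names, an ordinary temporary file stands in for the real shared fd, and no huge pages are involved.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `size` bytes of `fd` read/write at
 * zone + offset, replacing that part of the reservation. */
static void *map_in_zone(void *zone, size_t offset, size_t size, int fd)
{
	assert(zone != NULL && size > 0);
	void *start = (uint8_t *)zone + offset;
	void *addr = mmap(start, size, PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_FIXED, fd, 0);
	return addr == MAP_FAILED ? NULL : addr;
}

/* Demo: reserve four pages, back the second one with a temporary file,
 * and check that the mapping landed exactly where requested.
 * Returns 0 on success, -1 on any failure. */
static int demo_fixed_mapping(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	void *zone = mmap(NULL, 4 * page, PROT_NONE,
			  MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	FILE *backing = tmpfile();    /* stand-in for the real shared fd */
	if (zone == MAP_FAILED || backing == NULL)
		return -1;
	int fd = fileno(backing);
	if (ftruncate(fd, (off_t)page) != 0)
		return -1;

	char *mem = map_in_zone(zone, page, page, fd);
	if (mem == NULL || (void *)mem != (void *)((uint8_t *)zone + page))
		return -1;
	mem[0] = 'x';                 /* the page is now usable */

	char byte = 0;
	int ok = pread(fd, &byte, 1, 0) == 1 && byte == 'x';
	munmap(zone, 4 * page);       /* drops reservation and mapping */
	fclose(backing);
	return ok ? 0 : -1;
}
```

Because the address comes from the inherited zone, any other descendant process that maps the same file with the same offset into its own copy of the zone sees the data at the same virtual address.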

However, I noticed that, for huge pages at least, if this call fails
due to a lack of huge pages, the virtual space (from start to
start+size), seems to be returned as available to the kernel! I
expected a failed call to leave the system unchanged, not to do half
of the job...
This is of course a problem, since I want my pre-reserved area to
remain pre-reserved on failure!
What I did (until now) was simply redo the pre-reservation
(with PROT_NONE) on the specific area covered by the failed call.
This was OK, I thought, as concurrent accesses (from different threads) to
my odp_reserve() function are mutexed.
What I forgot is that the different threads can actually use malloc()
or mmap() directly:
If thread 1 does an odp_reserve(), fails for lack of huge pages (point A
in the code) and re-pre-reserves the area (point B), another thread 2
could be unlucky enough to do a mmap(NULL,...) between thread 1's A
and B, and be returned a part of my so-called preallocated address
space :-(.

So I am working on another strategy: doing a first mapping outside the
preallocated space and, on success only, moving the resulting area
(using mremap) into the preallocated space.
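
For ordinary (non-huge) pages, the map-then-move idea can be sketched like this. map_then_move() and demo_move() are hypothetical names, and anonymous shared memory stands in for the real fd-backed mapping; it says nothing about how huge-page mappings behave.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `size` bytes wherever the kernel likes, then
 * move the mapping onto `target` (an address inside the reserved zone).
 * Returns NULL on failure. */
static void *map_then_move(void *target, size_t size)
{
	assert(target != NULL && size > 0);
	void *tmp = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (tmp == MAP_FAILED)
		return NULL;   /* nothing in the reserved zone was touched */

	void *moved = mremap(tmp, size, size,
			     MREMAP_FIXED | MREMAP_MAYMOVE, target);
	if (moved == MAP_FAILED) {
		munmap(tmp, size);
		return NULL;
	}
	return moved;
}

/* Demo: reserve four pages and move a fresh mapping onto the second.
 * Returns 0 on success, -1 on any failure. */
static int demo_move(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *zone = mmap(NULL, 4 * page, PROT_NONE,
			  MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (zone == MAP_FAILED)
		return -1;

	char *mem = map_then_move(zone + page, page);
	int ok = mem == zone + page;
	if (ok) {
		strcpy(mem, "odp");
		ok = strcmp(mem, "odp") == 0;
	}
	munmap(zone, 4 * page);
	return ok ? 0 : -1;
}
```

The point of the detour is that a failing first mmap() happens at a kernel-chosen address, so the failure cannot punch a hole in the reservation.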

The patch (from the old strategy to the new one) looks as follows:
-   mapped_addr = mmap(start, size, PROT_READ | PROT_WRITE,
-                      MAP_SHARED | MAP_FIXED | mmap_flags, fd, 0);
-   /* if mapping fails, re-block the space we tried to take
-    * as it seems a mapping failure still affects what was there?? */
-   if (mapped_addr == MAP_FAILED) {
-           mmap_flags = MAP_SHARED | MAP_FIXED |
-                        MAP_ANONYMOUS | MAP_NORESERVE;
-           mmap(start, size, PROT_NONE, mmap_flags, -1, 0);
-           mprotect(start, size, PROT_NONE);
+   /* first, try a normal map. If that works, we move it
+    * where it should be:
+    * This is because it turned out that if a mapping fails
+    * on the prereserved virtual address space, then
+    * the prereserved address space which was tried to be mapped
+    * on becomes available to the kernel again! This was not
+    * according to expectations: the assumption was that if a
+    * mapping fails, the system should remain unchanged, but this
+    * is obviously not true (at least for huge pages when
+    * exhausted).
+    * So the strategy is to first map at a non reserved place
+    * (which can then be freed and returned to the kernel on
+    * failure) and move it to the prereserved space on
+    * success only.
+    */
+   mapped_addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+                      MAP_SHARED | mmap_flags, fd, 0);
+   if (mapped_addr != MAP_FAILED) {
+           /* If OK, remap at right fixed location */
+           mapped_addr = mremap(mapped_addr, size, size,
+                                MREMAP_FIXED | MREMAP_MAYMOVE,
+                                start);
+           if (mapped_addr == MAP_FAILED) {
+                   ODP_ERR("FIXED mremap failed!\n");
+           }

Sadly, the call to mremap() seems to fail for huge pages! (no clue why)
So I now don't know what to do!! My first approach is not thread safe
when ODP allocations are mixed with direct linux sy