Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-23 Thread David Nellans
On 10/16/2017 12:42 PM, Michal Hocko wrote:
> On Mon 16-10-17 11:00:19, Christoph Lameter wrote:
>> On Mon, 16 Oct 2017, Michal Hocko wrote:
>>> That being said, the list is far from being complete, I am pretty sure
>>> more would pop out if I thought more thoroughly. The bottom line is that
>>> while I see many problems to actually implement this feature and
>>> maintain it longterm I simply do not see a large benefit outside of a
>>> very specific HW.
>> There is not much new here in terms of problems. The hardware that
>> needs this seems to become more and more plentiful. That is why we need a
>> generic implementation.
> It would really help to name that HW and other potential usecases
> independent of the HW because I am rather skeptical about the
> _plentiful_ part. And so I really do not see any foundation to claim
> the generic part. Because, fundamentally, it is the HW which requires
> the specific memory placement/physically contiguous range etc. So the
> generic implementation doesn't really make sense in such a context.
>

There are TLBs in AMD Zen that can take advantage of contig memory to
improve TLB coverage.  AFAIK contig is not functionally required, it's
purely a performance optimization.  The current Zen TLB implementation
doesn't support arbitrary contig lengths, page sizes, etc., but it's a
start.  This type of TLB optimization can be handled on the back end by
de-fragmenting phys mem (when possible) now that both base pages and THPs
can be easily migrated; there is no need for up-front contig, but defrag
isn't free either.


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-17 Thread Vlastimil Babka
On 10/17/2017 08:23 PM, Mike Kravetz wrote:
> On 10/17/2017 07:20 AM, Guy Shattah wrote:
>> 1. CMA has to be preconfigured. We're suggesting a mechanism that works
>>    'out of the box'.
>> 2. Due to its pre-allocation technique, CMA imposes a limit on the maximum
>>    allocated memory. RDMA users often require 1GB or more, sometimes much more.
>> 3. CMA reserves memory in advance; our suggestion is to use existing kernel
>>    memory mechanisms (THP for example) to allocate memory.
> 
> I would not totally rule out the use of CMA.  I like the way that it reserves
> memory, but does not prohibit use by others.  In addition, there can be
> device (or purpose) specific reservations.

I think the use case is devices that *cannot* function without
contiguous memory; typical examples IIRC are smartphone cameras on
Android, where only a single app is working with the device at a given
time, so it's OK to reserve a single area for the device, and allocation
is done by the driver. Here we are talking about allocations done by
potentially multiple userspace applications, so how do we reconcile that
with the reservations? How does a single flag identify which device's
area to use? How do we prevent one process from depleting the area for
other processes? IMHO it's another indication that a generic interface
is infeasible and it should be driver-specific.

BTW, does RDMA need a specific NUMA node to work optimally? (one closest
to the device I presume?) Will it be the job of userspace to discover
and bind itself to that node, in addition to using MAP_CONTIG? Or would
that be another thing best handled by the driver?
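
On the discovery question: for PCI devices the node is already exposed in
sysfs, so userspace can do the binding itself. A minimal sketch, assuming
libnuma and a hypothetical mlx5_0 RDMA device (nothing here is part of
the patch set):

/* Sketch: discover an RDMA device's NUMA node via the standard sysfs
 * "numa_node" attribute and prefer allocations from that node.  The
 * device name is a made-up example. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/class/infiniband/mlx5_0/device/numa_node", "r");
	int node = -1;

	if (f) {
		if (fscanf(f, "%d", &node) != 1)
			node = -1;
		fclose(f);
	}
	if (node < 0 || numa_available() < 0)
		return 0;	/* no NUMA information; keep the default policy */
	/* Bias future allocations (e.g. buffers later registered for RDMA)
	 * towards the device's node; the kernel may still fall back. */
	numa_set_preferred(node);
	printf("preferring NUMA node %d\n", node);
	return 0;
}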

> However, since reservations need to happen quite early it is often done on
> the kernel command line.  IMO, this should be avoided if possible.  There
> are interfaces for arch specific code to make reservations.  I do not know
> the system initialization sequence well enough to know if it would be
> possible for driver code to make CMA reservations.  But, it looks doubtful.
> 



Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-17 Thread Mike Kravetz
On 10/17/2017 07:20 AM, Guy Shattah wrote:
> 
> 
>> On Tue, Oct 17 2017, Guy Shattah wrote:
>>> Are you going to be OK with kernel API which implements contiguous
>>> memory allocation?  Possibly with mmap style?  Many drivers could
>>> utilize it instead of having their own weird and possibly non-standard
>>> way to allocate contiguous memory.  Such API won't be available for
>>> user space.
>>
>> What you describe sounds like CMA.  It may be far from perfect but it’s there
>> already and drivers which need contiguous memory can allocate it.
>>
> 
> 1. CMA has to be preconfigured. We're suggesting a mechanism that works
>    'out of the box'.
> 2. Due to its pre-allocation technique, CMA imposes a limit on the maximum
>    allocated memory. RDMA users often require 1GB or more, sometimes much more.
> 3. CMA reserves memory in advance; our suggestion is to use existing kernel
>    memory mechanisms (THP for example) to allocate memory.

I would not totally rule out the use of CMA.  I like the way that it reserves
memory, but does not prohibit use by others.  In addition, there can be
device (or purpose) specific reservations.

However, since reservations need to happen quite early, they are often made
on the kernel command line.  IMO, this should be avoided if possible.  There
are interfaces for arch-specific code to make reservations.  I do not know
the system initialization sequence well enough to know if it would be
possible for driver code to make CMA reservations.  But, it looks doubtful.
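
To illustrate the timing problem: a hedged sketch of the existing path,
where arch/early-boot code declares a CMA area and a driver allocates from
it much later. Signatures follow the 4.x-era mm/cma.c (cma_alloc() grew
extra parameters over time), and my_cma is purely illustrative:

/* Sketch only: CMA reservation must happen before the buddy allocator
 * takes over memblock, which is why drivers loaded later cannot do it. */
#include <linux/cma.h>
#include <linux/sizes.h>

static struct cma *my_cma;

/* From arch/early init code, before the page allocator is up. */
static int __init my_reserve_cma(void)
{
	/* 1GB, anywhere (base = 0, limit = 0), page granularity, movable. */
	return cma_declare_contiguous(0, SZ_1G, 0, 0, 0, false, &my_cma);
}

/* Much later, from a driver: the pages remain usable by others until then. */
static struct page *my_alloc_contig(size_t nr_pages)
{
	return cma_alloc(my_cma, nr_pages, 0);	/* NULL if migration fails */
}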

-- 
Mike Kravetz


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-17 Thread Vlastimil Babka
On 10/17/2017 04:20 PM, Guy Shattah wrote:
> 
> 
>> On Tue, Oct 17 2017, Guy Shattah wrote:
>>> Are you going to be OK with kernel API which implements contiguous
>>> memory allocation?  Possibly with mmap style?  Many drivers could
>>> utilize it instead of having their own weird and possibly non-standard
>>> way to allocate contiguous memory.  Such API won't be available for
>>> user space.
>>
>> What you describe sounds like CMA.  It may be far from perfect but it’s there
>> already and drivers which need contiguous memory can allocate it.
>>
> 
> 1. CMA has to be preconfigured. We're suggesting a mechanism that works
>    'out of the box'.
> 2. Due to its pre-allocation technique, CMA imposes a limit on the maximum
>    allocated memory. RDMA users often require 1GB or more, sometimes much more.
> 3. CMA reserves memory in advance; our suggestion is to use existing kernel
>    memory mechanisms (THP for example) to allocate memory.

You can already use THP, right? madvise(MADV_HUGEPAGE) increases your
chances to get the huge pages. Then you can mlock() them if you want.
And you get the TLB benefits. There's no guarantee of course, but you
shouldn't require a guarantee for MAP_CONTIG anyway, because it's for
performance reasons, not functionality. So either MAP_CONTIG would have
to fall back itself, or the userspace caller would. Or would your
scenario rather fail than perform suboptimally?
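
For completeness, a minimal sketch of that route using only existing
syscalls (nothing here is specific to the patch set):

/* Sketch: THP-backed, optionally pinned buffer without any new MAP_ flag. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;	/* 1GB, for illustration */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	/* Hint: back this range with huge pages where possible. */
	madvise(buf, len, MADV_HUGEPAGE);
	/* Optionally pin it (subject to RLIMIT_MEMLOCK); note that mlock
	 * does not prevent migration, so compaction still works. */
	if (mlock(buf, len))
		perror("mlock");
	return 0;
}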

> Guy
> 
> 



RE: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-17 Thread Guy Shattah


> On Tue, Oct 17 2017, Guy Shattah wrote:
> > Are you going to be OK with kernel API which implements contiguous
> > memory allocation?  Possibly with mmap style?  Many drivers could
> > utilize it instead of having their own weird and possibly non-standard
> > way to allocate contiguous memory.  Such API won't be available for
> > user space.
> 
> What you describe sounds like CMA.  It may be far from perfect but it’s there
> already and drivers which need contiguous memory can allocate it.
> 

1. CMA has to be preconfigured. We're suggesting a mechanism that works
   'out of the box'.
2. Due to its pre-allocation technique, CMA imposes a limit on the maximum
   allocated memory. RDMA users often require 1GB or more, sometimes much more.
3. CMA reserves memory in advance; our suggestion is to use existing kernel
   memory mechanisms (THP for example) to allocate memory.

Guy




Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-17 Thread Michal Nazarewicz
On Tue, Oct 17 2017, Guy Shattah wrote:
> Are you going to be OK with kernel API which implements contiguous
> memory allocation?  Possibly with mmap style?  Many drivers could
> utilize it instead of having their own weird and possibly non-standard
> way to allocate contiguous memory.  Such API won't be available for
> user space.

What you describe sounds like CMA.  It may be far from perfect but it’s
there already and drivers which need contiguous memory can allocate it.

-- 
Best regards
ミハウ “mina86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-17 Thread Michal Hocko
On Tue 17-10-17 10:50:02, Guy Shattah wrote:
[...]
> > Well, we can provide a generic library functions for your driver to use so 
> > that
> > you do not have to care about implementation details but I do not think
> > exposing this API to the userspace in a generic fashion is a good idea.
> > Especially when the only usecase that has been thought through so far seems
> > to be a very special HW optimization.
> 
> Are you going to be OK with kernel API which implements contiguous
> memory allocation?

We already do have alloc_contig_range. It is a dumb allocator so it is
not very suitable for short term allocations.
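
For reference, a sketch of what using that dumb allocator looks like from
a driver; signatures per ~4.13, and the caller-chosen PFN range is exactly
the hard part it leaves unsolved:

/* Sketch only: alloc_contig_range() migrates everything out of a
 * caller-chosen PFN range; picking a range that can actually be
 * emptied is the caller's problem, hence "dumb". */
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *grab_contig(unsigned long start_pfn, unsigned long nr)
{
	if (alloc_contig_range(start_pfn, start_pfn + nr,
			       MIGRATE_MOVABLE, GFP_KERNEL))
		return NULL;	/* range busy or unmovable; retry elsewhere */
	return pfn_to_page(start_pfn);
}

static void drop_contig(unsigned long start_pfn, unsigned long nr)
{
	free_contig_range(start_pfn, nr);
}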

> Possibly with mmap style?  Many drivers could utilize it instead of
> having their own weird and possibly non-standard way to allocate
> contiguous memory.  Such API won't be available for user space.

Yes, an mmap helper which performs and enforces some accounting would be a
good start.
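
A rough sketch of what such a helper might look like in a driver's ->mmap,
with a crude per-device cap standing in for the accounting; the cap value,
the plain alloc_pages() source (anything beyond MAX_ORDER would need
alloc_contig_range()/CMA instead), and the omitted cleanup on close are
all illustrative assumptions, not a proposed API:

/* Sketch only: driver mmap backed by one physically contiguous block,
 * gated by a per-device budget; fd file permissions give access control. */
#include <linux/atomic.h>
#include <linux/mm.h>

#define MYDEV_CONTIG_CAP	(256UL << 20)	/* arbitrary 256MB budget */

static atomic64_t mydev_contig_used;

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long len = vma->vm_end - vma->vm_start;
	struct page *pages;

	if (atomic64_add_return(len, &mydev_contig_used) > MYDEV_CONTIG_CAP) {
		atomic64_sub(len, &mydev_contig_used);
		return -ENOMEM;		/* over the per-device budget */
	}
	pages = alloc_pages(GFP_KERNEL | __GFP_COMP, get_order(len));
	if (!pages) {
		atomic64_sub(len, &mydev_contig_used);
		return -ENOMEM;		/* no contiguous block available */
	}
	/* Map the physically contiguous block into the caller's VMA. */
	return remap_pfn_range(vma, vma->vm_start, page_to_pfn(pages),
			       len, vma->vm_page_prot);
}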

> We can begin by implementing the kernel API and postpone the userspace
> API discussion to a future date, if that is sufficient. We might not
> have to discuss it at all.

Yeah, that was my thinking as well.
-- 
Michal Hocko
SUSE Labs


RE: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-17 Thread Guy Shattah


> > On 16/10/2017 11:24, Michal Hocko wrote:
> > > On Sun 15-10-17 10:50:29, Guy Shattah wrote:
> > > >
> > > > On 13/10/2017 19:17, Michal Hocko wrote:
> > > > > On Fri 13-10-17 10:56:13, Christoph Lameter wrote:
> > > > > > On Fri, 13 Oct 2017, Michal Hocko wrote:
> > > > > > > > There are numerous RDMA devices that would all need the
> > > > > > > > mmap implementation. And this covers only the needs of one
> > > > > > > > subsystem. There are other use cases.
> > > > > > > That doesn't prevent providing a library function which
> > > > > > > could be reused by all those drivers. Nothing really too
> > > > > > > much different from remap_pfn_range.
> > > > > > And then in all the other use cases as well. It would be much
> > > > > > easier if mmap could give you the memory you need instead of
> > > > > > having numerous drivers improvise on their own. This is in
> > > > > > particular also useful for numerous embedded use cases where you
> need contiguous memory.
> > > > > But a generic implementation would have to deal with many issues
> > > > > as already mentioned. If you make this driver specific you can
> > > > > have access control based on fd etc... I really fail to see how
> > > > > this is any different from remap_pfn_range.
> > > > Why have several driver-specific implementations if you can
> > > > generalize the idea and implement an already existing POSIX
> > > > standard?
> > > Because users shouldn't really care. We do have means to get
> > > large memory and having a guaranteed large memory is a PITA. Just
> > > look at hugetlb and all the issues it exposes. And that one is
> > > preallocated and it requires admin to do a conscious decision about
> > > the amount of the memory. You would like to establish something
> > > similar except without bounds to the size and no pre-allowed amount
> > > by an admin. This sounds just crazy to me.
> >
> > Users do care about the performance they get using devices which
> > benefit from contiguous memory allocation.  Assuming that a user
> > requires 700MB of contiguous memory, why allocate a giant (1GB)
> > page when you can allocate 700MB out of the 1GB and put the rest of
> > the 300MB back in the huge-pages/small-pages pool?
> 
> I believe I have explained that part. Large pages are under admin control and
> responsibility. If you get a free ticket to large memory to any user who can
> pin that memory then you are in serious trouble.
> 
> > > On the other hand if you make this per-device mmap implementation
> > > you can have both admin defined policy on who is allowed this memory
> > > and moreover drivers can implement their fallback strategies which
> > > best suit their needs. I really fail to see how this is any
> > > different from using specialized mmap implementations.
> > We tried doing it in the past, but the maintainer gave us a very good
> > argument:
> > " If you want to support anonymous mmaps to allocate large contiguous
> > pages work with the MM folks on providing that in a generic fashion."
> 
> Well, we can provide a generic library functions for your driver to use so 
> that
> you do not have to care about implementation details but I do not think
> exposing this API to the userspace in a generic fashion is a good idea.
> Especially when the only usecase that has been thought through so far seems
> to be a very special HW optimization.

Are you going to be OK with a kernel API which implements contiguous
memory allocation?  Possibly with an mmap style?  Many drivers could
utilize it instead of having their own weird and possibly non-standard
way to allocate contiguous memory.  Such an API won't be available to
user space.

We can begin by implementing the kernel API and postpone the userspace
API discussion to a future date, if that is sufficient. We might not
have to discuss it at all.
 

> 
> > After discussing it with people who have the same requirements as we
> > do - I totally agree with him
> >
> >
> > http://comments.gmane.org/gmane.linux.drivers.rdma/31467
> >
> > > I might be really wrong but I consider such a general purpose flag
> > > quite dangerous and future maintenance burden. At least from the
> > > hugetlb/THP history I do not see why this should be any different.
> >
> > Could you please elaborate on why it is dangerous and a future
> > maintenance burden?
> 
> Providing large contiguous memory ranges is not easy and we actually do not
> have any reliable way to offer such functionality for kernel users
> because we assume they are not that many. Basically anything larger than
> order-3 is best effort. Even constant improvements of the compaction
> code still leave us with something we cannot fully rely on. And now
> you want to 

Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-17 Thread Vlastimil Babka
On 10/16/2017 10:32 PM, Mike Kravetz wrote:
> Agree.  I only wanted to point out the similarities.
> But, it does make me wonder how much of a benefit hugetlb 1G pages would
> make in the RDMA performance comparison.  The table in the presentation
> shows an average speedup of something like 27% (or so) for contiguous
> allocations, which I assume are 2GB in size.  Certainly, using hugetlb is
> not the ideal case, just wondering if it does help and how much.

Good point. If somebody cares about performance benefits of contiguous
memory wrt device access, they would probably want also the TLB
performance benefits of huge pages.


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Mike Kravetz
On 10/16/2017 02:03 PM, Laura Abbott wrote:
> On 10/16/2017 01:32 PM, Mike Kravetz wrote:
>> On 10/16/2017 11:07 AM, Michal Hocko wrote:
>>> On Mon 16-10-17 10:43:38, Mike Kravetz wrote:
>>>> Just to be clear, the posix standard talks about a typed memory object.
>>>> The suggested implementation has one create a connection to the memory
>>>> object to receive a fd, then use mmap as usual to get a mapping backed
>>>> by contiguous pages/memory.  Of course, this type of implementation is
>>>> not a requirement.
>>>
>>> I am not sure that the POSIX standard for typed memory is easily
>>> implementable in Linux. Does any OS actually implement this API?
>>
>> A quick search only reveals BlackBerry QNX and PlayBook OS.
>>
>> Also somewhat related.  In an earlier thread someone pointed out this
>> out-of-tree module used for contiguous allocations in SoC (and other?)
>> environments.  It even has the option of making use of CMA.
>> http://processors.wiki.ti.com/index.php/CMEM_Overview
>>
> 
> If we're at the point where we're discussing CMEM, I'd like to
> point out that ion (drivers/staging/android/ion) already provides an
> ioctl interface to allocate CMA and other types of memory. It's
> mostly used for Android as the name implies. I don't pretend the
> interface is perfect but it could be useful as a discussion point
> for allocation interfaces.

Thanks Laura,

I was just pointing out other use cases where people thought contiguous
allocations were useful.  And, it was useful enough that someone actually
wrote code to make it happen.

-- 
Mike Kravetz


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Laura Abbott
On 10/16/2017 01:32 PM, Mike Kravetz wrote:
> On 10/16/2017 11:07 AM, Michal Hocko wrote:
>> On Mon 16-10-17 10:43:38, Mike Kravetz wrote:
>>> Just to be clear, the posix standard talks about a typed memory object.
>>> The suggested implementation has one create a connection to the memory
>>> object to receive a fd, then use mmap as usual to get a mapping backed
>>> by contiguous pages/memory.  Of course, this type of implementation is
>>> not a requirement.
>>
>> I am not sure that the POSIX standard for typed memory is easily
>> implementable in Linux. Does any OS actually implement this API?
> 
> A quick search only reveals Blackberry QNX and PlayBook OS.
> 
> Also somewhat related.  In an earlier thread someone pointed out this
> out-of-tree module used for contiguous allocations in SoC (and other?)
> environments.  It even has the option of making use of CMA.
> http://processors.wiki.ti.com/index.php/CMEM_Overview
> 

If we're at the point where we're discussing CMEM, I'd like to
point out that ion (drivers/staging/android/ion) already provides an
ioctl interface to allocate CMA and other types of memory. It's
mostly used for Android as the name implies. I don't pretend the
interface is perfect but it could be useful as a discussion point
for allocation interfaces.

Thanks,
Laura


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Michal Hocko
On Mon 16-10-17 13:32:45, Mike Kravetz wrote:
> On 10/16/2017 11:07 AM, Michal Hocko wrote:
[...]
> > That depends on who is actually going to use the contiguous memory. If
> > we are talking about drivers communicating with userspace, then using a
> > driver-specific fd with its mmap implementation means we do not
> > need any special fs nor a separate infrastructure. Well, except for a
> > library function to handle the MM side of the thing.
> 
> If we embed this functionality into device specific mmap calls it will
> closely tie the usage to the devices.  However, don't we still have to
> worry about potential interaction with other parts of the mm as you mention
> below?  I guess that would be the library function and how it is used
> by drivers.

Yes, those problems with pinning large amounts of contiguous memory are
simply inherent. You have to be really careful when allowing large
portions of contiguous memory to be reserved, especially if this is going
to be a very dynamic allocator. The main advantage of the per-device mmap
is that it has access control by default via file permissions. You can
simply rule the untrusted user out of the game. You can also implement
per-device usage limits. So you have some tools to keep the usage on a
leash and evaluate potential costs vs. benefits. That sounds much safer
to me than a generic API which would need tricky accounting and access
control restrictions.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Mike Kravetz
On 10/16/2017 11:07 AM, Michal Hocko wrote:
> On Mon 16-10-17 10:43:38, Mike Kravetz wrote:
>> Just to be clear, the posix standard talks about a typed memory object.
>> The suggested implementation has one create a connection to the memory
>> object to receive a fd, then use mmap as usual to get a mapping backed
>> by contiguous pages/memory.  Of course, this type of implementation is
>> not a requirement.
> 
> I am not sure that the POSIX standard for typed memory is easily
> implementable in Linux. Does any OS actually implement this API?

A quick search only reveals BlackBerry QNX and PlayBook OS.
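
For reference, the POSIX typed memory interface being discussed looks
roughly like this from userspace; a sketch per the standard (Linux does
not implement these calls, and the object name is made up):

/* Sketch of POSIX typed memory: connect to a typed memory object asking
 * for contiguous allocation, then mmap the returned fd as usual. */
#include <fcntl.h>
#include <sys/mman.h>

void *contig_map(size_t len)
{
	int fd = posix_typed_mem_open("/typed/contig", O_RDWR,
				      POSIX_TYPED_MEM_ALLOCATE_CONTIG);
	void *p;

	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return p == MAP_FAILED ? NULL : p;
}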

Also somewhat related.  In an earlier thread someone pointed out this
out-of-tree module used for contiguous allocations in SoC (and other?)
environments.  It even has the option of making use of CMA.
http://processors.wiki.ti.com/index.php/CMEM_Overview

>> However, this type of implementation looks quite a
>> bit like hugetlbfs today.
>> - Both require opening a special file/device, and then calling mmap on
>>   the returned fd.  You can technically use mmap(MAP_HUGETLB), but that
>>   still ends up using hugetlbfs.  BTW, there was resistance to adding the
>>   MAP_HUGETLB flag to mmap.
> 
> And I think we shouldn't really shape any API based on hugetlb.

Agree.  I only wanted to point out the similarities.
But, it does make me wonder how much of a benefit hugetlb 1G pages would
make in the RDMA performance comparison.  The table in the presentation
shows an average speedup of something like 27% (or so) for contiguous
allocations, which I assume are 2GB in size.  Certainly, using hugetlb is
not the ideal case, just wondering if it does help and how much.
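
A tiny sketch of the 'on demand with fallback' model described in the
quote that follows, using the existing MAP_HUGETLB flag:

/* Sketch: try huge pages on demand, fall back to base pages, the same
 * shape a MAP_CONTIG allocation would need. */
#include <sys/mman.h>

static void *alloc_buf(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p != MAP_FAILED)
		return p;	/* the hugetlb pool had room */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);	/* base pages */
	return p == MAP_FAILED ? NULL : p;
}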

>> - Allocation of contiguous memory is much like 'on demand' allocation of
>>   huge pages.  There are some (not many) users that use this model.  They
>>   attempt to allocate huge pages on demand, and if not available fall back
>>   to base pages.  This is how contiguous allocations would need to work.
>>   Of course, most hugetlbfs users pre-allocate pages for their use, and
>>   this 'might' be something useful for contiguous allocations as well.
> 
> But there is still admin configuration required to consume memory from
> the pool or overcommit that pool.
> 
>> I wonder if going down the path of a separate device/filesystem/etc for
>> contiguous allocations might be a better option.  It would keep the
>> implementation somewhat separate.  However, I would then be afraid that
>> we end up with another 'separate/special vm' as in the case of hugetlbfs
>> today.
> 
> That depends on who is actually going to use the contiguous memory. If
> we are talking about drivers communicating with userspace, then using a
> driver-specific fd with its mmap implementation means we do not
> need any special fs nor a separate infrastructure. Well, except for a
> library function to handle the MM side of the thing.

If we embed this functionality into device specific mmap calls it will
closely tie the usage to the devices.  However, don't we still have to
worry about potential interaction with other parts of the mm as you mention
below?  I guess that would be the library function and how it is used
by drivers.

-- 
Mike Kravetz

> If we really need a general-purpose physically contiguous memory allocator
> then I would agree that using a MAP_ flag might be a way to go, but that
> would require very careful consideration of who is allowed to allocate
> and how much/how large the blocks can be. I do not see a good fit for
> conveying that information to the kernel right now. Moreover, and most
> importantly, I haven't heard any sound usecase for such functionality in
> the first place. There is some hand waving about performance but there
> are no real numbers to back those claims AFAIK. Not to mention a serious
> consideration of the potential consequences for the whole MM.
> 


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Michal Hocko
On Mon 16-10-17 12:56:43, Christoph Lameter wrote:
> On Mon, 16 Oct 2017, Michal Hocko wrote:
> 
> > > We already have that issue and have ways to control that by tracking
> > > pinned and mlocked pages as well as limits on their allocations.
> >
> > Ohh, it is very different because mlock limit is really small (64kB)
> > which is not even close to what this is supposed to be about. Moreover
> > mlock doesn't prevent migration and so it doesn't prevent
> > compaction from forming higher-order allocations.
> 
> The mlock limit is configurable. There is a tracking of pinned pages as
> well.

I am not aware of any such generic tracking API. The attempt by Peter
has never been merged. So what we have right now is just ad hoc
tracking...
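
(For reference, the configurable limit referred to above is
RLIMIT_MEMLOCK; a minimal userspace sketch of checking it before pinning:)

/* Sketch: compare a buffer against RLIMIT_MEMLOCK before mlock()ing it. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

int lock_buf(void *buf, size_t len)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0 && rl.rlim_cur < len)
		fprintf(stderr, "need %zu bytes, RLIMIT_MEMLOCK is %llu\n",
			len, (unsigned long long)rl.rlim_cur);
	/* Fails with ENOMEM/EPERM over the limit (CAP_IPC_LOCK excepted). */
	return mlock(buf, len);
}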
 
> > Really, this is just too dangerous without a deep consideration of all
> > the potential consequences. The more I am thinking about this the more I
> > am convinced that this all should be driver specific mmap based thing.
> > If it turns out to be too restrictive over time and there are more
> > experiences about the usage we can consider thinking about a more
> > generic API. But starting from the generic MAP_ flag is just asking for
> > problems.
> 
> This issue is already present with the pinning of lots of memory via the
> RDMA API when in use for large gigabyte ranges.

... like in those

> There is nothing new aside
> from memory being contiguous with this approach.

which makes a hell of a difference. Once you allow pinning of larger
blocks of memory, you make the whole compaction hopelessly ineffective.

> > > There is not much new here in terms of problems. The hardware that
> > > needs this seems to become more and more plentiful. That is why we need a
> > > generic implementation.
> >
> > It would really help to name that HW and other potential usecases
> > independent of the HW because I am rather skeptical about the
> > _plentiful_ part. And so I really do not see any foundation to claim
> > the generic part. Because, fundamentally, it is the HW which requires
> > the specific memory placement/physically contiguous range etc. So the
> > generic implementation doesn't really make sense in such a context.
> 
> RDMA hardware? Storage interfaces? Look at what the RDMA subsystem
> and storage (NVME?) support.
> 
> This is not a hardware specific thing but a reflection of the general
> limitations of the existing 4k page struct scheme that limits performance
> and causes severe pressure on I/O devices.

This is something more for the storage people to comment on. I expect
(NVMe) storage to use DAX and its support for large and direct access.
Nothing really prevents RDMA HW from providing an mmap implementation
that uses contiguous pages; we already provide an API to allocate large
memory.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Michal Hocko
On Mon 16-10-17 10:43:38, Mike Kravetz wrote:
> On 10/15/2017 12:50 AM, Guy Shattah wrote:
> > On 13/10/2017 19:17, Michal Hocko wrote:
[...]
> >> But a generic implementation would have to deal with many issues as
> >> already mentioned. If you make this driver specific you can have access
> >> control based on fd etc... I really fail to see how this is any
> >> different from remap_pfn_range.
> > Why have several driver-specific implementations if you can generalize the 
> > idea and implement
> > an already existing POSIX standard?
> 
> Just to be clear, the posix standard talks about a typed memory object.
> The suggested implementation has one create a connection to the memory
> object to receive a fd, then use mmap as usual to get a mapping backed
> by contiguous pages/memory.  Of course, this type of implementation is
> not a requirement.

I am not sure that the POSIX standard for typed memory is easily
implementable in Linux. Does any OS actually implement this API?

> However, this type of implementation looks quite a
> bit like hugetlbfs today.
> - Both require opening a special file/device, and then calling mmap on
>   the returned fd.  You can technically use mmap(MAP_HUGETLB), but that
>   still ends up using hugetlbfs.  BTW, there was resistance to adding the
>   MAP_HUGETLB flag to mmap.

And I think we shouldn't really shape any API based on hugetlb.

> - Allocation of contiguous memory is much like 'on demand' allocation of
>   huge pages.  There are some (not many) users that use this model.  They
>   attempt to allocate huge pages on demand, and if not available fall back
>   to base pages.  This is how contiguous allocations would need to work.
>   Of course, most hugetlbfs users pre-allocate pages for their use, and
>   this 'might' be something useful for contiguous allocations as well.

But there is still admin configuration required to consume memory from
the pool or overcommit that pool.

> I wonder if going down the path of a separate device/filesystem/etc for
> contiguous allocations might be a better option.  It would keep the
> implementation somewhat separate.  However, I would then be afraid that
> we end up with another 'separate/special vm' as in the case of hugetlbfs
> today.

That depends on who is actually going to use the contiguous memory. If
we are talking about drivers communicating with userspace, then using a
driver-specific fd with its mmap implementation means we do not
need any special fs nor a separate infrastructure. Well, except for a
library function to handle the MM side of the thing.

If we really need a general-purpose physically contiguous memory allocator
then I would agree that using a MAP_ flag might be a way to go, but that
would require very careful consideration of who is allowed to allocate
and how much/how large the blocks can be. I do not see a good fit for
conveying that information to the kernel right now. Moreover, and most
importantly, I haven't heard any sound usecase for such functionality in
the first place. There is some hand waving about performance but there
are no real numbers to back those claims AFAIK. Not to mention a serious
consideration of the potential consequences for the whole MM.
-- 
Michal Hocko
SUSE Labs
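
[Editor's sketch] For context, this is roughly what the POSIX typed
memory interface looks like on paper. It is part of the optional Typed
Memory Objects facility; Linux does not implement it, and the object
name below is made up since the namespace is implementation defined:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

/* Sketch only: posix_typed_mem_open() is specified by POSIX but not
 * implemented on Linux.  "/typed/contig-pool" is a hypothetical name. */
static void *alloc_typed_contig(size_t len)
{
	int fd = posix_typed_mem_open("/typed/contig-pool", O_RDWR,
				      POSIX_TYPED_MEM_ALLOCATE_CONTIG);
	if (fd < 0)
		return MAP_FAILED;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}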


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Christopher Lameter
On Mon, 16 Oct 2017, Michal Hocko wrote:

> > We already have that issue and have ways to control that by tracking
> > pinned and mlocked pages as well as limits on their allocations.
>
> Ohh, it is very different because mlock limit is really small (64kB)
> which is not even close to what this is supposed to be about. Moreover
> mlock doesn't prevent from migration and so it doesn't prevent
> compaction to form higher order allocations.

The mlock limit is configurable. There is a tracking of pinned pages as
well.

> Really, this is just too dangerous without a deep consideration of all
> the potential consequences. The more I am thinking about this the more I
> am convinced that this all should be driver specific mmap based thing.
> If it turns out to be too restrictive over time and there are more
> experiences about the usage we can consider thinking about a more
> generic API. But starting from the generic MAP_ flag is just asking for
> problems.

This issue is already present with the pinning of lots of memory via the
RDMA API when in use for large gigabyte ranges. There is nothing new aside
from memory being contiguous with this approach.

> > There is not much new here in terms of problems. The hardware that
> > needs this seems to become more and more plentiful. That is why we need a
> > generic implementation.
>
> It would really help to name that HW and other potential usecases
> independent on the HW because I am rather skeptical about the
> _plentiful_ part. And so I really do not see any foundation to claim
> the generic part. Because, fundamentally, it is the HW which requires
> the specific memory placement/physically contiguous range etc. So the
> generic implementation doesn't really make sense in such a context.

RDMA hardware? Storage interfaces? Look at what the RDMA subsystem
and storage (NVME?) support.

This is not a hardware specific thing but a reflection of the general
limitations of the existing 4k page struct scheme that limits performance
and causes severe pressure on I/O devices.


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Christopher Lameter
On Mon, 16 Oct 2017, Michal Hocko wrote:

> On Mon 16-10-17 11:02:24, Cristopher Lameter wrote:
> > On Mon, 16 Oct 2017, Michal Hocko wrote:
> >
> > > > So I mmap(MAP_CONTIG) 1GB worth of working memory, prepare some data
> > > > structures there, maybe receive from network, then decide to write
> > > > some and not write some other.
> > >
> > > Why would you want this?
> >
> > Because we are receiving a 1GB block of data and then want to write it to
> > disk. Maybe we want to modify things a bit and may not write all that we
> > received.
>
> And why do you need that in a single contiguous numbers? If performance,
> do you have any numbers that would clearly tell the difference?

Again we have that in the presentation. Why keep asking the same question
if you already have the answer multiple times?

1GB of data requires roughly 250k page structs to handle if the memory is
not contiguous. This is more than most controllers can support and thus
the overhead will dominate I/O. Also the scatter/gather lists will have to
chain together huge numbers of 4k pages, which is expensive to manage.

And in practice we already have multiple gigabytes per request, which
makes it even more severe. You cannot do a "cp" operation anymore. Instead
you need special code that allocates huge pages, does direct I/O, etc.
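
[Editor's note] For reference, the arithmetic behind that figure,
assuming 4KiB base pages and one struct page per page:

1GiB / 4KiB = 2^30 / 2^12 = 2^18 = 262,144 struct pages

i.e. a quarter-million scatter/gather entries for a single 1GiB buffer,
versus one entry if the buffer is physically contiguous.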



Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Mike Kravetz
On 10/15/2017 12:50 AM, Guy Shattah wrote:
> On 13/10/2017 19:17, Michal Hocko wrote:
>> On Fri 13-10-17 10:56:13, Cristopher Lameter wrote:
>>> On Fri, 13 Oct 2017, Michal Hocko wrote:
>>>
> There is a generic posix interface that could be used for a variety of
> specific hardware dependent use cases.
 Yes you wrote that already and my counter argument was that this generic
 posix interface shouldn't bypass virtual memory abstraction.
>>> It does do that? In what way?
>> availability of the virtual address space depends on the availability of
>> the same sized contiguous physical memory range. That sounds like the
>> abstraction is gone to large part to me.
> In what way? userspace users will still be working with virtual memory.
> 
>>
> There are numerous RDMA devices that would all need the mmap
> implementation. And this covers only the needs of one subsystem. There are
> other use cases.
 That doesn't prevent providing a library function which could be reused
 by all those drivers. Nothing really too much different from
 remap_pfn_range.
>>> And then in all the other use cases as well. It would be much easier if
>>> mmap could give you the memory you need instead of having numerous drivers
>>> improvise on their own. This is in particular also useful
>>> for numerous embedded use cases where you need contiguous memory.
>> But a generic implementation would have to deal with many issues as
>> already mentioned. If you make this driver specific you can have access
>> control based on fd etc... I really fail to see how this is any
>> different from remap_pfn_range.
> Why have several driver specific implementation if you can generalize the 
> idea and implement
> an already existing POSIX standard?

Just to be clear, the POSIX standard talks about a typed memory object.
The suggested implementation has one create a connection to the memory
object to receive a fd, then use mmap as usual to get a mapping backed
by contiguous pages/memory.  Of course, this type of implementation is
not a requirement.  However, this type of implementation looks quite a
bit like hugetlbfs today.
- Both require opening a special file/device, and then calling mmap on
  the returned fd.  You can technically use mmap(MAP_HUGETLB), but that
  still ends up using hugetlbfs.  BTW, there was resistance to adding the
  MAP_HUGETLB flag to mmap.
- Allocation of contiguous memory is much like 'on demand' allocation of
  huge pages.  There are some (not many) users that use this model.  They
  attempt to allocate huge pages on demand, and if not available fall back
  to base pages.  This is how contiguous allocations would need to work.
  Of course, most hugetlbfs users pre-allocate pages for their use, and
  this 'might' be something useful for contiguous allocations as well.

I wonder if going down the path of a separate device/filesystem/etc for
contiguous allocations might be a better option.  It would keep the
implementation somewhat separate.  However, I would then be afraid that
we end up with another 'separate/special vm' as in the case of hugetlbfs
today.

-- 
Mike Kravetz
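
[Editor's sketch] A minimal sketch of the 'on demand with fallback'
model described above, using the existing MAP_HUGETLB flag (for the
first attempt to succeed, len must be a multiple of the default huge
page size and the hugetlb pool must have free pages):

#include <stddef.h>
#include <sys/mman.h>

/* Try huge pages first, fall back to base pages -- the same shape an
 * on-demand contiguous allocation with fallback would take. */
static void *alloc_buf(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p != MAP_FAILED)
		return p;
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}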


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Michal Hocko
On Mon 16-10-17 11:00:19, Cristopher Lameter wrote:
> On Mon, 16 Oct 2017, Michal Hocko wrote:
> 
> > But putting that aside. Pinning a lot of memory might cause many
> > performance issues and misbehavior. There are still kernel users
> > who need high order memory to work properly. On top of that you are
> > basically allowing an untrusted user to deplete higher order pages very
> > easily unless there is a clever way to enforce per user limit on this.
> 
> We already have that issue and have ways to control that by tracking
> pinned and mlocked pages as well as limits on their allocations.

Ohh, it is very different because the mlock limit is really small (64kB),
which is not even close to what this is supposed to be about. Moreover
mlock doesn't prevent migration and so it doesn't prevent compaction
from forming higher order allocations.

Really, this is just too dangerous without a deep consideration of all
the potential consequences. The more I am thinking about this the more I
am convinced that this all should be driver specific mmap based thing.
If it turns out to be too restrictive over time and there are more
experiences about the usage we can consider thinking about a more
generic API. But starting from the generic MAP_ flag is just asking for
problems.

> > That being said, the list is far from being complete, I am pretty sure
> > more would pop out if I thought more thoroughly. The bottom line is that
> > while I see many problems to actually implement this feature and
> > maintain it longterm I simply do not see a large benefit outside of a
> > very specific HW.
> 
> There is not much new here in terms of problems. The hardware that
> needs this seems to become more and more plentiful. That is why we need a
> generic implementation.

It would really help to name that HW and other potential usecases
independent on the HW because I am rather skeptical about the
_plentiful_ part. And so I really do not see any foundation to claim
the generic part. Because, fundamentally, it is the HW which requires
the specific memory placement/physically contiguous range etc. So the
generic implementation doesn't really make sense in such a context.

-- 
Michal Hocko
SUSE Labs
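
[Editor's note] For reference, the limit in question is per-process and
can be inspected with getrlimit():

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	/* RLIMIT_MEMLOCK caps mlock()ed memory; the common default
	 * soft limit is only 64 KiB, as noted above. */
	if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
		printf("RLIMIT_MEMLOCK soft=%llu hard=%llu\n",
		       (unsigned long long)rl.rlim_cur,
		       (unsigned long long)rl.rlim_max);
	return 0;
}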


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Michal Hocko
On Mon 16-10-17 11:02:24, Cristopher Lameter wrote:
> On Mon, 16 Oct 2017, Michal Hocko wrote:
> 
> > > So I mmap(MAP_CONTIG) 1GB worth of working memory, prepare some data
> > > structures there, maybe receive from network, then decide to write
> > > some and not write some other.
> >
> > Why would you want this?
> 
> Because we are receiving a 1GB block of data and then want to write it to
> disk. Maybe we want to modify things a bit and may not write all that we
> received.
 
And why do you need that in a single contiguous range? If performance,
do you have any numbers that would clearly tell the difference?

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Christopher Lameter
On Mon, 16 Oct 2017, Michal Hocko wrote:

> > So I mmap(MAP_CONTIG) 1GB worth of working memory, prepare some data
> > structures there, maybe receive from network, then decide to write
> > some and not write some other.
>
> Why would you want this?

Because we are receiving a 1GB block of data and then want to write it to
disk. Maybe we want to modify things a bit and may not write all that we
received.




Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Christopher Lameter
On Mon, 16 Oct 2017, Michal Hocko wrote:

> But putting that aside. Pinning a lot of memory might cause many
> performance issues and misbehavior. There are still kernel users
> who need high order memory to work properly. On top of that you are
> basically allowing an untrusted user to deplete higher order pages very
> easily unless there is a clever way to enforce per user limit on this.

We already have that issue and have ways to control that by tracking
pinned and mlocked pages as well as limits on their allocations.

> That being said, the list is far from being complete, I am pretty sure
> more would pop out if I thought more thoroughly. The bottom line is that
> while I see many problems to actually implement this feature and
> maintain it longterm I simply do not see a large benefit outside of a
> very specific HW.

There is not much new here in terms of problems. The hardware that
needs this seems to become more and more plentiful. That is why we need a
generic implementation.



Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Michal Hocko
On Mon 16-10-17 12:11:04, Guy Shattah wrote:
> 
> 
> On 16/10/2017 11:24, Michal Hocko wrote:
> > On Sun 15-10-17 10:50:29, Guy Shattah wrote:
> > > 
> > > On 13/10/2017 19:17, Michal Hocko wrote:
> > > > On Fri 13-10-17 10:56:13, Cristopher Lameter wrote:
> > > > > On Fri, 13 Oct 2017, Michal Hocko wrote:
> > > > > 
> > > > > > > There is a generic posix interface that could be used for a 
> > > > > > > variety of
> > > > > > > specific hardware dependent use cases.
> > > > > > Yes you wrote that already and my counter argument was that this 
> > > > > > generic
> > > > > > posix interface shouldn't bypass virtual memory abstraction.
> > > > > It does do that? In what way?
> > > > availability of the virtual address space depends on the availability of
> > > > the same sized contiguous physical memory range. That sounds like the
> > > > abstraction is gone to large part to me.
> > > In what way? userspace users will still be working with virtual memory.
> > So you are saying that providing an API which fails randomly because of
> > the physically fragmented memory is OK? Users shouldn't really care
> > about the state of the physical memory. That is what we have the virtual
> > memory for.
> 
> Users still see and work with virtual addresses, just as before.
> Users using the suggested API are aware that API might fail since it
> involves current system memory state. This won't be the first system
> call or the last one to fail due to reasons beyond user control. For
> example: any user app might fail due to number of open files, disk
> space, memory availability, network availability. All beyond user
> control.

But memory fragmentation is not something that maps directly to memory
usage. As such it behaves more or less randomly with respect to memory
utilization (see the difference from the examples mentioned above?). It
depends on many other things, basically rendering such an API useless
unless you guarantee that the large part of the memory is movable.

> A smart user always has their ways to handle exceptions.  A typical
> user failing to allocate contiguous memory may fall back to
> allocating non-contiguous memory. And by the way - even if each vendor
> implements their own methods to allocate contiguous memory then this
> vendor specific API might fail too.  For the same reasons.

yes, the kernel side mmap implementation would have to care about this as
well. Nobody is questioning that part. I am just questioning whether such
a general purpose API is reasonable.
 
> > > > > > > There are numerous RDMA devices that would all need the mmap
> > > > > > > implementation. And this covers only the needs of one subsystem. 
> > > > > > > There are
> > > > > > > other use cases.
> > > > > > That doesn't prevent providing a library function which could be 
> > > > > > reused
> > > > > > by all those drivers. Nothing really too much different from
> > > > > > remap_pfn_range.
> > > > > And then in all the other use cases as well. It would be much easier 
> > > > > if
> > > > > mmap could give you the memory you need instead of having numerous 
> > > > > drivers
> > > > > improvise on their own. This is in particular also useful
> > > > > for numerous embedded use cases where you need contiguous memory.
> > > > But a generic implementation would have to deal with many issues as
> > > > already mentioned. If you make this driver specific you can have access
> > > > control based on fd etc... I really fail to see how this is any
> > > > different from remap_pfn_range.
> > > Why have several driver specific implementation if you can generalize the
> > > idea and implement
> > > an already existing POSIX standard?
> > Because users shouldn't really care, really. We do have means to get
> > large memory and having a guaranteed large memory is a PITA. Just look
> > at hugetlb and all the issues it exposes. And that one is preallocated
> > and it requires admin to do a conscious decision about the amount of the
> > memory. You would like to establish something similar except without
> > bounds to the size and no pre-allowed amount by an admin. This sounds
> > just crazy to me.
> 
> Users do care about the performance they get using devices which
> benefit from contiguous memory allocation.  Assuming that a user
> requires 700MB of contiguous memory. Then why allocate a giant (1GB)
> page when you can allocate 700MB out of the 1GB and put the remaining
> 300MB back in the huge-pages/small-pages pool?

I believe I have explained that part. Large pages are under admin
control and responsibility. If you give a free ticket to large memory to
any user who can pin that memory then you are in serious trouble.
 
> > On the other hand if you make this per-device mmap implementation you
> > can have both admin defined policy on who is allowed this memory and
> > moreover drivers can implement their fallback strategies which best suit
> > their needs. I really fail to see how this is any different from using
> > specialized mmap 

Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Michal Hocko
On Mon 16-10-17 11:54:47, Pavel Machek wrote:
> On Mon 2017-10-16 10:18:04, Michal Hocko wrote:
> > On Sun 15-10-17 08:58:56, Pavel Machek wrote:
[...]
> > > So you'd suggest using ioctl() for allocating memory?
> > 
> > Why not using standard mmap on the device fd?
> 
> No, sorry, that's something very different, right? Let's say I
> have a disk, and I'd like to write to it, using contiguous memory for
> performance.
> 
> So I mmap(MAP_CONTIG) 1GB worth of working memory, prepare some data
> structures there, maybe receive from network, then decide to write
> some and not write some other.

Why would you want this?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Guy Shattah



On 16/10/2017 13:33, Michal Nazarewicz wrote:
> On Sun, Oct 15 2017, Guy Shattah wrote:
> > Why have several driver specific implementation if you can generalize
> > the idea and implement an already existing POSIX standard?
>
> Why is there a need for contiguous allocation?

This was explained in detail during a talk delivered by me and
Christopher Lameter during Plumbers conference 2017 @
https://linuxplumbersconf.org/2017/ocw/proposals/4669

Please see the slides there.

> If generalisation is the issue, then the solution is to define a common
> API where user-space can allocate memory *in the context of* a device.
> This provides a ‘give me memory I can use for this device’ request which
> is what user space really wants.

Do you suggest adding a whole new common API instead of merely adding a
flag to an existing one?




Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Michal Nazarewicz
On Sun, Oct 15 2017, Guy Shattah wrote:
> Why have several driver specific implementation if you can generalize
> the idea and implement an already existing POSIX standard?

Why is there a need for contiguous allocation?

The CPU cares only up to the point of huge pages, and there’s already an
effort in the kernel to allocate huge pages transparently without user
space being aware of it.

If not the CPU, then various devices, all of which may have very different
needs.  Some may be behind an IO MMU.  Some may support DMA.  Some may
indeed require physically contiguous memory.  How is user space to know?

Furthermore, user space does not care whether allocation is physically
contiguous or not.  What it cares about is whether given allocation can
be passed as a buffer to a particular device.

If generalisation is the issue, then the solution is to define a common
API where user-space can allocate memory *in the context of* a device.
This provides a ‘give me memory I can use for this device’ request which
is what user space really wants.

So yeah, like others in this thread, the reason for this change eludes
me.  On the other hand, I don’t care much so I’ll limit myself to this
one message.

-- 
Best regards
ミハウ “mina86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»
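
[Editor's sketch] For comparison, the 'memory in the context of a
device' request already has a natural shape with the per-device mmap
approach discussed elsewhere in this thread; /dev/mydev0 is a
hypothetical device node and placement policy is entirely up to its
driver:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

/* Ask the device's driver for memory the device can actually use;
 * whether that means contiguous, IOMMU-mapped or DMA-able memory is
 * the driver's problem, not user space's. */
static void *alloc_for_device(const char *node, size_t len)
{
	int dev = open(node, O_RDWR);

	if (dev < 0)
		return MAP_FAILED;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dev, 0);
}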


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Pavel Machek
On Mon 2017-10-16 10:18:04, Michal Hocko wrote:
> On Sun 15-10-17 08:58:56, Pavel Machek wrote:
> > Hi!
> > 
> > > Yes you wrote that already and my counter argument was that this generic
> > > posix interface shouldn't bypass virtual memory abstraction.
> > > 
> > > > > > The contiguous allocations are particularly useful for the RDMA API 
> > > > > > which
> > > > > > allows registering user space memory with devices.
> > > > >
> > > > > then make those devices expose an implementation of an mmap which does
> > > > > that. You would get both a proper access control (via fd), accounting
> > > > > and others.
> > > > 
> > > > There are numerous RDMA devices that would all need the mmap
> > > > implementation. And this covers only the needs of one subsystem. There 
> > > > are
> > > > other use cases.
> > > 
> > > That doesn't prevent providing a library function which could be reused
> > > by all those drivers. Nothing really too much different from
> > > remap_pfn_range.
> > 
> > So you'd suggest using ioctl() for allocating memory?
> 
> Why not using standard mmap on the device fd?

No, sorry, that's something very different, right? Let's say I
have a disk, and I'd like to write to it, using contiguous memory for
performance.

So I mmap(MAP_CONTIG) 1GB worth of working memory, prepare some data
structures there, maybe receive from network, then decide to write
some and not write some other.

mmap(sda) does something very different... Everything you write to
that mmap will eventually go to the disk, and you don't have complete
control when.

Also, you can do mmap(MAP_CONTIG) and use that for both disk and
network. That would not work with mmap(sda) and mmap(eth0)...

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
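
[Editor's sketch] Pavel's scenario, sketched against the RFC's proposed
flag. MAP_CONTIG exists only in this patch set, so the sketch assumes
the RFC is applied and the flag exported to userspace:

#include <stddef.h>
#include <sys/mman.h>

/* 1GiB contiguous working buffer; if fragmentation makes the
 * contiguous request fail, fall back to ordinary anonymous memory. */
static void *alloc_working_buf(void)
{
	size_t len = 1UL << 30;	/* 1GiB */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_CONTIG, -1, 0);

	if (buf == MAP_FAILED)
		buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return buf;	/* receive from network, then write parts to disk */
}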




Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Guy Shattah



On 16/10/2017 11:24, Michal Hocko wrote:
> On Sun 15-10-17 10:50:29, Guy Shattah wrote:
> > On 13/10/2017 19:17, Michal Hocko wrote:
> > > On Fri 13-10-17 10:56:13, Cristopher Lameter wrote:
> > > > On Fri, 13 Oct 2017, Michal Hocko wrote:
> > > > > > There is a generic posix interface that could be used for a variety of
> > > > > > specific hardware dependent use cases.
> > > > > Yes you wrote that already and my counter argument was that this generic
> > > > > posix interface shouldn't bypass virtual memory abstraction.
> > > > It does do that? In what way?
> > > availability of the virtual address space depends on the availability of
> > > the same sized contiguous physical memory range. That sounds like the
> > > abstraction is gone to large part to me.
> > In what way? userspace users will still be working with virtual memory.
> So you are saying that providing an API which fails randomly because of
> the physically fragmented memory is OK? Users shouldn't really care
> about the state of the physical memory. That is what we have the virtual
> memory for.

Users still see and work with virtual addresses, just as before.
Users using the suggested API are aware that the API might fail since it
involves current system memory state. This won't be the first system call
or the last one to fail due to reasons beyond user control. For example:
any user app might fail due to number of open files, disk space, memory
availability, network availability. All beyond user control.

A smart user always has their ways to handle exceptions.
A typical user failing to allocate contiguous memory may fall back to
allocating non-contiguous memory. And by the way - even if each vendor
implements their own methods to allocate contiguous memory then this
vendor specific API might fail too. For the same reasons.

> > > > > > There are numerous RDMA devices that would all need the mmap
> > > > > > implementation. And this covers only the needs of one subsystem. There are
> > > > > > other use cases.
> > > > > That doesn't prevent providing a library function which could be reused
> > > > > by all those drivers. Nothing really too much different from
> > > > > remap_pfn_range.
> > > > And then in all the other use cases as well. It would be much easier if
> > > > mmap could give you the memory you need instead of having numerous drivers
> > > > improvise on their own. This is in particular also useful
> > > > for numerous embedded use cases where you need contiguous memory.
> > > But a generic implementation would have to deal with many issues as
> > > already mentioned. If you make this driver specific you can have access
> > > control based on fd etc... I really fail to see how this is any
> > > different from remap_pfn_range.
> > Why have several driver specific implementation if you can generalize the
> > idea and implement an already existing POSIX standard?
> Because users shouldn't really care, really. We do have means to get
> large memory and having a guaranteed large memory is a PITA. Just look
> at hugetlb and all the issues it exposes. And that one is preallocated
> and it requires admin to do a conscious decision about the amount of the
> memory. You would like to establish something similar except without
> bounds to the size and no pre-allowed amount by an admin. This sounds
> just crazy to me.

Users do care about the performance they get using devices which benefit
from contiguous memory allocation.
Assuming that a user requires 700MB of contiguous memory. Then why
allocate a giant (1GB) page when you can allocate 700MB out of the 1GB
and put the remaining 300MB back in the huge-pages/small-pages pool?

> On the other hand if you make this per-device mmap implementation you
> can have both admin defined policy on who is allowed this memory and
> moreover drivers can implement their fallback strategies which best suit
> their needs. I really fail to see how this is any different from using
> specialized mmap implementations.

We tried doing it in the past, but the maintainer gave us a very good
argument:

" If you want to support anonymous mmaps to allocate large contiguous
pages work with the MM folks on providing that in a generic fashion."

After discussing it with people who have the same requirements as we do -
I totally agree with him

http://comments.gmane.org/gmane.linux.drivers.rdma/31467

> I might be really wrong but I consider such a general purpose flag quite
> dangerous and future maintenance burden. At least from the hugetlb/THP
> history I do not see why this should be any different.

Could you please elaborate why it is dangerous and a future maintenance
burden?

Thanks.




Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Michal Hocko
On Sun 15-10-17 10:50:29, Guy Shattah wrote:
> 
> 
> On 13/10/2017 19:17, Michal Hocko wrote:
> > On Fri 13-10-17 10:56:13, Cristopher Lameter wrote:
> > > On Fri, 13 Oct 2017, Michal Hocko wrote:
> > > 
> > > > > There is a generic posix interface that could be used for a variety of
> > > > > specific hardware dependent use cases.
> > > > Yes you wrote that already and my counter argument was that this generic
> > > > posix interface shouldn't bypass virtual memory abstraction.
> > > It does do that? In what way?
> > availability of the virtual address space depends on the availability of
> > the same sized contiguous physical memory range. That sounds like the
> > abstraction is gone to large part to me.
>
> In what way? userspace users will still be working with virtual memory.

So you are saying that providing an API which fails randomly because of
the physically fragmented memory is OK? Users shouldn't really care
about the state of the physical memory. That is what we have the virtual
memory for.
 
> > > > > There are numerous RDMA devices that would all need the mmap
> > > > > implementation. And this covers only the needs of one subsystem. 
> > > > > There are
> > > > > other use cases.
> > > > That doesn't prevent providing a library function which could be reused
> > > > by all those drivers. Nothing really too much different from
> > > > remap_pfn_range.
> > > And then in all the other use cases as well. It would be much easier if
> > > mmap could give you the memory you need instead of having numerous drivers
> > > improvise on their own. This is in particular also useful
> > > for numerous embedded use cases where you need contiguous memory.
> > But a generic implementation would have to deal with many issues as
> > already mentioned. If you make this driver specific you can have access
> > control based on fd etc... I really fail to see how this is any
> > different from remap_pfn_range.
> Why have several driver specific implementations if you can generalize
> the idea and implement an already existing POSIX standard?

Because users shouldn't really care. We do have means to get large
memory, and guaranteeing large contiguous memory is a PITA. Just look
at hugetlb and all the issues it exposes. And that one is preallocated
and it requires the admin to make a conscious decision about the amount
of the memory. You would like to establish something similar except
without any bound on the size and no pre-approved amount by an admin.
This sounds just crazy to me.

On the other hand if you make this per-device mmap implementation you
can have both admin defined policy on who is allowed this memory and
moreover drivers can implement their fallback strategies which best suit
their needs. I really fail to see how this is any different from using
specialized mmap implementations.

I might be really wrong but I consider such a general purpose flag quite
dangerous and a future maintenance burden. At least from the hugetlb/THP
history I do not see why this should be any different.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-16 Thread Michal Hocko
On Sun 15-10-17 08:58:56, Pavel Machek wrote:
> Hi!
> 
> > Yes you wrote that already and my counter argument was that this generic
> > posix interface shouldn't bypass virtual memory abstraction.
> > 
> > > > > The contiguous allocations are particularly useful for the RDMA API 
> > > > > which
> > > > > allows registering user space memory with devices.
> > > >
> > > > then make those devices expose an implementation of an mmap which does
> > > > that. You would get both a proper access control (via fd), accounting
> > > > and others.
> > > 
> > > There are numerous RDMA devices that would all need the mmap
> > > implementation. And this covers only the needs of one subsystem. There are
> > > other use cases.
> > 
> > That doesn't prevent providing a library function which could be reused
> > by all those drivers. Nothing really too much different from
> > remap_pfn_range.
> 
> So you'd suggest using ioctl() for allocating memory?

Why not use the standard mmap on the device fd?
 
> That sounds quite ugly to me... mmap(MAP_CONTIG) is not nice, either,
> but better than each driver inventing a custom interface...

As already pointed out elsewhere, I do not really see a difference from
remap_pfn_range from the API point of view. A driver has some
requirements on the memory, so those can be reflected in the mmap
implementation for the driver. I really do not see how that would be a
general interface without a lot of headache in the future. Contiguous
memory is a hard thing to guarantee or give out without risks.

-- 
Michal Hocko
SUSE Labs
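
To make the driver-side route concrete, a minimal sketch of a device
mmap handing out a preallocated contiguous buffer.  The names
(contig_buf, CONTIG_ORDER) and the probe-time allocation are assumed
for illustration, not taken from any existing driver:

/*
 * contig_buf is allocated once at probe time, e.g.:
 *	contig_buf = alloc_pages(GFP_KERNEL, CONTIG_ORDER);
 */
static struct page *contig_buf;

static int contig_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	/* Access control already happened at open() time, via the fd. */
	if (size > (PAGE_SIZE << CONTIG_ORDER))
		return -EINVAL;

	/* Map the physically contiguous range into the caller's VMA. */
	return remap_pfn_range(vma, vma->vm_start, page_to_pfn(contig_buf),
			       size, vma->vm_page_prot);
}

static const struct file_operations contig_fops = {
	.owner = THIS_MODULE,
	.mmap  = contig_mmap,
};

Userspace then simply mmap()s the device fd; policy, accounting and any
fallback strategy stay inside the driver.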


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-15 Thread Pavel Machek
Hi!

> Yes you wrote that already and my counter argument was that this generic
> posix interface shouldn't bypass virtual memory abstraction.
> 
> > > > The contiguous allocations are particularly useful for the RDMA API 
> > > > which
> > > > allows registering user space memory with devices.
> > >
> > > then make those devices expose an implementation of an mmap which does
> > > that. You would get both a proper access control (via fd), accounting
> > > and others.
> > 
> > There are numerous RDMA devices that would all need the mmap
> > implementation. And this covers only the needs of one subsystem. There are
> > other use cases.
> 
> That doesn't prevent providing a library function which could be reused
> by all those drivers. Nothing really too much different from
> remap_pfn_range.

So you'd suggest using ioctl() for allocating memory?

That sounds quite ugly to me... mmap(MAP_CONTIG) is not nice, either,
but better than each driver inventing a custom interface...
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


RE: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-15 Thread Guy Shattah


On 13/10/2017 19:17, Michal Hocko wrote:
> On Fri 13-10-17 10:56:13, Christopher Lameter wrote:
>> On Fri, 13 Oct 2017, Michal Hocko wrote:
>>
>>>> There is a generic posix interface that could be used for a variety
>>>> of specific hardware dependent use cases.
>>> Yes you wrote that already and my counter argument was that this 
>>> generic posix interface shouldn't bypass virtual memory abstraction.
>> It does do that? In what way?
> availability of the virtual address space depends on the availability
> of the same sized contiguous physical memory range. That sounds to me
> like the abstraction is largely gone.

In what way? userspace users will still be working with virtual memory.

>
>>>> There are numerous RDMA devices that would all need the mmap
>>>> implementation. And this covers only the needs of one subsystem.
>>>> There are other use cases.
>>> That doesn't prevent providing a library function which could be 
>>> reused by all those drivers. Nothing really too much different from 
>>> remap_pfn_range.
>> And then in all the other use cases as well. It would be much easier 
>> if mmap could give you the memory you need instead of having numerous 
>> drivers improvise on their own. This is in particular also useful for 
>> numerous embedded use cases where you need contiguous memory.
> But a generic implementation would have to deal with many issues as 
> already mentioned. If you make this driver specific you can have 
> access control based on fd etc... I really fail to see how this is any 
> different from remap_pfn_range.

Why have several driver specific implementations if you can generalize the idea 
and implement an already existing POSIX standard?
--
Guy
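
For reference, the existing POSIX standard invoked here is presumably
the typed memory API from the Advanced Realtime option:
posix_typed_mem_open() with POSIX_TYPED_MEM_ALLOCATE_CONTIG.  A sketch
of the intended usage follows; note that Linux/glibc does not implement
this call today, and the object name is implementation-defined (the one
below is made up):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

int alloc_contig_posix(size_t len, void **out)
{
	int fd = posix_typed_mem_open("/typed/contig", O_RDWR,
				      POSIX_TYPED_MEM_ALLOCATE_CONTIG);
	if (fd < 0)
		return -1;

	/* mmap on the typed-memory fd performs the contiguous allocation. */
	*out = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return *out == MAP_FAILED ? -1 : 0;
}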


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-14 Thread Michal Hocko
On Fri 13-10-17 10:56:13, Christopher Lameter wrote:
> On Fri, 13 Oct 2017, Michal Hocko wrote:
> 
> > > There is a generic posix interface that could be used for a variety of
> > > specific hardware dependent use cases.
> >
> > Yes you wrote that already and my counter argument was that this generic
> > posix interface shouldn't bypass virtual memory abstraction.
> 
> It does do that? In what way?

availability of the virtual address space depends on the availability of
the same sized contiguous physical memory range. That sounds to me like
the abstraction is largely gone.

> > > There are numerous RDMA devices that would all need the mmap
> > > implementation. And this covers only the needs of one subsystem. There are
> > > other use cases.
> >
> > That doesn't prevent providing a library function which could be reused
> > by all those drivers. Nothing really too much different from
> > remap_pfn_range.
> 
> And then in all the other use cases as well. It would be much easier if
> mmap could give you the memory you need instead of having numerous drivers
> improvise on their own. This is in particular also useful
> for numerous embedded use cases where you need contiguous memory.

But a generic implementation would have to deal with many issues as
already mentioned. If you make this driver specific you can have access
control based on fd etc... I really fail to see how this is any
different from remap_pfn_range.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-13 Thread Christopher Lameter
On Fri, 13 Oct 2017, Michal Hocko wrote:

> > There is a generic posix interface that could be used for a variety of
> > specific hardware dependent use cases.
>
> Yes you wrote that already and my counter argument was that this generic
> posix interface shouldn't bypass virtual memory abstraction.

It does do that? In what way?

> > There are numerous RDMA devices that would all need the mmap
> > implementation. And this covers only the needs of one subsystem. There are
> > other use cases.
>
> That doesn't prevent providing a library function which could be reused
> by all those drivers. Nothing really too much different from
> remap_pfn_range.

And then in all the other use cases as well. It would be much easier if
mmap could give you the memory you need instead of having numerous drivers
improvise on their own. This is in particular also useful
for numerous embedded use cases where you need contiguous memory.


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-13 Thread Michal Hocko
On Fri 13-10-17 10:42:37, Christopher Lameter wrote:
> On Fri, 13 Oct 2017, Michal Hocko wrote:
> 
> > On Fri 13-10-17 10:20:06, Christopher Lameter wrote:
> > > On Fri, 13 Oct 2017, Michal Hocko wrote:
> > [...]
> > > > I am not really convinced this is a good interface. You are basically
> > > > trying to bypass virtual memory abstraction and that is quite
> > > > contradicting the mmap API to me.
> > >
> > > This is a standardized posix interface as described in our presentation at
> > > the plumbers conference. See the presentation on contiguous allocations.
> >
> > Are you trying to design a generic interface with a very specific and
> > HW-dependent use case in mind?
> 
> There is a generic posix interface that could be used for a variety of
> specific hardware dependent use cases.

Yes you wrote that already and my counter argument was that this generic
posix interface shouldn't bypass virtual memory abstraction.

> > > The contiguous allocations are particularly useful for the RDMA API which
> > > allows registering user space memory with devices.
> >
> > then make those devices expose an implementation of an mmap which does
> > that. You would get both a proper access control (via fd), accounting
> > and others.
> 
> There are numerous RDMA devices that would all need the mmap
> implementation. And this covers only the needs of one subsystem. There are
> other use cases.

That doesn't prevent providing a library function which could be reused
by all those drivers. Nothing really too much different from
remap_pfn_range.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-13 Thread Christopher Lameter
On Fri, 13 Oct 2017, Michal Hocko wrote:

> On Fri 13-10-17 10:20:06, Christopher Lameter wrote:
> > On Fri, 13 Oct 2017, Michal Hocko wrote:
> [...]
> > > I am not really convinced this is a good interface. You are basically
> > > trying to bypass virtual memory abstraction and that is quite
> > > contradicting the mmap API to me.
> >
> > This is a standardized posix interface as described in our presentation at
> > the plumbers conference. See the presentation on contiguous allocations.
>
> Are you trying to design a generic interface with a very specific and
> HW-dependent use case in mind?

There is a generic posix interface that could be used for a variety of
specific hardware dependent use cases.

> > The contiguous allocations are particularly useful for the RDMA API which
> > allows registering user space memory with devices.
>
> then make those devices expose an implementation of an mmap which does
> that. You would get both a proper access control (via fd), accounting
> and others.

There are numerous RDMA devices that would all need the mmap
implementation. And this covers only the needs of one subsystem. There are
other use cases.





Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-13 Thread Michal Hocko
On Fri 13-10-17 10:20:06, Christopher Lameter wrote:
> On Fri, 13 Oct 2017, Michal Hocko wrote:
[...]
> > I am not really convinced this is a good interface. You are basically
> > trying to bypass virtual memory abstraction and that is quite
> > contradicting the mmap API to me.
> 
> This is a standardized posix interface as described in our presentation at
> the plumbers conference. See the presentation on contiguous allocations.

Are you trying to design a generic interface with a very specific and
HW-dependent use case in mind?
 
> The contiguous allocations are particularly useful for the RDMA API which
> allows registering user space memory with devices.

then make those devices expose an implementation of an mmap which does
that. You would get both a proper access control (via fd), accounting
and others.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-13 Thread Christopher Lameter
On Fri, 13 Oct 2017, Michal Hocko wrote:

> I would, quite the contrary, suggest a device specific mmap implementation
> which would guarantee both the best memory wrt. the physically contiguous
> aspect as well as the placement - what if the device has a restriction
> on that as well?

Contemporary high end devices can handle all of memory. If someone does
not have the requirement to get everything the hardware can give in terms
of speed, then they also won't need contiguous memory.

> > Yes, it remains contiguous.  It is locked in memory.
>
> Hmm, so hugetlb on steroids...

It's actually better because there is no requirement to allocate in
exactly 2M chunks. The remainder can be used for regular 4k page
allocations.

> > > Who is going to use such an interface? And probably many other
> > > questions...
> >
> > Thanks for asking.  I am just throwing out the idea of providing an 
> > interface
> > for doing contiguous memory allocations from user space.  There are at least
> > two (and possibly more) devices that could benefit from such an interface.
>
> I am not really convinced this is a good interface. You are basically
> trying to bypass virtual memory abstraction and that is quite
> contradicting the mmap API to me.

This is a standardized posix interface as described in our presentation at
the plumbers conference. See the presentation on contiguous allocations.

The contiguous allocations are particularly useful for the RDMA API which
allows registering user space memory with devices.



Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-13 Thread Christopher Lameter
On Thu, 12 Oct 2017, Anshuman Khandual wrote:

> > +static long __alloc_vma_contig_range(struct vm_area_struct *vma)
> > +{
> > +   gfp_t gfp = GFP_HIGHUSER | __GFP_ZERO;
>
> Would it be GFP_HIGHUSER_MOVABLE instead? Why __GFP_ZERO? If it's
> coming from Buddy, everything should have already been zeroed out
> in there. Am I missing something?

Contiguous pages cannot and should not be moved. They will no longer be
contiguous then. Also the page migration code cannot handle this case.



Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-13 Thread Michal Hocko
On Thu 12-10-17 10:19:16, Mike Kravetz wrote:
> On 10/12/2017 07:37 AM, Michal Hocko wrote:
> > On Wed 11-10-17 18:46:11, Mike Kravetz wrote:
> >> Add new MAP_CONTIG flag to mmap system call.  Check for flag in normal
> >> mmap flag processing.  If present, pre-allocate a contiguous set of
> >> pages to back the mapping.  These pages will be used at fault time, and
> >> the MAP_CONTIG flag implies populating the mapping at mmap time.
> > 
> > I have only briefly read through the previous discussion and it is still
> > not clear to me _why_ we want such an interface. I didn't give it much
> > time yet but I do not think this is a good idea at all.
> 
> Thanks for looking Michal.  The primary use case comes from devices that can
> realize performance benefits if operating on physically contiguous memory.
> What sparked this effort was Christoph and Guy's plumbers presentation
> where they showed RDMA performance benefits that could be realized with
> contiguous memory.  I also remember sitting in a presentation about
> Intel's QuickAssist technology at Vault last year.  The presenter mentioned
> that their compression engine needed to be passed a physically contiguous
> buffer.  I asked how a user could obtain such a buffer.  They said they
> had a special driver/ioctl for that.  Yuck!  I'm guessing there are other
> specific use cases.  That is why I wanted to start the discussion as to
> whether there should be an interface to provide this functionality.

I would, quite the contrary, suggest a device specific mmap implementation
which would guarantee both the best memory wrt. the physically contiguous
aspect as well as the placement - what if the device has a restriction
on that as well?
 
> > any user to simply consume larger order memory blocks? What would
> > prevent that?
> 
> We certainly would want to put restrictions in place for contiguous
> memory allocations.  Since it makes sense to pre-populate and lock
> contiguous allocations, using the same restrictions as mlock is a start.
> However, I can see the possible need for more restrictions.

Absolutely. mlock limit is per process (resp. mm) so a single user could
simply deplete large blocks. No good...
 
> > Does the memory always stay contiguous? How contiguous will it be?
> 
> Yes, it remains contiguous.  It is locked in memory.

Hmm, so hugetlb on steroids...

> > Who is going to use such an interface? And probably many other
> > questions...
> 
> Thanks for asking.  I am just throwing out the idea of providing an interface
> for doing contiguous memory allocations from user space.  There are at least
> two (and possibly more) devices that could benefit from such an interface.

I am not really convinced this is a good interface. You are basically
trying to bypass virtual memory abstraction and that is quite
contradicting the mmap API to me.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-12 Thread Mike Kravetz
On 10/12/2017 07:37 AM, Michal Hocko wrote:
> On Wed 11-10-17 18:46:11, Mike Kravetz wrote:
>> Add new MAP_CONTIG flag to mmap system call.  Check for flag in normal
>> mmap flag processing.  If present, pre-allocate a contiguous set of
>> pages to back the mapping.  These pages will be used at fault time, and
>> the MAP_CONTIG flag implies populating the mapping at mmap time.
> 
> I have only briefly read through the previous discussion and it is still
> not clear to me _why_ we want such an interface. I didn't give it much
> time yet but I do not think this is a good idea at all.

Thanks for looking Michal.  The primary use case comes from devices that can
realize performance benefits if operating on physically contiguous memory.
What sparked this effort was Christoph and Guy's plumbers presentation
where they showed RDMA performance benefits that could be realized with
contiguous memory.  I also remember sitting in a presentation about
Intel's QuickAssist technology at Vault last year.  The presenter mentioned
that their compression engine needed to be passed a physically contiguous
buffer.  I asked how a user could obtain such a buffer.  They said they
had a special driver/ioctl for that.  Yuck!  I'm guessing there are other
specific use cases.  That is why I wanted to start the discussion as to
whether there should be an interface to provide this functionality.

> Why? Do we want
> any user to simply consume larger order memory blocks? What would
> prevent that?

We certainly would want to put restrictions in place for contiguous
memory allocations.  Since it makes sense to pre-populate and lock
contiguous allocations, using the same restrictions as mlock is a start.
However, I can see the possible need for more restrictions.

> Also why should even userspace care about larger
> memory blocks? We have huge pages (be it preallocated or transparent)
> for that purpose already. Why should we add yet another type

The 'sweet spot' for the Mellanox RDMA example is 2GB.  We cannot
achieve that with huge pages (on x86) today.

>  What is the guarantee of such a mapping?

There is no guarantee.  My suggestion is that mmap(MAP_CONTIG) would fail
with ENOMEM if a sufficiently sized contiguous area could not be found.
The caller would need to deal with failure.

> Does the memory always stay contiguous? How contiguous will it be?

Yes, it remains contiguous.  It is locked in memory.

> Who is going to use such an interface? And probably many other
> questions...

Thanks for asking.  I am just throwing out the idea of providing an interface
for doing contiguous memory allocations from user space.  There are at least
two (and possibly more) devices that could benefit from such an interface.

-- 
Mike Kravetz
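
To illustrate the calling convention and the ENOMEM fallback described
above, a hypothetical userspace sketch.  MAP_CONTIG is not in any
released header; the value mirrors the RFC patch quoted below:

#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_CONTIG
#define MAP_CONTIG 0x80000	/* from the RFC patch, not a released uapi */
#endif

static void *alloc_buf(size_t len)
{
	/* Ask for physically contiguous, populated, locked memory. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_CONTIG, -1, 0);
	if (p != MAP_FAILED)
		return p;

	/* ENOMEM: no contiguous range; fall back to an ordinary mapping. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}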


Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-12 Thread Michal Hocko
On Wed 11-10-17 18:46:11, Mike Kravetz wrote:
> Add new MAP_CONTIG flag to mmap system call.  Check for flag in normal
> mmap flag processing.  If present, pre-allocate a contiguous set of
> pages to back the mapping.  These pages will be used at fault time, and
> the MAP_CONTIG flag implies populating the mapping at mmap time.

I have only briefly read through the previous discussion and it is still
not clear to me _why_ we want such an interface. I didn't give it much
time yet but I do not think this is a good idea at all. Why? Do we want
any user to simply consume larger order memory blocks? What would
prevent that? Also why should even userspace care about larger
memory blocks? We have huge pages (be it preallocated or transparent)
for that purpose already. Why should we add yet another type
of physically contiguous memory? What is the guarantee of such a mapping?
Does the memory always stay contiguous? How contiguous will it be?
Who is going to use such an interface? And probably many other
questions...

> Signed-off-by: Mike Kravetz 
> ---
>  include/uapi/asm-generic/mman.h |  1 +
>  mm/mmap.c   | 94 +
>  2 files changed, 95 insertions(+)
> 
> diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> index 7162cd4cca73..e8046b4c4ac4 100644
> --- a/include/uapi/asm-generic/mman.h
> +++ b/include/uapi/asm-generic/mman.h
> @@ -12,6 +12,7 @@
>  #define MAP_NONBLOCK 0x1 /* do not block on IO */
>  #define MAP_STACK0x2 /* give out an address that is best 
> suited for process/thread stacks */
>  #define MAP_HUGETLB  0x4 /* create a huge page mapping */
> +#define MAP_CONTIG   0x8 /* back with contiguous pages */
>  
>  /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 680506faceae..aee7917ee073 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -167,6 +167,16 @@ static struct vm_area_struct *remove_vma(struct 
> vm_area_struct *vma)
>  {
>   struct vm_area_struct *next = vma->vm_next;
>  
> + if (vma->vm_flags & VM_CONTIG) {
> + /*
> +  * Do any necessary clean up when freeing a vma backed
> +  * by a contiguous allocation.
> +  *
> +  * Not very useful in its present form.
> +  */
> + VM_BUG_ON(!vma->vm_private_data);
> + vma->vm_private_data = NULL;
> + }
>   might_sleep();
>   if (vma->vm_ops && vma->vm_ops->close)
>   vma->vm_ops->close(vma);
> @@ -1378,6 +1388,18 @@ unsigned long do_mmap(struct file *file, unsigned long 
> addr,
>   vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
>   mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
>  
> + /*
> +  * MAP_CONTIG has some restrictions,
> +  * and also implies additional mmap and vma flags.
> +  */
> + if (flags & MAP_CONTIG) {
> + if (!(flags & MAP_ANONYMOUS))
> + return -EINVAL;
> +
> + flags |= MAP_POPULATE | MAP_LOCKED;
> + vm_flags |= (VM_CONTIG | VM_LOCKED | VM_DONTEXPAND);
> + }
> +
>   if (flags & MAP_LOCKED)
>   if (!can_do_mlock())
>   return -EPERM;
> @@ -1547,6 +1569,71 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct 
> __user *, arg)
>  #endif /* __ARCH_WANT_SYS_OLD_MMAP */
>  
>  /*
> + * Attempt to allocate a contiguous range of pages to back the
> + * specified vma.  vm_private_data is used as a 'pointer' to the
> + * allocated pages.  Larger requests and more fragmented memory
> + * make the allocation more likely to fail.  So, caller must deal
> + * with this situation.
> + */
> +static long __alloc_vma_contig_range(struct vm_area_struct *vma)
> +{
> + gfp_t gfp = GFP_HIGHUSER | __GFP_ZERO;
> + unsigned long order;
> +
> + VM_BUG_ON_VMA(vma->vm_private_data != NULL, vma);
> + order = get_order(vma->vm_end - vma->vm_start);
> +
> + /*
> +  * FIXME - Incomplete implementation.  For now, just handle
> +  * allocations < MAX_ORDER in size.  However, this should really
> +  * handle arbitrary size allocations.
> +  */
> + if (order >= MAX_ORDER)
> + return -ENOMEM;
> +
> + vma->vm_private_data = alloc_pages_vma(gfp, order, vma, vma->vm_start,
> + numa_node_id(), false);
> + if (!vma->vm_private_data)
> + return -ENOMEM;
> +
> + /*
> +  * split large allocation so it can be treated as individual
> +  * pages when populating the mapping and at unmap time.
> +  */
> + if (order) {
> + unsigned long vma_pages = (vma->vm_end - vma->vm_start) /
> + PAGE_SIZE;
> + unsigned long order_pages = 1 << order;
> +

Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-12 Thread Anshuman Khandual
On 10/12/2017 07:16 AM, Mike Kravetz wrote:
> Add new MAP_CONTIG flag to mmap system call.  Check for flag in normal
> mmap flag processing.  If present, pre-allocate a contiguous set of
> pages to back the mapping.  These pages will be used at fault time, and
> the MAP_CONTIG flag implies populating the mapping at mmap time.
> 
> Signed-off-by: Mike Kravetz 
> ---
>  include/uapi/asm-generic/mman.h |  1 +
>  mm/mmap.c   | 94 +
>  2 files changed, 95 insertions(+)
> 
> diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> index 7162cd4cca73..e8046b4c4ac4 100644
> --- a/include/uapi/asm-generic/mman.h
> +++ b/include/uapi/asm-generic/mman.h
> @@ -12,6 +12,7 @@
>  #define MAP_NONBLOCK 0x1 /* do not block on IO */
>  #define MAP_STACK0x2 /* give out an address that is best 
> suited for process/thread stacks */
>  #define MAP_HUGETLB  0x4 /* create a huge page mapping */
> +#define MAP_CONTIG   0x8 /* back with contiguous pages */
>  
>  /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 680506faceae..aee7917ee073 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -167,6 +167,16 @@ static struct vm_area_struct *remove_vma(struct 
> vm_area_struct *vma)
>  {
>   struct vm_area_struct *next = vma->vm_next;
>  
> + if (vma->vm_flags & VM_CONTIG) {
> + /*
> +  * Do any necessary clean up when freeing a vma backed
> +  * by a contiguous allocation.
> +  *
> +  * Not very useful in it's present form.
> +  */
> + VM_BUG_ON(!vma->vm_private_data);
> + vma->vm_private_data = NULL;
> + }
>   might_sleep();
>   if (vma->vm_ops && vma->vm_ops->close)
>   vma->vm_ops->close(vma);
> @@ -1378,6 +1388,18 @@ unsigned long do_mmap(struct file *file, unsigned long 
> addr,
>   vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
>   mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
>  
> + /*
> +  * MAP_CONTIG has some restrictions,
> +  * and also implies additional mmap and vma flags.
> +  */
> + if (flags & MAP_CONTIG) {
> + if (!(flags & MAP_ANONYMOUS))
> + return -EINVAL;
> +
> + flags |= MAP_POPULATE | MAP_LOCKED;
> + vm_flags |= (VM_CONTIG | VM_LOCKED | VM_DONTEXPAND);
> + }
> +
>   if (flags & MAP_LOCKED)
>   if (!can_do_mlock())
>   return -EPERM;
> @@ -1547,6 +1569,71 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct 
> __user *, arg)
>  #endif /* __ARCH_WANT_SYS_OLD_MMAP */
>  
>  /*
> + * Attempt to allocate a contiguous range of pages to back the
> + * specified vma.  vm_private_data is used as a 'pointer' to the
> + * allocated pages.  Larger requests and more fragmented memory
> + * make the allocation more likely to fail.  So, caller must deal
> + * with this situation.
> + */
> +static long __alloc_vma_contig_range(struct vm_area_struct *vma)
> +{
> + gfp_t gfp = GFP_HIGHUSER | __GFP_ZERO;

Would it be GFP_HIGHUSER_MOVABLE instead ? Why __GFP_ZERO ? If its
coming from Buddy, every thing should have already been zeroed out
in there. Am I missing something ?

> + unsigned long order;
> +
> + VM_BUG_ON_VMA(vma->vm_private_data != NULL, vma);
> + order = get_order(vma->vm_end - vma->vm_start);
> +
> + /*
> +  * FIXME - Incomplete implementation.  For now, just handle
> +  * allocations < MAX_ORDER in size.  However, this should really
> +  * handle arbitrary size allocations.
> +  */
> + if (order >= MAX_ORDER)
> + return -ENOMEM;
> +
> + vma->vm_private_data = alloc_pages_vma(gfp, order, vma, vma->vm_start,
> + numa_node_id(), false);

This is where I was experimenting for requests beyond MAX_ORDER
with alloc_contig_range().

> + if (!vma->vm_private_data)
> + return -ENOMEM;
> +
> + /*
> +  * split large allocation so it can be treated as individual
> +  * pages when populating the mapping and at unmap time.
> +  */
> + if (order) {
> + unsigned long vma_pages = (vma->vm_end - vma->vm_start) /
> + PAGE_SIZE;
> + unsigned long order_pages = 1 << order;
> + unsigned long i;
> + struct page *page = vma->vm_private_data;
> +
> + split_page((struct page *)vma->vm_private_data, order);
> +
> + /*
> +  * 'order' rounds up size of vma to next power of 2.  We
> +  * will not need/use the extra pages so free them now.
> +  */
> + for (i = vma_pages; i < order_pages; i++)
> +  

Re: [RFC PATCH 3/3] mm/map_contig: Add mmap(MAP_CONTIG) support

2017-10-12 Thread Anshuman Khandual
On 10/12/2017 07:16 AM, Mike Kravetz wrote:
> Add a new MAP_CONTIG flag to the mmap system call.  Check for the flag in
> normal mmap flag processing.  If present, pre-allocate a contiguous set of
> pages to back the mapping.  These pages will be used at fault time, and
> the MAP_CONTIG flag implies populating the mapping at mmap time.
> 
> Signed-off-by: Mike Kravetz 
> ---
>  include/uapi/asm-generic/mman.h |  1 +
>  mm/mmap.c                       | 94 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 95 insertions(+)
> 
> diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> index 7162cd4cca73..e8046b4c4ac4 100644
> --- a/include/uapi/asm-generic/mman.h
> +++ b/include/uapi/asm-generic/mman.h
> @@ -12,6 +12,7 @@
>  #define MAP_NONBLOCK 0x10000 /* do not block on IO */
>  #define MAP_STACK    0x20000 /* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB  0x40000 /* create a huge page mapping */
> +#define MAP_CONTIG   0x80000 /* back with contiguous pages */
>  
>  /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 680506faceae..aee7917ee073 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -167,6 +167,16 @@ static struct vm_area_struct *remove_vma(struct 
> vm_area_struct *vma)
>  {
>   struct vm_area_struct *next = vma->vm_next;
>  
> + if (vma->vm_flags & VM_CONTIG) {
> + /*
> +  * Do any necessary clean up when freeing a vma backed
> +  * by a contiguous allocation.
> +  *
> +  * Not very useful in its present form.
> +  */
> + VM_BUG_ON(!vma->vm_private_data);
> + vma->vm_private_data = NULL;
> + }
>   might_sleep();
>   if (vma->vm_ops && vma->vm_ops->close)
>   vma->vm_ops->close(vma);
> @@ -1378,6 +1388,18 @@ unsigned long do_mmap(struct file *file, unsigned long 
> addr,
>   vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
>   mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
>  
> + /*
> +  * MAP_CONTIG has some restrictions,
> +  * and also implies additional mmap and vma flags.
> +  */
> + if (flags & MAP_CONTIG) {
> + if (!(flags & MAP_ANONYMOUS))
> + return -EINVAL;
> +
> + flags |= MAP_POPULATE | MAP_LOCKED;
> + vm_flags |= (VM_CONTIG | VM_LOCKED | VM_DONTEXPAND);
> + }
> +
>   if (flags & MAP_LOCKED)
>   if (!can_do_mlock())
>   return -EPERM;
> @@ -1547,6 +1569,71 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct 
> __user *, arg)
>  #endif /* __ARCH_WANT_SYS_OLD_MMAP */
>  
>  /*
> + * Attempt to allocate a contiguous range of pages to back the
> + * specified vma.  vm_private_data is used as a 'pointer' to the
> + * allocated pages.  Larger requests and more fragmented memory
> + * make the allocation more likely to fail.  So, caller must deal
> + * with this situation.
> + */
> +static long __alloc_vma_contig_range(struct vm_area_struct *vma)
> +{
> + gfp_t gfp = GFP_HIGHUSER | __GFP_ZERO;

Should it be GFP_HIGHUSER_MOVABLE instead? And why __GFP_ZERO? If it's
coming from the buddy allocator, everything should have already been
zeroed out there. Am I missing something?
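For reference, the variant being suggested would read as below; this is an
illustration, not part of the patch.  One caveat: the buddy allocator does
not zero pages on allocation by default, so __GFP_ZERO (or clearing at
fault time) would still be needed before exposing anonymous memory to
userspace.

    /* suggested movable variant; __GFP_ZERO kept for user-visible memory */
    gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;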

> + unsigned long order;
> +
> + VM_BUG_ON_VMA(vma->vm_private_data != NULL, vma);
> + order = get_order(vma->vm_end - vma->vm_start);
> +
> + /*
> +  * FIXME - Incomplete implementation.  For now, just handle
> +  * allocations < MAX_ORDER in size.  However, this should really
> +  * handle arbitrary size allocations.
> +  */
> + if (order >= MAX_ORDER)
> + return -ENOMEM;
> +
> + vma->vm_private_data = alloc_pages_vma(gfp, order, vma, vma->vm_start,
> + numa_node_id(), false);

This is where I was experimenting with alloc_contig_range() for
requests beyond MAX_ORDER.
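A rough sketch of that direction follows; it is not from the patch, and
find_movable_pfn_range() is a hypothetical placeholder for the zone PFN
scan a real implementation would need (along the lines of
alloc_gigantic_page() in mm/hugetlb.c):

    /*
     * Sketch only: back an allocation of nr_pages (possibly covering more
     * than a MAX_ORDER block) by migrating everything out of a movable
     * PFN window.
     */
    static struct page *contig_alloc_sketch(unsigned long nr_pages, gfp_t gfp)
    {
    	/* hypothetical helper: find a candidate [pfn, pfn + nr_pages) window */
    	unsigned long pfn = find_movable_pfn_range(nr_pages);

    	if (!pfn)
    		return NULL;
    	/* isolates, migrates, and claims the whole range, or fails cleanly */
    	if (alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE, gfp))
    		return NULL;
    	return pfn_to_page(pfn);	/* release with free_contig_range() */
    }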

> + if (!vma->vm_private_data)
> + return -ENOMEM;
> +
> + /*
> +  * split large allocation so it can be treated as individual
> +  * pages when populating the mapping and at unmap time.
> +  */
> + if (order) {
> + unsigned long vma_pages = (vma->vm_end - vma->vm_start) /
> + PAGE_SIZE;
> + unsigned long order_pages = 1 << order;
> + unsigned long i;
> + struct page *page = vma->vm_private_data;
> +
> + split_page((struct page *)vma->vm_private_data, order);
> +
> + /*
> +  * 'order' rounds up size of vma to next power of 2.  We
> +  * will not need/use the extra pages so free them now.
> +  */
> + for (i = vma_pages; i < order_pages; i++)
> + put_page(page + i);
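
For completeness, a hedged userspace sketch of how the new flag would be
exercised against a kernel with this RFC applied.  MAP_CONTIG is not in any
released headers, so its value (0x80000 per the asm-generic hunk above;
other architectures may differ) is spelled out by hand:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_CONTIG
    #define MAP_CONTIG 0x80000	/* value from the patch above */
    #endif

    int main(void)
    {
    	size_t len = 2UL * 1024 * 1024;	/* 512 pages: well under MAX_ORDER */
    	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
    		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_CONTIG, -1, 0);

    	if (p == MAP_FAILED) {	/* fragmentation can legitimately fail this */
    		perror("mmap(MAP_CONTIG)");
    		return 1;
    	}
    	/* the mapping is already populated and locked (implied flags) */
    	munmap(p, len);
    	return 0;
    }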