Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-06-03 Thread Peter Zijlstra
On Mon, Jun 02, 2014 at 06:04:02PM +0200, David Herrmann wrote:
> Correct, the problem is not accounting for pinned-pages, but waiting
> for them to get released. Furthermore, Peter's patches make VM_PINNED
> an optional feature, so we'd still miss all the short-term GUP users.
> Sadly, that means we cannot even use it to test for pending GUP users.

Right, I'm not bothered about temporary pins, they'll go away quickly
and generally not bother reclaim and the like (much). The thing I
'worry' about is the persistent pins, which have unbounded life spans.





Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-06-02 Thread David Herrmann
Hi

On Mon, Jun 2, 2014 at 11:14 AM, Jan Kara  wrote:
> On Mon 02-06-14 13:42:59, Minchan Kim wrote:
>> On Mon, May 19, 2014 at 01:44:25PM +0200, David Herrmann wrote:
>> > I tried doing the page-replacement in the last 4 days, but honestly,
>> > it's far more complex than I thought. So if no-one more experienced
>> > with mm/ comes up with a simple implementation, I'll have to delay
>> > this for some more weeks.
>> >
>> > However, I still wonder why we try to fix this as part of this
>> > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
>> > amount of time. Same is true for network block-devices, NFS, iscsi,
>> > maybe loop-devices, ... This means, _any_ once mapped page can be
>> > written to after an arbitrary delay. This can break any feature that
>> > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
>> > sealing, ..).
>> >
>> > Shouldn't we try to fix the _cause_ of this?
>>
>> I didn't follow this patchset and couldn't find what your main concern is,
>> but at a first glance, it seems you are having trouble with pinned pages.
>> If so, it's a really big problem for CMA, and I think peterz's approach (i.e.,
>> mm_mpin) really makes sense to me.
>   Well, his concern is pinned pages (and also pages used for direct IO and
> similar) but not because they are pinned but because they can be modified
> while someone holds a reference to them. So I'm not sure Peter's patches will
> help here.

Correct, the problem is not accounting for pinned-pages, but waiting
for them to get released. Furthermore, Peter's patches make VM_PINNED
an optional feature, so we'd still miss all the short-term GUP users.
Sadly, that means we cannot even use it to test for pending GUP users.

Thanks
David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-06-02 Thread Jan Kara
On Mon 02-06-14 13:42:59, Minchan Kim wrote:
> Hello,
> 
> On Mon, May 19, 2014 at 01:44:25PM +0200, David Herrmann wrote:
> > Hi
> > 
> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins  wrote:
> > > The aspect which really worries me is this: the maintenance burden.
> > > This approach would add some peculiar new code, introducing a rare
> > > special case: which we might get right today, but will very easily
> > > forget tomorrow when making some other changes to mm.  If we compile
> > > a list of danger areas in mm, this would surely belong on that list.
> > 
> > I tried doing the page-replacement in the last 4 days, but honestly,
> > it's far more complex than I thought. So if no-one more experienced
> > with mm/ comes up with a simple implementation, I'll have to delay
> > this for some more weeks.
> > 
> > However, I still wonder why we try to fix this as part of this
> > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> > amount of time. Same is true for network block-devices, NFS, iscsi,
> > maybe loop-devices, ... This means, _any_ once mapped page can be
> > written to after an arbitrary delay. This can break any feature that
> > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> > sealing, ..).
> > 
> > Shouldn't we try to fix the _cause_ of this?
> 
> I didn't follow this patchset and couldn't find what your main concern is,
> but at a first glance, it seems you are having trouble with pinned pages.
> If so, it's a really big problem for CMA, and I think peterz's approach (i.e.,
> mm_mpin) really makes sense to me.
  Well, his concern is pinned pages (and also pages used for direct IO and
similar) but not because they are pinned but because they can be modified
while someone holds a reference to them. So I'm not sure Peter's patches will
help here.
 
> https://lkml.org/lkml/2014/5/26/340

Honza
-- 
Jan Kara 
SUSE Labs, CR



Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-06-01 Thread Minchan Kim
Hello,

On Mon, May 19, 2014 at 01:44:25PM +0200, David Herrmann wrote:
> Hi
> 
> On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins  wrote:
> > The aspect which really worries me is this: the maintenance burden.
> > This approach would add some peculiar new code, introducing a rare
> > special case: which we might get right today, but will very easily
> > forget tomorrow when making some other changes to mm.  If we compile
> > a list of danger areas in mm, this would surely belong on that list.
> 
> I tried doing the page-replacement in the last 4 days, but honestly,
> it's far more complex than I thought. So if no-one more experienced
> with mm/ comes up with a simple implementation, I'll have to delay
> this for some more weeks.
> 
> However, I still wonder why we try to fix this as part of this
> patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> amount of time. Same is true for network block-devices, NFS, iscsi,
> maybe loop-devices, ... This means, _any_ once mapped page can be
> written to after an arbitrary delay. This can break any feature that
> makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> sealing, ..).
> 
> Shouldn't we try to fix the _cause_ of this?

I didn't follow this patchset and couldn't find what your main concern is,
but at a first glance, it seems you are having trouble with pinned pages.
If so, it's a really big problem for CMA, and I think peterz's approach (i.e.,
mm_mpin) really makes sense to me.

https://lkml.org/lkml/2014/5/26/340


> 
> Isn't there a simple way to lock/mark/.. affected vmas in
> get_user_pages(_fast)() and release them once done? We could increase
> i_mmap_writable on all affected address_space and decrease it on
> release. This would at least prevent sealing and could be checked on
> other operations, too (like setting S_IMMUTABLE).
> This should be as easy as checking page_mapping(page) != NULL and then
> adjusting ->i_mmap_writable in
> get_writable_user_pages/put_writable_user_pages, right?
> 
> Thanks
> David
> 

-- 
Kind regards,
Minchan Kim



Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-05-30 Thread Hugh Dickins
On Mon, 26 May 2014, David Herrmann wrote:
> 
> (CC migrate.c committers)
> 
> On Tue, May 20, 2014 at 12:11 AM, Hugh Dickins  wrote:
> > On Mon, 19 May 2014, Jan Kara wrote:
> >> On Mon 19-05-14 13:44:25, David Herrmann wrote:
> >> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins  wrote:
> >> > > The aspect which really worries me is this: the maintenance burden.
> >> > > This approach would add some peculiar new code, introducing a rare
> >> > > special case: which we might get right today, but will very easily
> >> > > forget tomorrow when making some other changes to mm.  If we compile
> >> > > a list of danger areas in mm, this would surely belong on that list.
> >> >
> >> > I tried doing the page-replacement in the last 4 days, but honestly,
> >> > it's far more complex than I thought. So if no-one more experienced
> >
> > To be honest, I'm quite glad to hear that: it is still a solution worth
> > considering, but I'd rather continue the search for a better solution.
> 
> What if we set VM_IO for memory-mappings if a file supports sealing?
> That might be a hack and quite restrictive, but we could also add a
> VM_DONTPIN flag that just prevents any page-pinning like GUP (which is
> also a side-effect of VM_IO). This is basically what we do to protect
> PCI BARs from that race during hotplug (well, VM_PFNMAP is what
> protects those, but the code is the same). If we mention in the
> man-page that memfd-objects don't support direct-IO, we'd be fine, I
> think. Not sure if that hack is better than the page-replacement,
> though. It'd be definitely much simpler.

I admire your resourcefulness, but three things put me off using
VM_IO.

One is just a great distaste for using VM_IO in mm, when it's about
device and memory regions unknown to mm.  You can indeed get around
that objection by making a new VM_DONTPIN with similar restrictions.

Two is that I'm afraid of those restrictions which you foresee.
Soon after putting this in, we shall have people saying "I can't
do such-and-such on my memfd-object, can you please enable it"
and then we shall have to find another solution.

But three, more importantly, that gives no protection against
get_user_pages_fast() callers: as mentioned before, GUP-fast works
on page tables, not examining vmas at all, so it is blind to VM_IO.

get_user_pages_fast() does check pte access, and pte_special bit,
falling back to get_user_pages() if unsuitable: IIRC, pte_special
support is important for an architecture to support GUP-fast.
But I definitely don't want any shmem to be !vm_normal_page().

> 
> Regarding page-replacement, I tried using migrate_page(), however,
> this obviously fails in page_freeze_refs() due to the elevated
> ref-count and we cannot account for those, as they might vanish
> asynchronously. Now I wonder whether we could just add a new mode
> MIGRATE_PHASE_OUT that avoids freezing the page and forces the copy.
> Existing refs would still operate on the old page, but any new access
> gets the new page. This way, we could collect pages with elevated
> ref-counts in shmem similar to do_move_page_to_node_array() and then
> call migrate_pages(). Now migrate_pages() takes good care to prevent
> any new refs during migration. try_to_unmap(TTU_MIGRATION) marks PTEs
> as 'in-migration', so accesses are delayed. Page-faults wait on the
> page-lock and retry due to mapping==NULL. lru is disabled beforehand.
> Therefore, there cannot be any racing page-lookups as they all stall
> on the migration. Moreover, page_freeze_refs() fails only if the page
> is pinned by independent users (usually some form of I/O).
> Question is what those additional ref-counts might be. Given that
> shmem 'owns' its pages, none of these external references should pass
> those refs around. All they use it for is I/O. Therefore, we shouldn't
> even need an additional try_to_unmap() _after_ MIGRATE_PHASE_OUT as we
> expect those external refs to never pass page-refs around. If that's a
> valid assumption (and I haven't found any offenders so far), we should
> be good with migrate_pages(MIGRATE_PHASE_OUT) as I described.
> 
> Comments?

I have not given these details all the attention that they deserve.
It's Tony's copy-on-seal suggestion, fleshed out to use the migration
infrastructure; but I still feel the same way about it as I did before.

It's complexity (and complexity outside of mm/shmem.c) that I would
prefer to avoid; and I find it hard to predict the consequence of
this copy-on-write-in-reverse, where existing users are left holding
something that used to be part of the object and now is not.  Perhaps
the existing sudden-truncation codepaths would handle it appropriately,
perhaps they would not.

(I wonder if Daniel Phillips's Tux3 "page fork"ing is relevant here?
But don't want to explore that right now.  "Revoke" also comes to mind.)

I do think it may turn out to be a good future way forward, if what
we do now turns out to be too restrictive; but we don't want to turn
mm (slightly) upside down just to get sealing in.


Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-05-26 Thread David Herrmann
Hi

(CC migrate.c committers)

On Tue, May 20, 2014 at 12:11 AM, Hugh Dickins  wrote:
> On Mon, 19 May 2014, Jan Kara wrote:
>> On Mon 19-05-14 13:44:25, David Herrmann wrote:
>> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins  wrote:
>> > > The aspect which really worries me is this: the maintenance burden.
>> > > This approach would add some peculiar new code, introducing a rare
>> > > special case: which we might get right today, but will very easily
>> > > forget tomorrow when making some other changes to mm.  If we compile
>> > > a list of danger areas in mm, this would surely belong on that list.
>> >
>> > I tried doing the page-replacement in the last 4 days, but honestly,
>> > it's far more complex than I thought. So if no-one more experienced
>
> To be honest, I'm quite glad to hear that: it is still a solution worth
> considering, but I'd rather continue the search for a better solution.

What if we set VM_IO for memory-mappings if a file supports sealing?
That might be a hack and quite restrictive, but we could also add a
VM_DONTPIN flag that just prevents any page-pinning like GUP (which is
also a side-effect of VM_IO). This is basically what we do to protect
PCI BARs from that race during hotplug (well, VM_PFNMAP is what
protects those, but the code is the same). If we mention in the
man-page that memfd-objects don't support direct-IO, we'd be fine, I
think. Not sure if that hack is better than the page-replacement,
though. It'd be definitely much simpler.

Regarding page-replacement, I tried using migrate_page(), however,
this obviously fails in page_freeze_refs() due to the elevated
ref-count and we cannot account for those, as they might vanish
asynchronously. Now I wonder whether we could just add a new mode
MIGRATE_PHASE_OUT that avoids freezing the page and forces the copy.
Existing refs would still operate on the old page, but any new access
gets the new page. This way, we could collect pages with elevated
ref-counts in shmem similar to do_move_page_to_node_array() and then
call migrate_pages(). Now migrate_pages() takes good care to prevent
any new refs during migration. try_to_unmap(TTU_MIGRATION) marks PTEs
as 'in-migration', so accesses are delayed. Page-faults wait on the
page-lock and retry due to mapping==NULL. lru is disabled beforehand.
Therefore, there cannot be any racing page-lookups as they all stall
on the migration. Moreover, page_freeze_refs() fails only if the page
is pinned by independent users (usually some form of I/O).
Question is what those additional ref-counts might be. Given that
shmem 'owns' its pages, none of these external references should pass
those refs around. All they use it for is I/O. Therefore, we shouldn't
even need an additional try_to_unmap() _after_ MIGRATE_PHASE_OUT as we
expect those external refs to never pass page-refs around. If that's a
valid assumption (and I haven't found any offenders so far), we should
be good with migrate_pages(MIGRATE_PHASE_OUT) as I described.

Comments?

While skimming over migrate.c I noticed two odd behaviors:
1) migration_entry_wait() is used to wait on a migration to finish,
before accessing PTE entries. However, we call get_page() there, which
increases the ref-count of the old page and causes page_freeze_refs()
to fail. There's no way we can know how many tasks wait on a migration
entry when calling page_freeze_refs(). I have no idea how that's
supposed to work. Why don't we store the new page in the migration-swp
entry so any lookups stall on the new page? We don't care for
ref-counts on that page and if the migration fails, new->mapping is
set to NULL and any lookup is retried. remove_migration_pte() can
restore the old page correctly.
2) remove_migration_pte() calls get_page(new) before writing the PTE.
But who releases the ref of the old page?

Thanks
David


Re: [PATCH v2 0/3] File Sealing memfd_create()

2014-05-26 Thread David Herrmann
Hi

(CC migrate.c committers)

On Tue, May 20, 2014 at 12:11 AM, Hugh Dickins hu...@google.com wrote:
 On Mon, 19 May 2014, Jan Kara wrote:
 On Mon 19-05-14 13:44:25, David Herrmann wrote:
  On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins hu...@google.com wrote:
   The aspect which really worries me is this: the maintenance burden.
   This approach would add some peculiar new code, introducing a rare
   special case: which we might get right today, but will very easily
   forget tomorrow when making some other changes to mm.  If we compile
   a list of danger areas in mm, this would surely belong on that list.
 
  I tried doing the page-replacement in the last 4 days, but honestly,
  it's far more complex than I thought. So if no-one more experienced

 To be honest, I'm quite glad to hear that: it is still a solution worth
 considering, but I'd rather continue the search for a better solution.

What if we set VM_IO for memory-mappings if a file supports sealing?
That might be a hack and quite restrictive, but we could also add a
VM_DONTPIN flag that just prevents any page-pinning like GUP (which is
also a side-effect of VM_IO). This is basically what we do to protect
PCI BARs from that race during hotplug (well, VM_PFNMAP ist what
protects those, but the code is the same). If we mention in the
man-page that memfd-objects don't support direct-IO, we'd be fine, I
think. Not sure if that hack is better than the page-replacement,
though. It'd be definitely much simpler.

Regarding page-replacement, I tried using migrate_page(), however,
this obviously fails in page_freeze_refs() due to the elevated
ref-count and we cannot account for those, as they might vanish
asynchronously. Now I wonder whether we could just add a new mode
MIGRATE_PHASE_OUT that avoids freezing the page and forces the copy.
Existing refs would still operate on the old page, but any new access
gets the new page. This way, we could collect pages with elevated
ref-counts in shmem similar to do_move_page_to_node_array() and then
call migrate_pages(). Now migrate_pages() takes good care to prevent
any new refs during migration. try_to_unmap(TTU_MIGRATION) marks PTEs
as 'in-migration', so accesses are delayed. Page-faults wait on the
page-lock and retry due to mapping==NULL. lru is disabled beforehand.
Therefore, there cannot be any racing page-lookups as they all stall
on the migration. Moreover, page_freeze_refs() fails only if the page
is pinned by independent users (usually some form of I/O).
Question is what those additional ref-counts might be. Given that
shmem 'owns' its pages, none of these external references should pass
those refs around. All they use it for is I/O. Therefore, we shouldn't
even need an additional try_to_unmap() _after_ MIGRATE_PHASE_OUT as we
expect those external refs to never pass page-refs around. If that's a
valid assumption (and I haven't found any offenders so far), we should
be good with migrate_pages(MIGRATE_PHASE_OUT) as I described.

Comments?

While skimming over migrate.c I noticed two odd behaviors:
1) migration_entry_wait() is used to wait on a migration to finish,
before accessing PTE entries. However, we call get_page() there, which
increases the ref-count of the old page and causes page_freeze_refs()
to fail. There's no way we can know how many tasks wait on a migration
entry when calling page_freeze_refs(). I have no idea how that's
supposed to work? Why don't we store the new page in the migration-swp
entry so any lookups stall on the new page? We don't care for
ref-counts on that page and if the migration fails, new-mapping is
set to NULL and any lookup is retried. remove_migration_pte() can
restore the old page correctly.
2) remove_migration_pte() calls get_page(new) before writing the PTE.
But who releases the ref of the old page?

Thanks
David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-05-19 Thread Hugh Dickins
On Mon, 19 May 2014, Jan Kara wrote:
> On Mon 19-05-14 13:44:25, David Herrmann wrote:
> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins hu...@google.com wrote:
> > > The aspect which really worries me is this: the maintenance burden.
> > > This approach would add some peculiar new code, introducing a rare
> > > special case: which we might get right today, but will very easily
> > > forget tomorrow when making some other changes to mm.  If we compile
> > > a list of danger areas in mm, this would surely belong on that list.
> > 
> > I tried doing the page-replacement in the last 4 days, but honestly,
> > it's far more complex than I thought. So if no-one more experienced

To be honest, I'm quite glad to hear that: it is still a solution worth
considering, but I'd rather continue the search for a better solution.

> > with mm/ comes up with a simple implementation, I'll have to delay
> > this for some more weeks.
> > 
> > However, I still wonder why we try to fix this as part of this
> > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> > amount of time. Same is true for network block-devices, NFS, iscsi,
> > maybe loop-devices, ... This means, _any_ once mapped page can be
> > written to after an arbitrary delay. This can break any feature that
> > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> > sealing, ..).

We need to fix it together with your sealing patchset, because your
patchset is all about introducing a new kind of guarantee: a guarantee
which this async i/o issue makes impossible to give, as things stand.

Exasperating for you, I understand; but that's how it is.
A new feature may make new demands on the infrastructure.

I can imagine existing problems, but (I may be out of touch) I have
not heard of them as problems in practice.  Certainly they would not
be recent regressions: mm-page versus fs-file has worked in this way
for as long as I've known them (pages released independently of
unmapping the file, with the understanding that i/o might still
be in progress, so care taken not to free the pages too soon).

> > 
> > Shouldn't we try to fix the _cause_ of this?

Nobody is against fixing the cause: we are all looking for the
simplest way of doing so,

> > 
> > Isn't there a simple way to lock/mark/.. affected vmas in
> > get_user_pages(_fast)() and release them once done? We could increase
> > i_mmap_writable on all affected address_space and decrease it on
> > release. This would at least prevent sealing and could be checked on
> > other operations, too (like setting S_IMMUTABLE).
> > This should be as easy as checking page_mapping(page) != NULL and then
> > adjusting ->i_mmap_writable in
> > get_writable_user_pages/put_writable_user_pages, right?
>   Doing this would be quite a bit of work. Currently references returned by
> get_user_pages() are page references like any other and thus are released
> by put_page() or similar. Now you would make them special and they need
> special releasing and there are lots of places in kernel where
> get_user_pages() is used that would need changing.

Lots of places that would need changing, yes; but we have often
wondered in the past whether there should be a put_user_pages().
Though I'm not sure that it would actually solve anything...

> 
> Another aspect is that it could have performance implications - if there
> are several processes using get_user_pages[_fast]() on a file, they would
> start contending on modifying i_mmap_writeable.

Doing extra vma work in get_user_pages() wouldn't be so bad.  But doing
any vma work in get_user_pages_fast() would upset almost all its users:
get_user_pages_fast() is a fast-path which expressly avoids the vmas,
and hates additional cachelines being added to its machinations.

If sealing had appeared before get_user_pages_fast(), maybe we wouldn't
have let get_user_pages_fast() in; but now it's the other way around.

I would be more interested in attacking from the get_user_pages() and
get_user_pages_fast() end, if I could convince myself that they do
actually delimit the problem; maybe they do, but I'm not yet convinced.

> 
> One somewhat crazy idea I have is that maybe we could delay unmapping of a
> page if this was last VMA referencing it until all extra page references of
> pages in there are dropped. That would make i_mmap_writeable reliable for
> you and it would also close those races with remount. Hugh, do you think
> this might be viable?

It is definitely worth pursuing further, but I'm not very hopeful on it.
In a world of free page flags and free struct page fields, maybe.  (And
I don't see sealing as a feature sensibly restricted to 64-bit only.)

I think we would have to set a page flag, maybe bump a count, for every
leftover page that raises i_mmap_writable; and lower it (potentially from
interrupt context) at put_page() time.  Easy to make i_mmap_writable an
atomic rather than guarded by i_mmap_mutex, but we still need to
synchronize on it falling to 0.

And how would we recognize the relevant, decrementing, put_page()?
page_count 

Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-05-19 Thread Jan Kara
On Mon 19-05-14 13:44:25, David Herrmann wrote:
> Hi
> 
> On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins hu...@google.com wrote:
> > The aspect which really worries me is this: the maintenance burden.
> > This approach would add some peculiar new code, introducing a rare
> > special case: which we might get right today, but will very easily
> > forget tomorrow when making some other changes to mm.  If we compile
> > a list of danger areas in mm, this would surely belong on that list.
> 
> I tried doing the page-replacement in the last 4 days, but honestly,
> it's far more complex than I thought. So if no-one more experienced
> with mm/ comes up with a simple implementation, I'll have to delay
> this for some more weeks.
> 
> However, I still wonder why we try to fix this as part of this
> patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> amount of time. Same is true for network block-devices, NFS, iscsi,
> maybe loop-devices, ... This means, _any_ once mapped page can be
> written to after an arbitrary delay. This can break any feature that
> makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> sealing, ..).
> 
> Shouldn't we try to fix the _cause_ of this?
> 
> Isn't there a simple way to lock/mark/.. affected vmas in
> get_user_pages(_fast)() and release them once done? We could increase
> i_mmap_writable on all affected address_space and decrease it on
> release. This would at least prevent sealing and could be checked on
> other operations, too (like setting S_IMMUTABLE).
> This should be as easy as checking page_mapping(page) != NULL and then
> adjusting ->i_mmap_writable in
> get_writable_user_pages/put_writable_user_pages, right?
  Doing this would be quite a bit of work. Currently references returned by
get_user_pages() are page references like any other and thus are released
by put_page() or similar. Now you would make them special and they need
special releasing and there are lots of places in kernel where
get_user_pages() is used that would need changing.

Another aspect is that it could have performance implications - if there
are several processes using get_user_pages[_fast]() on a file, they would
start contending on modifying i_mmap_writeable.

One somewhat crazy idea I have is that maybe we could delay unmapping of a
page if this was last VMA referencing it until all extra page references of
pages in there are dropped. That would make i_mmap_writeable reliable for
you and it would also close those races with remount. Hugh, do you think
this might be viable?

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR


Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-05-19 Thread David Herrmann
Hi

On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins hu...@google.com wrote:
> The aspect which really worries me is this: the maintenance burden.
> This approach would add some peculiar new code, introducing a rare
> special case: which we might get right today, but will very easily
> forget tomorrow when making some other changes to mm.  If we compile
> a list of danger areas in mm, this would surely belong on that list.

I tried doing the page-replacement in the last 4 days, but honestly,
it's far more complex than I thought. So if no-one more experienced
with mm/ comes up with a simple implementation, I'll have to delay
this for some more weeks.

However, I still wonder why we try to fix this as part of this
patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
amount of time. Same is true for network block-devices, NFS, iscsi,
maybe loop-devices, ... This means, _any_ once mapped page can be
written to after an arbitrary delay. This can break any feature that
makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
sealing, ..).

Shouldn't we try to fix the _cause_ of this?

Isn't there a simple way to lock/mark/.. affected vmas in
get_user_pages(_fast)() and release them once done? We could increase
i_mmap_writable on all affected address_space and decrease it on
release. This would at least prevent sealing and could be checked on
other operations, too (like setting S_IMMUTABLE).
This should be as easy as checking page_mapping(page) != NULL and then
adjusting ->i_mmap_writable in
get_writable_user_pages/put_writable_user_pages, right?

Thanks
David


Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-05-14 Thread Hugh Dickins
On Wed, 14 May 2014, Tony Battersby wrote:
> Hugh Dickins wrote:
> > Checking page counts in a GB file prior to sealing does not appeal at
> > all: we'd be lucky ever to find them all accounted for.
> 
> Here is a refinement of that idea: during a seal operation, iterate over
> all the pages in the file and check their refcounts.  On any page that
> has an unexpected extra reference, allocate a new page, copy the data
> over to the new page, and then replace the page having the extra
> reference with the newly-allocated page in the file.  That way you still
> get zero-copy on pages that don't have extra references, and you don't
> have to fail the seal operation if some of the pages are still being
> referenced by something else.

That does seem a more promising idea than any that I'd had: thank you.

But whether it can actually be made to work (safely) is not yet clear
to me.

It would be rather like page migration; but whereas page migration
backs off whenever the page count cannot be fully accounted for
(as does KSM), that is precisely when this would have to act.

Taking action in the case of ignorance does not make me feel very
comfortable.  Page lock and radix tree lock would guard against
many surprises, but not necessarily all.

> 
> The downside of course is the extra memory usage and memcpy overhead if
> something is holding extra references to the pages.  So whether this is
> a good approach depends on:
> 
> *) Whether extra page references would happen frequently or infrequently
> under various kernel configurations and usage scenarios.  I don't know
> enough about the mm system to answer this myself.
> 
> *) Whether or not the extra memory usage and memcpy overhead could be
> considered a DoS attack vector by someone who has found a way to add
> extra references to the pages intentionally.

I may just be too naive on such issues, but neither of those worries
me particularly.  If something can already add an extra pin to many
pages, that is already a concern for memory usage.  The sealing case
would double its scale, but I don't see that as a new issue.

The aspect which really worries me is this: the maintenance burden.
This approach would add some peculiar new code, introducing a rare
special case: which we might get right today, but will very easily
forget tomorrow when making some other changes to mm.  If we compile
a list of danger areas in mm, this would surely belong on that list.

Hugh


Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-05-14 Thread Tony Battersby
Hugh Dickins wrote:
> Checking page counts in a GB file prior to sealing does not appeal at
> all: we'd be lucky ever to find them all accounted for.

Here is a refinement of that idea: during a seal operation, iterate over
all the pages in the file and check their refcounts.  On any page that
has an unexpected extra reference, allocate a new page, copy the data
over to the new page, and then replace the page having the extra
reference with the newly-allocated page in the file.  That way you still
get zero-copy on pages that don't have extra references, and you don't
have to fail the seal operation if some of the pages are still being
referenced by something else.

The downside of course is the extra memory usage and memcpy overhead if
something is holding extra references to the pages.  So whether this is
a good approach depends on:

*) Whether extra page references would happen frequently or infrequently
under various kernel configurations and usage scenarios.  I don't know
enough about the mm system to answer this myself.

*) Whether or not the extra memory usage and memcpy overhead could be
considered a DoS attack vector by someone who has found a way to add
extra references to the pages intentionally.

Tony Battersby


Re: [PATCH v2 0/3] File Sealing & memfd_create()

2014-05-13 Thread Hugh Dickins
On Tue, 15 Apr 2014, David Herrmann wrote:
> 
> This is v2 of the File-Sealing and memfd_create() patches. You can find v1
> with a longer introduction at gmane:
>   http://thread.gmane.org/gmane.comp.video.dri.devel/102241
> An LWN article about memfd+sealing is available, too:
>   https://lwn.net/Articles/593918/

Sorry it's taken so long: at last I managed to set aside a few hours at
the weekend, to read through your memfd+sealing work and let it sink in.

Good stuff.  I've a page of notes which I shall respond with, either
later in the week or at the weekend; but they're pretty much trivia, or
notes to myself, beside the async I/O issue raised by Tony Battersby.

I thought I'd better not wait longer to give warning that I do take
that issue seriously.

> 
> Shortlog of changes since v1:
>  - Dropped the "exclusive reference" idea
>    Now sealing is a one-shot operation. Once a given seal is set, you cannot
>    remove this seal again, ever. This allows us to drop all the ref-count
>    checking and simplifies the code a lot. We also no longer have all the
>    races we have to test for.
>  - The i_writecount fix is now upstream (slightly different, by Al Viro) so I
>    dropped it from the series.
>  - Change SHMEM_* prefix to F_* to avoid any API-association to shmem.
>  - Sealing is disabled on all files by default (even though we still haven't
>    found any DoS attack). You need to pass MFD_ALLOW_SEALING to memfd_create()
>    to get an object that supports the sealing API.
>  - Changed F_SET_SEALS to F_ADD_SEALS. This better reflects the API. You can
>    never remove seals, you can only add seals. Note that the semantics also
>    changed slightly: You can now _always_ call F_ADD_SEALS to add _more_
>    seals. However, a new seal was added which "seals sealing" (F_SEAL_SEAL).
>    So once F_SEAL_SEAL is set, F_ADD_SEAL is no longer allowed.
>    This feature was requested by the glib developers.
>  - memfd_create() names are now limited to NAME_MAX instead of 256 hardcoded.
>  - Rewrote the test suite
> 
> The biggest change in v2 is the removal of the "exclusive reference" idea. It
> was a nice optimization, but the implementation was ugly and racy regarding
> file-table changes. Linus didn't like it either so we decided to drop it
> entirely. Sealing is a one-shot operation now. A sealed file can never be
> unsealed, even if you're the only holder.
> 
> I also addressed most of the concerns regarding API naming and semantics. I 
> got
> feedback from glib, EFL, wayland, kdbus, ostree, audio developers and we
> discussed many possible use-cases (and also cases that don't make sense). So I
> think we're in a very good state right now.
> 
> People requested to make this interface more generic. I renamed the API to
> reflect that, but I didn't change the implementation. Thing is, seals cannot 
> be
> removed, ever. Therefore, semantics for sealing on non-volatile storage are
> undefined. We don't write them to disc and it is unclear whether a sealed file
> can be unlinked/removed again. There're more issues with this and no-one came 
> up
> with a use-case, hence I didn't bother implementing it.
> There's also an ongoing discussion about an AIO race, but this also affects
> other inode-protections like S_IMMUTABLE/etc. So I don't think we should tie
> the fix to this series.

I disagree on that.

Whatever the bugs or limitations with S_IMMUTABLE, ETXTBSY etc,
we have lived with those without complaint for many years.

You now propose an entirely new kind of guarantee, but that guarantee
is broken by the possibility of outstanding async I/O to a page of the
sealed object.

I don't see how we can add the new feature while knowing it broken.  We
have to devise a solution, but I haven't thought of a good solution yet.

Checking page counts in a GB file prior to sealing does not appeal at
all: we'd be lucky ever to find them all accounted for.  Adding overhead
to get_user_pages_fast() won't appeal to its adherents, and I'm not even
convinced that GUP is the only way in here.

Any ideas?

> Another discussion was about preventing /proc/self/fd/. But again, no-one 
> could
> tell me _why_, so I didn't bother. On the contrary, I even provided several
> use-cases that make use of /proc/self/fd/ to get read-only FDs to pass around.
> 
> If anyone wants to test this, please use 3.15-rc1 as base. The i_writecount
> fixes are required for this series.
> 
> Comments welcome!
> David
> 
> David Herrmann (3):
>   shm: add sealing API
>   shm: add memfd_create() syscall
>   selftests: add memfd_create() + sealing tests
> 
>  arch/x86/syscalls/syscall_32.tbl   |   1 +
>  arch/x86/syscalls/syscall_64.tbl   |   1 +
>  fs/fcntl.c |   5 +
>  include/linux/shmem_fs.h   |  20 +
>  include/linux/syscalls.h   |   1 +
>  include/uapi/linux/fcntl.h |  15 +
>  include/uapi/linux/memfd.h |  10 +
>  kernel/sys_ni.c|   1 +
>  mm/shmem.c
