Re: [PATCH v2 0/3] File Sealing & memfd_create()
On Mon, Jun 02, 2014 at 06:04:02PM +0200, David Herrmann wrote:
> Correct, the problem is not accounting for pinned-pages, but waiting
> for them to get released. Furthermore, Peter's patches make VM_PINNED
> an optional feature, so we'd still miss all the short-term GUP users.
> Sadly, that means we cannot even use it to test for pending GUP users.

Right, I'm not bothered about temporary pins; they'll go away quickly
and generally not bother reclaim and the like (much). The thing I
'worry' about is the persistent pins, which have unbounded life spans.
Re: [PATCH v2 0/3] File Sealing & memfd_create()
Hi

On Mon, Jun 2, 2014 at 11:14 AM, Jan Kara wrote:
> On Mon 02-06-14 13:42:59, Minchan Kim wrote:
>> On Mon, May 19, 2014 at 01:44:25PM +0200, David Herrmann wrote:
>> > I tried doing the page-replacement in the last 4 days, but honestly,
>> > it's far more complex than I thought. So if no-one more experienced
>> > with mm/ comes up with a simple implementation, I'll have to delay
>> > this for some more weeks.
>> >
>> > However, I still wonder why we try to fix this as part of this
>> > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
>> > amount of time. Same is true for network block-devices, NFS, iscsi,
>> > maybe loop-devices, ... This means, _any_ once mapped page can be
>> > written to after an arbitrary delay. This can break any feature that
>> > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
>> > sealing, ..).
>> >
>> > Shouldn't we try to fix the _cause_ of this?
>>
>> I didn't follow this patchset and couldn't find what your main concern
>> is, but at a first glance, it seems you are having trouble with pinned
>> pages. If so, it's a really big problem for CMA and I think peterz's
>> approach (ie, mm_mpin) really makes sense to me.
>
> Well, his concern is pinned pages (and also pages used for direct IO and
> similar), but not because they are pinned but because they can be
> modified while someone holds a reference to them. So I'm not sure
> Peter's patches will help here.

Correct, the problem is not accounting for pinned-pages, but waiting
for them to get released. Furthermore, Peter's patches make VM_PINNED
an optional feature, so we'd still miss all the short-term GUP users.
Sadly, that means we cannot even use it to test for pending GUP users.

Thanks
David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/3] File Sealing & memfd_create()
On Mon 02-06-14 13:42:59, Minchan Kim wrote:
> Hello,
>
> On Mon, May 19, 2014 at 01:44:25PM +0200, David Herrmann wrote:
> > Hi
> >
> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins wrote:
> > > The aspect which really worries me is this: the maintenance burden.
> > > This approach would add some peculiar new code, introducing a rare
> > > special case: which we might get right today, but will very easily
> > > forget tomorrow when making some other changes to mm. If we compile
> > > a list of danger areas in mm, this would surely belong on that list.
> >
> > I tried doing the page-replacement in the last 4 days, but honestly,
> > it's far more complex than I thought. So if no-one more experienced
> > with mm/ comes up with a simple implementation, I'll have to delay
> > this for some more weeks.
> >
> > However, I still wonder why we try to fix this as part of this
> > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> > amount of time. Same is true for network block-devices, NFS, iscsi,
> > maybe loop-devices, ... This means, _any_ once mapped page can be
> > written to after an arbitrary delay. This can break any feature that
> > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> > sealing, ..).
> >
> > Shouldn't we try to fix the _cause_ of this?
>
> I didn't follow this patchset and couldn't find what your main concern
> is, but at a first glance, it seems you are having trouble with pinned
> pages. If so, it's a really big problem for CMA and I think peterz's
> approach (ie, mm_mpin) really makes sense to me.

Well, his concern is pinned pages (and also pages used for direct IO and
similar), but not because they are pinned but because they can be modified
while someone holds a reference to them. So I'm not sure Peter's patches
will help here.
> https://lkml.org/lkml/2014/5/26/340

								Honza
--
Jan Kara
SUSE Labs, CR
Re: [PATCH v2 0/3] File Sealing & memfd_create()
Hello,

On Mon, May 19, 2014 at 01:44:25PM +0200, David Herrmann wrote:
> Hi
>
> On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins wrote:
> > The aspect which really worries me is this: the maintenance burden.
> > This approach would add some peculiar new code, introducing a rare
> > special case: which we might get right today, but will very easily
> > forget tomorrow when making some other changes to mm. If we compile
> > a list of danger areas in mm, this would surely belong on that list.
>
> I tried doing the page-replacement in the last 4 days, but honestly,
> it's far more complex than I thought. So if no-one more experienced
> with mm/ comes up with a simple implementation, I'll have to delay
> this for some more weeks.
>
> However, I still wonder why we try to fix this as part of this
> patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> amount of time. Same is true for network block-devices, NFS, iscsi,
> maybe loop-devices, ... This means, _any_ once mapped page can be
> written to after an arbitrary delay. This can break any feature that
> makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> sealing, ..).
>
> Shouldn't we try to fix the _cause_ of this?

I didn't follow this patchset and couldn't find what your main concern
is, but at a first glance, it seems you are having trouble with pinned
pages. If so, it's a really big problem for CMA and I think peterz's
approach (ie, mm_mpin) really makes sense to me.

https://lkml.org/lkml/2014/5/26/340

> Isn't there a simple way to lock/mark/.. affected vmas in
> get_user_pages(_fast)() and release them once done? We could increase
> i_mmap_writable on all affected address_space and decrease it on
> release. This would at least prevent sealing and could be checked on
> other operations, too (like setting S_IMMUTABLE).
> This should be as easy as checking page_mapping(page) != NULL and then
> adjusting ->i_mmap_writable in
> get_writable_user_pages/put_writable_user_pages, right?
>
> Thanks
> David

--
Kind regards,
Minchan Kim
Re: [PATCH v2 0/3] File Sealing & memfd_create()
On Mon, 26 May 2014, David Herrmann wrote:
>
> (CC migrate.c committers)
>
> On Tue, May 20, 2014 at 12:11 AM, Hugh Dickins wrote:
> > On Mon, 19 May 2014, Jan Kara wrote:
> >> On Mon 19-05-14 13:44:25, David Herrmann wrote:
> >> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins wrote:
> >> > > The aspect which really worries me is this: the maintenance burden.
> >> > > This approach would add some peculiar new code, introducing a rare
> >> > > special case: which we might get right today, but will very easily
> >> > > forget tomorrow when making some other changes to mm. If we compile
> >> > > a list of danger areas in mm, this would surely belong on that list.
> >> >
> >> > I tried doing the page-replacement in the last 4 days, but honestly,
> >> > it's far more complex than I thought. So if no-one more experienced
> >
> > To be honest, I'm quite glad to hear that: it is still a solution worth
> > considering, but I'd rather continue the search for a better solution.
>
> What if we set VM_IO for memory-mappings if a file supports sealing?
> That might be a hack and quite restrictive, but we could also add a
> VM_DONTPIN flag that just prevents any page-pinning like GUP (which is
> also a side-effect of VM_IO). This is basically what we do to protect
> PCI BARs from that race during hotplug (well, VM_PFNMAP is what
> protects those, but the code is the same). If we mention in the
> man-page that memfd-objects don't support direct-IO, we'd be fine, I
> think. Not sure if that hack is better than the page-replacement,
> though. It'd be definitely much simpler.

I admire your resourcefulness, but three things put me off using VM_IO.

One is just a great distaste for using VM_IO in mm, when it's about
device and memory regions unknown to mm. You can indeed get around
that objection by making a new VM_DONTPIN with similar restrictions.

Two is that I'm afraid of those restrictions which you foresee. Soon
after putting this in, we shall have people saying "I can't do
such-and-such on my memfd-object, can you please enable it" and then
we shall have to find another solution.

But three, more importantly, that gives no protection against
get_user_pages_fast() callers: as mentioned before, GUP-fast works on
page tables, not examining vmas at all, so it is blind to VM_IO.
get_user_pages_fast() does check pte access, and the pte_special bit,
falling back to get_user_pages() if unsuitable: IIRC, pte_special
support is important for an architecture to support GUP-fast. But I
definitely don't want any shmem to be !vm_normal_page().

> Regarding page-replacement, I tried using migrate_page(), however,
> this obviously fails in page_freeze_refs() due to the elevated
> ref-count and we cannot account for those, as they might vanish
> asynchronously. Now I wonder whether we could just add a new mode
> MIGRATE_PHASE_OUT that avoids freezing the page and forces the copy.
> Existing refs would still operate on the old page, but any new access
> gets the new page. This way, we could collect pages with elevated
> ref-counts in shmem similar to do_move_page_to_node_array() and then
> call migrate_pages(). Now migrate_pages() takes good care to prevent
> any new refs during migration. try_to_unmap(TTU_MIGRATION) marks PTEs
> as 'in-migration', so accesses are delayed. Page-faults wait on the
> page-lock and retry due to mapping==NULL. lru is disabled beforehand.
> Therefore, there cannot be any racing page-lookups as they all stall
> on the migration. Moreover, page_freeze_refs() fails only if the page
> is pinned by independent users (usually some form of I/O).
> Question is what those additional ref-counts might be. Given that
> shmem 'owns' its pages, none of these external references should pass
> those refs around. All they use it for is I/O. Therefore, we shouldn't
> even need an additional try_to_unmap() _after_ MIGRATE_PHASE_OUT as we
> expect those external refs to never pass page-refs around. If that's a
> valid assumption (and I haven't found any offenders so far), we should
> be good with migrate_pages(MIGRATE_PHASE_OUT) as I described.
>
> Comments?

I have not given these details all the attention that they deserve.
It's Tony's copy-on-seal suggestion, fleshed out to use the migration
infrastructure; but I still feel the same way about it as I did before.
It's complexity (and complexity outside of mm/shmem.c) that I would
prefer to avoid; and I find it hard to predict the consequence of this
copy-on-write-in-reverse, where existing users are left holding
something that used to be part of the object and now is not. Perhaps
the existing sudden-truncation codepaths would handle it appropriately,
perhaps they would not.

(I wonder if Daniel Phillips's Tux3 "page fork"ing is relevant here?
But don't want to explore that right now. "Revoke" also comes to mind.)

I do think it may turn out to be a good future way forward, if what we
do now turns out to be too restrictive; but we don't want to turn mm
(slightly) upside down just to get sealing in.
Re: [PATCH v2 0/3] File Sealing & memfd_create()
Hi

(CC migrate.c committers)

On Tue, May 20, 2014 at 12:11 AM, Hugh Dickins wrote:
> On Mon, 19 May 2014, Jan Kara wrote:
>> On Mon 19-05-14 13:44:25, David Herrmann wrote:
>> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins wrote:
>> > > The aspect which really worries me is this: the maintenance burden.
>> > > This approach would add some peculiar new code, introducing a rare
>> > > special case: which we might get right today, but will very easily
>> > > forget tomorrow when making some other changes to mm. If we compile
>> > > a list of danger areas in mm, this would surely belong on that list.
>> >
>> > I tried doing the page-replacement in the last 4 days, but honestly,
>> > it's far more complex than I thought. So if no-one more experienced
>
> To be honest, I'm quite glad to hear that: it is still a solution worth
> considering, but I'd rather continue the search for a better solution.

What if we set VM_IO for memory-mappings if a file supports sealing?
That might be a hack and quite restrictive, but we could also add a
VM_DONTPIN flag that just prevents any page-pinning like GUP (which is
also a side-effect of VM_IO). This is basically what we do to protect
PCI BARs from that race during hotplug (well, VM_PFNMAP is what
protects those, but the code is the same). If we mention in the
man-page that memfd-objects don't support direct-IO, we'd be fine, I
think. Not sure if that hack is better than the page-replacement,
though. It'd be definitely much simpler.

Regarding page-replacement, I tried using migrate_page(), however,
this obviously fails in page_freeze_refs() due to the elevated
ref-count and we cannot account for those, as they might vanish
asynchronously. Now I wonder whether we could just add a new mode
MIGRATE_PHASE_OUT that avoids freezing the page and forces the copy.
Existing refs would still operate on the old page, but any new access
gets the new page. This way, we could collect pages with elevated
ref-counts in shmem similar to do_move_page_to_node_array() and then
call migrate_pages(). Now migrate_pages() takes good care to prevent
any new refs during migration. try_to_unmap(TTU_MIGRATION) marks PTEs
as 'in-migration', so accesses are delayed. Page-faults wait on the
page-lock and retry due to mapping==NULL. lru is disabled beforehand.
Therefore, there cannot be any racing page-lookups as they all stall
on the migration. Moreover, page_freeze_refs() fails only if the page
is pinned by independent users (usually some form of I/O).
Question is what those additional ref-counts might be. Given that
shmem 'owns' its pages, none of these external references should pass
those refs around. All they use it for is I/O. Therefore, we shouldn't
even need an additional try_to_unmap() _after_ MIGRATE_PHASE_OUT as we
expect those external refs to never pass page-refs around. If that's a
valid assumption (and I haven't found any offenders so far), we should
be good with migrate_pages(MIGRATE_PHASE_OUT) as I described.

Comments?

While skimming over migrate.c I noticed two odd behaviors:

1) migration_entry_wait() is used to wait on a migration to finish,
before accessing PTE entries. However, we call get_page() there, which
increases the ref-count of the old page and causes page_freeze_refs()
to fail. There's no way we can know how many tasks wait on a migration
entry when calling page_freeze_refs(). I have no idea how that's
supposed to work? Why don't we store the new page in the migration-swp
entry so any lookups stall on the new page? We don't care for
ref-counts on that page and if the migration fails, new->mapping is
set to NULL and any lookup is retried. remove_migration_pte() can
restore the old page correctly.

2) remove_migration_pte() calls get_page(new) before writing the PTE.
But who releases the ref of the old page?
Thanks
David
Re: [PATCH v2 0/3] File Sealing & memfd_create()
On Mon, 19 May 2014, Jan Kara wrote: > On Mon 19-05-14 13:44:25, David Herrmann wrote: > > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins wrote: > > > The aspect which really worries me is this: the maintenance burden. > > > This approach would add some peculiar new code, introducing a rare > > > special case: which we might get right today, but will very easily > > > forget tomorrow when making some other changes to mm. If we compile > > > a list of danger areas in mm, this would surely belong on that list. > > > > I tried doing the page-replacement in the last 4 days, but honestly, > > it's far more complex than I thought. So if no-one more experienced To be honest, I'm quite glad to hear that: it is still a solution worth considering, but I'd rather continue the search for a better solution. > > with mm/ comes up with a simple implementation, I'll have to delay > > this for some more weeks. > > > > However, I still wonder why we try to fix this as part of this > > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary > > amount of time. Same is true for network block-devices, NFS, iscsi, > > maybe loop-devices, ... This means, _any_ once mapped page can be > > written to after an arbitrary delay. This can break any feature that > > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE, > > sealing, ..). We need to fix it together with your sealing patchset, because your patchset is all about introducing a new kind of guarantee: a guarantee which this async i/o issue makes impossible to give, as things stand. Exasperating for you, I understand; but that's how it is. A new feature may make new demands on the infrastructure. I can imagine existing problems, but (I may be out of touch) I have not heard of them as problems in practice. 
Certainly they would not be recent regressions: mm-page versus fs-file has worked in this way for as long as I've known them (pages released independently of unmapping the file, with the understanding that i/o might still be in progress, so care taken not to free the pages too soon). > > > > Shouldn't we try to fix the _cause_ of this? Nobody is against fixing the cause: we are all looking for the simplest way of doing so, > > > > Isn't there a simple way to lock/mark/.. affected vmas in > > get_user_pages(_fast)() and release them once done? We could increase > > i_mmap_writable on all affected address_space and decrease it on > > release. This would at least prevent sealing and could be check on > > other operations, too (like setting S_IMMUTABLE). > > This should be as easy as checking page_mapping(page) != NULL and then > > adjusting ->i_mmap_writable in > > get_writable_user_pages/put_writable_user_pages, right? > Doing this would be quite a bit of work. Currently references returned by > get_user_pages() are page references like any other and thus are released > by put_page() or similar. Now you would make them special and they need > special releasing and there are lots of places in kernel where > get_user_pages() is used that would need changing. Lots of places that would need changing, yes; but we have often wondered in the past whether there should be a put_user_pages(). Though I'm not sure that it would actually solve anything... > > Another aspect is that it could have performance implications - if there > are several processes using get_user_pages[_fast]() on a file, they would > start contending on modifying i_mmap_writeable. Doing extra vma work in get_user_pages() wouldn't be so bad. But doing any vma work in get_user_pages_fast() would upset almost all its users: get_user_pages_fast() is a fast-path which expressly avoids the vmas, and hates additional cachelines being added to its machinations. 
If sealing had appeared before get_user_pages_fast(), maybe we wouldn't have let get_user_pages_fast() in; but now it's the other way around. I would be more interested in attacking from the get_user_pages() and get_user_pages_fast() end, if I could convince myself that they do actually delimit the problem; maybe they do, but I'm not yet convinced. > > One somewhat crazy idea I have is that maybe we could delay unmapping of a > page if this was last VMA referencing it until all extra page references of > pages in there are dropped. That would make i_mmap_writeable reliable for > you and it would also close those races with remount. Hugh, do you think > this might be viable? It is definitely worth pursuing further, but I'm not very hopeful on it. In a world of free page flags and free struct page fields, maybe. (And I don't see sealing as a feature sensibly restricted to 64-bit only.) I think we would have to set a page flag, maybe bump a count, for every leftover page that raises i_mmap_writable; and lower it (potentially from interrupt context) at put_page() time. Easy to make i_mmap_writable an atomic rather than guarded by i_mmap_mutex, but we still need to synchronize on it falling to 0. And how would we recognize the relevant, decrementing, put_page()?
Re: [PATCH v2 0/3] File Sealing & memfd_create()
On Mon 19-05-14 13:44:25, David Herrmann wrote: > Hi > > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins wrote: > > The aspect which really worries me is this: the maintenance burden. > > This approach would add some peculiar new code, introducing a rare > > special case: which we might get right today, but will very easily > > forget tomorrow when making some other changes to mm. If we compile > > a list of danger areas in mm, this would surely belong on that list. > > I tried doing the page-replacement in the last 4 days, but honestly, > it's far more complex than I thought. So if no-one more experienced > with mm/ comes up with a simple implementation, I'll have to delay > this for some more weeks. > > However, I still wonder why we try to fix this as part of this > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary > amount of time. Same is true for network block-devices, NFS, iscsi, > maybe loop-devices, ... This means, _any_ once mapped page can be > written to after an arbitrary delay. This can break any feature that > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE, > sealing, ..). > > Shouldn't we try to fix the _cause_ of this? > > Isn't there a simple way to lock/mark/.. affected vmas in > get_user_pages(_fast)() and release them once done? We could increase > i_mmap_writable on all affected address_space and decrease it on > release. This would at least prevent sealing and could be check on > other operations, too (like setting S_IMMUTABLE). > This should be as easy as checking page_mapping(page) != NULL and then > adjusting ->i_mmap_writable in > get_writable_user_pages/put_writable_user_pages, right? Doing this would be quite a bit of work. Currently references returned by get_user_pages() are page references like any other and thus are released by put_page() or similar. 
Now you would make them special and they need special releasing and there are lots of places in kernel where get_user_pages() is used that would need changing. Another aspect is that it could have performance implications - if there are several processes using get_user_pages[_fast]() on a file, they would start contending on modifying i_mmap_writeable. One somewhat crazy idea I have is that maybe we could delay unmapping of a page if this was last VMA referencing it until all extra page references of pages in there are dropped. That would make i_mmap_writeable reliable for you and it would also close those races with remount. Hugh, do you think this might be viable? Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH v2 0/3] File Sealing & memfd_create()
Hi On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins wrote: > The aspect which really worries me is this: the maintenance burden. > This approach would add some peculiar new code, introducing a rare > special case: which we might get right today, but will very easily > forget tomorrow when making some other changes to mm. If we compile > a list of danger areas in mm, this would surely belong on that list. I tried doing the page-replacement in the last 4 days, but honestly, it's far more complex than I thought. So if no-one more experienced with mm/ comes up with a simple implementation, I'll have to delay this for some more weeks. However, I still wonder why we try to fix this as part of this patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary amount of time. Same is true for network block-devices, NFS, iscsi, maybe loop-devices, ... This means, _any_ once mapped page can be written to after an arbitrary delay. This can break any feature that makes FS objects read-only (remounting read-only, setting S_IMMUTABLE, sealing, ..). Shouldn't we try to fix the _cause_ of this? Isn't there a simple way to lock/mark/.. affected vmas in get_user_pages(_fast)() and release them once done? We could increase i_mmap_writable on all affected address_space and decrease it on release. This would at least prevent sealing and could be checked on other operations, too (like setting S_IMMUTABLE). This should be as easy as checking page_mapping(page) != NULL and then adjusting ->i_mmap_writable in get_writable_user_pages/put_writable_user_pages, right? Thanks David
Re: [PATCH v2 0/3] File Sealing & memfd_create()
On Wed, 14 May 2014, Tony Battersby wrote: > Hugh Dickins wrote: > > Checking page counts in a GB file prior to sealing does not appeal at > > all: we'd be lucky ever to find them all accounted for. > > Here is a refinement of that idea: during a seal operation, iterate over > all the pages in the file and check their refcounts. On any page that > has an unexpected extra reference, allocate a new page, copy the data > over to the new page, and then replace the page having the extra > reference with the newly-allocated page in the file. That way you still > get zero-copy on pages that don't have extra references, and you don't > have to fail the seal operation if some of the pages are still being > referenced by something else. That does seem a more promising idea than any that I'd had: thank you. But whether it can actually be made to work (safely) is not yet clear to me. It would be rather like page migration; but whereas page migration backs off whenever the page count cannot be fully accounted for (as does KSM), that is precisely when this would have to act. Taking action in the case of ignorance does not make me feel very comfortable. Page lock and radix tree lock would guard against many surprises, but not necessarily all. > > The downside of course is the extra memory usage and memcpy overhead if > something is holding extra references to the pages. So whether this is > a good approach depends on: > > *) Whether extra page references would happen frequently or infrequently > under various kernel configurations and usage scenarios. I don't know > enough about the mm system to answer this myself. > > *) Whether or not the extra memory usage and memcpy overhead could be > considered a DoS attack vector by someone who has found a way to add > extra references to the pages intentionally. I may just be too naive on such issues, but neither of those worries me particularly. 
If something can already add an extra pin to many pages, that is already a concern for memory usage. The sealing case would double its scale, but I don't see that as a new issue. The aspect which really worries me is this: the maintenance burden. This approach would add some peculiar new code, introducing a rare special case: which we might get right today, but will very easily forget tomorrow when making some other changes to mm. If we compile a list of danger areas in mm, this would surely belong on that list. Hugh
Re: [PATCH v2 0/3] File Sealing & memfd_create()
Hugh Dickins wrote: > Checking page counts in a GB file prior to sealing does not appeal at > all: we'd be lucky ever to find them all accounted for. Here is a refinement of that idea: during a seal operation, iterate over all the pages in the file and check their refcounts. On any page that has an unexpected extra reference, allocate a new page, copy the data over to the new page, and then replace the page having the extra reference with the newly-allocated page in the file. That way you still get zero-copy on pages that don't have extra references, and you don't have to fail the seal operation if some of the pages are still being referenced by something else. The downside of course is the extra memory usage and memcpy overhead if something is holding extra references to the pages. So whether this is a good approach depends on: *) Whether extra page references would happen frequently or infrequently under various kernel configurations and usage scenarios. I don't know enough about the mm system to answer this myself. *) Whether or not the extra memory usage and memcpy overhead could be considered a DoS attack vector by someone who has found a way to add extra references to the pages intentionally. Tony Battersby
Re: [PATCH v2 0/3] File Sealing & memfd_create()
On Tue, 15 Apr 2014, David Herrmann wrote: > > This is v2 of the File-Sealing and memfd_create() patches. You can find v1 > with > a longer introduction at gmane: > http://thread.gmane.org/gmane.comp.video.dri.devel/102241 > An LWN article about memfd+sealing is available, too: > https://lwn.net/Articles/593918/ Sorry it's taken so long: at last I managed to set aside a few hours at the weekend, to read through your memfd+sealing work and let it sink in. Good stuff. I've a page of notes which I shall respond with, either later in the week or at the weekend; but they're pretty much trivia, or notes to myself, beside the async I/O issue raised by Tony Battersby. I thought I'd better not wait longer to give warning that I do take that issue seriously. > > Shortlog of changes since v1: > - Dropped the "exclusive reference" idea >Now sealing is a one-shot operation. Once a given seal is set, you cannot >remove this seal again, ever. This allows us to drop all the ref-count >checking and simplifies the code a lot. We also no longer have all the > races >we have to test for. > - The i_writecount fix is now upstream (slightly different, by Al Viro) so I >dropped it from the series. > - Change SHMEM_* prefix to F_* to avoid any API-association to shmem. > - Sealing is disabled on all files by default (even though we still haven't >found any DoS attack). You need to pass MFD_ALLOW_SEALING to memfd_create() >to get an object that supports the sealing API. > - Changed F_SET_SEALS to F_ADD_SEALS. This better reflects the API. You can >never remove seals, you can only add seals. Note that the semantics also >changed slightly: You can now _always_ call F_ADD_SEALS to add _more_ > seals. >However, a new seal was added which "seals sealing" (F_SEAL_SEAL). So once >F_SEAL_SEAL is set, F_ADD_SEAL is no longer allowed. >This feature was requested by the glib developers. > - memfd_create() names are now limited to NAME_MAX instead of 256 hardcoded. 
> - Rewrote the test suite > > The biggest change in v2 is the removal of the "exclusive reference" idea. It > was a nice optimization, but the implementation was ugly and racy regarding > file-table changes. Linus didn't like it either so we decided to drop it > entirely. Sealing is a one-shot operation now. A sealed file can never be > unsealed, even if you're the only holder. > > I also addressed most of the concerns regarding API naming and semantics. I > got > feedback from glib, EFL, wayland, kdbus, ostree, audio developers and we > discussed many possible use-cases (and also cases that don't make sense). So I > think we're in a very good state right now. > > People requested to make this interface more generic. I renamed the API to > reflect that, but I didn't change the implementation. Thing is, seals cannot > be > removed, ever. Therefore, semantics for sealing on non-volatile storage are > undefined. We don't write them to disc and it is unclear whether a sealed file > can be unlinked/removed again. There're more issues with this and no-one came > up > with a use-case, hence I didn't bother implementing it. > There's also an ongoing discussion about an AIO race, but this also affects > other inode-protections like S_IMMUTABLE/etc. So I don't think we should tie > the fix to this series. I disagree on that. Whatever the bugs or limitations with S_IMMUTABLE, ETXTBSY etc, we have lived with those without complaint for many years. You now propose an entirely new kind of guarantee, but that guarantee is broken by the possibility of outstanding async I/O to a page of the sealed object. I don't see how we can add the new feature while knowing it broken. We have to devise a solution, but I haven't thought of a good solution yet. Checking page counts in a GB file prior to sealing does not appeal at all: we'd be lucky ever to find them all accounted for. 
Adding overhead to get_user_pages_fast() won't appeal to its adherents, and I'm not even convinced that GUP is the only way in here. Any ideas? > Another discussion was about preventing /proc/self/fd/. But again, no-one > could > tell me _why_, so I didn't bother. On the contrary, I even provided several > use-cases that make use of /proc/self/fd/ to get read-only FDs to pass around. > > If anyone wants to test this, please use 3.15-rc1 as base. The i_writecount > fixes are required for this series. > > Comments welcome! > David > > David Herrmann (3): > shm: add sealing API > shm: add memfd_create() syscall > selftests: add memfd_create() + sealing tests > > arch/x86/syscalls/syscall_32.tbl | 1 + > arch/x86/syscalls/syscall_64.tbl | 1 + > fs/fcntl.c | 5 + > include/linux/shmem_fs.h | 20 + > include/linux/syscalls.h | 1 + > include/uapi/linux/fcntl.h | 15 + > include/uapi/linux/memfd.h | 10 + >
Re: [PATCH v2 0/3] File Sealing & memfd_create()
On Tue, 15 Apr 2014, David Herrmann wrote:
> This is v2 of the File-Sealing and memfd_create() patches. You can find v1
> with a longer introduction at gmane:
>   http://thread.gmane.org/gmane.comp.video.dri.devel/102241
> An LWN article about memfd+sealing is available, too:
>   https://lwn.net/Articles/593918/

Sorry it's taken so long: at last I managed to set aside a few hours at the
weekend, to read through your memfd+sealing work and let it sink in.

Good stuff. I've a page of notes which I shall respond with, either later in
the week or at the weekend; but they're pretty much trivia, or notes to
myself, beside the async I/O issue raised by Tony Battersby. I thought I'd
better not wait longer to give warning that I do take that issue seriously.

> Shortlog of changes since v1:
>  - Dropped the exclusive reference idea
>    Now sealing is a one-shot operation. Once a given seal is set, you cannot
>    remove this seal again, ever. This allows us to drop all the ref-count
>    checking and simplifies the code a lot. We also no longer have all the
>    races we have to test for.
>  - The i_writecount fix is now upstream (slightly different, by Al Viro) so
>    I dropped it from the series.
>  - Changed SHMEM_* prefix to F_* to avoid any API-association to shmem.
>  - Sealing is disabled on all files by default (even though we still haven't
>    found any DoS attack). You need to pass MFD_ALLOW_SEALING to
>    memfd_create() to get an object that supports the sealing API.
>  - Changed F_SET_SEALS to F_ADD_SEALS. This better reflects the API. You
>    can never remove seals, you can only add seals. Note that the semantics
>    also changed slightly: You can now _always_ call F_ADD_SEALS to add
>    _more_ seals. However, a new seal was added which seals sealing
>    (F_SEAL_SEAL). So once F_SEAL_SEAL is set, F_ADD_SEALS is no longer
>    allowed. This feature was requested by the glib developers.
>  - memfd_create() names are now limited to NAME_MAX instead of 256
>    hardcoded.
>  - Rewrote the test suite
>
> The biggest change in v2 is the removal of the "exclusive reference" idea.
> It was a nice optimization, but the implementation was ugly and racy
> regarding file-table changes. Linus didn't like it either so we decided to
> drop it entirely. Sealing is a one-shot operation now. A sealed file can
> never be unsealed, even if you're the only holder.
>
> I also addressed most of the concerns regarding API naming and semantics.
> I got feedback from glib, EFL, wayland, kdbus, ostree, audio developers
> and we discussed many possible use-cases (and also cases that don't make
> sense). So I think we're in a very good state right now.
>
> People requested to make this interface more generic. I renamed the API to
> reflect that, but I didn't change the implementation. Thing is, seals
> cannot be removed, ever. Therefore, semantics for sealing on non-volatile
> storage are undefined. We don't write them to disk and it is unclear
> whether a sealed file can be unlinked/removed again. There are more issues
> with this and no-one came up with a use-case, hence I didn't bother
> implementing it.
>
> There's also an ongoing discussion about an AIO race, but this also
> affects other inode protections like S_IMMUTABLE/etc. So I don't think we
> should tie the fix to this series.

I disagree on that. Whatever the bugs or limitations with S_IMMUTABLE,
ETXTBSY etc, we have lived with those without complaint for many years.
You now propose an entirely new kind of guarantee, but that guarantee is
broken by the possibility of outstanding async I/O to a page of the sealed
object. I don't see how we can add the new feature while knowing it broken.

We have to devise a solution, but I haven't thought of a good solution yet.
Checking page counts in a GB file prior to sealing does not appeal at all:
we'd be lucky ever to find them all accounted for. Adding overhead to
get_user_pages_fast() won't appeal to its adherents, and I'm not even
convinced that GUP is the only way in here. Any ideas?
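[Editorial note: for context on the race Hugh is describing — in the interface that was eventually merged, F_SEAL_WRITE is refused with EBUSY while a shared writable mapping of the memfd exists. The AIO/GUP page pins discussed here are precisely the kind of outstanding writer this check cannot see. A sketch of the visible half, against the merged API, not code from the thread:]

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Demonstrate that F_SEAL_WRITE is rejected while a shared writable
 * mapping of the memfd exists, and accepted once it is gone.  Pages
 * pinned by async I/O or get_user_pages() are the analogous references
 * that this check has no way to detect.  Returns 0 on expected behavior. */
static int seal_write_demo(void)
{
    int fd = memfd_create("demo", MFD_ALLOW_SEALING);
    if (fd < 0 || ftruncate(fd, 4096) < 0)
        return -1;

    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* A shared writable mapping is a visible writer: sealing must fail. */
    int r1 = fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE);
    int e1 = errno;

    munmap(p, 4096);

    /* No visible writers remain: sealing succeeds. */
    int r2 = fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE);

    close(fd);
    return (r1 == -1 && e1 == EBUSY && r2 == 0) ? 0 : -1;
}
```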
> Another discussion was about preventing /proc/self/fd/. But again, no-one
> could tell me _why_, so I didn't bother. On the contrary, I even provided
> several use-cases that make use of /proc/self/fd/ to get read-only FDs to
> pass around.
>
> If anyone wants to test this, please use 3.15-rc1 as base. The
> i_writecount fixes are required for this series.
>
> Comments welcome!
> David
>
> David Herrmann (3):
>   shm: add sealing API
>   shm: add memfd_create() syscall
>   selftests: add memfd_create() + sealing tests
>
>  arch/x86/syscalls/syscall_32.tbl |  1 +
>  arch/x86/syscalls/syscall_64.tbl |  1 +
>  fs/fcntl.c                       |  5 +
>  include/linux/shmem_fs.h         | 20 +
>  include/linux/syscalls.h         |  1 +
>  include/uapi/linux/fcntl.h       | 15 +
>  include/uapi/linux/memfd.h       | 10 +
>  kernel/sys_ni.c                  |  1 +
>  mm/shmem.c