Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Wed, May 11, 2005 at 05:53:36PM -0500, Timur Tabi wrote:
> Andrea Arcangeli wrote:
> > If the problem appears again even after the last fix for the COW I did
> > last year, then it means we have yet another bug to fix.
>
> All of my memory pinning test cases pass when I use get_user_pages() with
> kernels 2.6.7 and later.

Well then your problem was the COW bug, which was corrupting userland with
O_DIRECT too...
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrea Arcangeli wrote:
> If the problem appears again even after the last fix for the COW I did
> last year, then it means we have yet another bug to fix.

All of my memory pinning test cases pass when I use get_user_pages() with
kernels 2.6.7 and later.

--
Timur Tabi
Staff Software Engineer
[EMAIL PROTECTED]

One thing a Southern boy will never say is, "I don't think duct tape will
fix it." -- Ed Smylie, NASA engineer for Apollo 13
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Wed, May 11, 2005 at 09:42:24PM +0100, Hugh Dickins wrote:
> proposed patches) there is no such migration of pages; that we'd prefer
> to implement migration in such a way that mlock does not inhibit it
> (though there might prove to be strong arguments defeating that);
> and that get_user_pages _must_ prevent migration (and if there
> were already such migration, I'd be saying it _does_ prevent it).

Indeed, mlock is a virtual pin, and as such it is not guaranteed to always
prevent migration, while get_user_pages is a physical pin on the physical
page, so it has to prevent migration. I think the physical pin is better for
his case, since I guess IB would otherwise break (at least unless you have
some method hotplug can call to stop IB, adjust the IB DMA tracking, and
restart IB). For the short term, using only get_user_pages sounds simpler
IMHO.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Wed, May 11, 2005 at 04:12:41PM -0400, William Jordan wrote:
> If I am reading you correctly, you are saying that mlock currently
> prevents pages from migrating around to unfragment memory, but
> get_user_pages does not prevent this? If this is the case, this could

This is not the case. In fact, get_user_pages is a stronger pin than mlock.
But if you call it by hand and you plan to write to the page, you have to
use the "write=1" flag; this is fundamental if you want to write to the
physical page from userland while it's being tracked by IB DMA.

In short, you should not use mlock; you should use only
get_user_pages(write=1). If the problem appears again even after the last
fix for the COW I did last year, then it means we have yet another bug to
fix.

Using mlock for this is unnecessary. mlock is a "virtual" pin, and it
provides weaker guarantees than what you need. You need a _physical_ pin,
and get_user_pages(write=1) is the only thing that will give it to you.
write=0 is OK too if you're never, ever going to write to the page with the
CPU from userland.

In the old days there was the concept that get_user_pages wasn't a
"pte-pin", but that was in fact broken in the way COW was working with
threads. This is fixed now: it really is a "pte-pin" again (like in 2.2,
which never had the COW corruption bug!), even though the pte may
temporarily be set to swapcache or null. In current 2.6 you're guaranteed
that, even though the pte may temporarily be set to not-present, the next
minor fault will bring into memory the very same physical page that was
there before. At least unless you map the thing write-protected (i.e.
write=0) and you write to it from userland... ;)
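For reference, a sketch of the driver-side call pattern Andrea is
recommending, using the eight-argument get_user_pages() signature of this
2.6 era. This is a non-compilable fragment, not a complete module; `uaddr`,
`npages`, and `NPAGES` are hypothetical driver variables:

```c
/* Hedged sketch only: pin user pages for long-term DMA, write=1 so COW is
 * broken up front and the final physical page is the one pinned. */
struct page *pages[NPAGES];
int i, ret;

down_read(&current->mm->mmap_sem);
ret = get_user_pages(current, current->mm,
                     uaddr,      /* page-aligned userspace start */
                     npages,     /* number of pages to pin */
                     1,          /* write=1 */
                     0,          /* force=0 */
                     pages, NULL);
up_read(&current->mm->mmap_sem);

/* ... program the pinned pages into the HCA's DMA translation tables ... */

/* On teardown, drop each physical page reference. */
for (i = 0; i < ret; i++)
    page_cache_release(pages[i]);   /* the 2.6-era name for put_page() */
```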
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Wed, 11 May 2005, William Jordan wrote:
> On 5/7/05, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > > My understanding is that mlock() could in theory allow the page to be
> > > moved, but that currently nothing in the kernel would actually move
> > > it. However, that could change in the future to allow hot-swapping of
> > > RAM.
> >
> > That's my understanding too, that nothing currently does so. Aside from
> > hot-swapping RAM, there's also a need to be able to migrate pages around
> > RAM, either to unfragment memory allowing higher-order allocations to
> > succeed more often, or to get around extreme dmamem/normal-mem/highmem
> > imbalances without dedicating huge reserves. Those would more often
> > succeed if uninhibited by mlock.
>
> If I am reading you correctly, you are saying that mlock currently
> prevents pages from migrating around to unfragment memory, but
> get_user_pages does not prevent this?

No, not what I meant at all. I'm saying that currently (aside from proposed
patches) there is no such migration of pages; that we'd prefer to implement
migration in such a way that mlock does not inhibit it (though there might
prove to be strong arguments defeating that); and that get_user_pages _must_
prevent migration (and if there were already such migration, I'd be saying
it _does_ prevent it).

Hugh
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 5/7/05, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > My understanding is that mlock() could in theory allow the page to be
> > moved, but that currently nothing in the kernel would actually move it.
> > However, that could change in the future to allow hot-swapping of RAM.
>
> That's my understanding too, that nothing currently does so. Aside from
> hot-swapping RAM, there's also a need to be able to migrate pages around
> RAM, either to unfragment memory allowing higher-order allocations to
> succeed more often, or to get around extreme dmamem/normal-mem/highmem
> imbalances without dedicating huge reserves. Those would more often
> succeed if uninhibited by mlock.

Hugh,

If I am reading you correctly, you are saying that mlock currently prevents
pages from migrating around to unfragment memory, but get_user_pages does
not prevent this? If this is the case, this could very easily be the problem
Timur was experiencing. He is using get_user_pages to lock pages long term
(for the life of the process, beyond the bounds of a single system call). If
it is possible for a page to be migrated in physical memory during extreme
virtual memory pressure while the reference count is held with
get_user_pages, that would cause the problem where the hardware is no longer
mapped to the same page as the application.

BTW: In earlier kernels, I experienced the same issues in our IB drivers
when trying to pin pages using only get_user_pages.

--
Bill Jordan
InfiniCon Systems
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Sat, 7 May 2005, Timur Tabi wrote:
> > Oh, well, maybe, but what is the real problem?
> > Are you sure that copy-on-write doesn't come into it?
>
> No, but I do know that my test case doesn't call fork(), so it's
> reproducible without involving COW. Of course, I'm sure someone's going to
> tell me now that COW comes into effect even without fork(). If so, please
> explain.

I'll try. COW comes into effect whenever you're sharing a page and then need
to make private changes to it. Fork is one way of sharing (with ancestor and
descendant processes). Using the empty zero page is another way of sharing
(with all other processes, and with parts of your own address space, via a
readonly page full of zeroes). Using a file page from the page cache is
another way of sharing.

None of those is actually your case, but our test for whether a page is
shared has been inadequate: oversimplifying, if page_count is more than 1
then we have to assume it is shared and do the copy-on-write (if the
modifications are to be private). But there are various places where the
page_count is temporarily raised (e.g. while paging out), which we cannot
distinguish, so occasionally we'll copy on write even when it's not
necessary, because we lack the information to tell us so. In particular, of
course, get_user_pages raises page_count to pin the page: so making a page
appear shared when it's not shared at all.

> The short answer: under "extreme" memory pressure, the data inside a page
> pinned by get_user_pages() is swapped out, moved, or deleted (I'm not sure
> which). Some other data is placed into that physical location.
>
> By extreme memory pressure, I mean having the process allocate and touch
> as much memory as possible. Something like this:
>
> num_bytes = get_amount_of_physical_ram();
> char *p = malloc(num_bytes);
> for (i = 0; i < num_bytes; i++)
>     p[i] = 0;
>
> The above over-simplified code fails on earlier 2.6 kernels (or earlier
> versions of glibc that accompany most distros that use the earlier 2.6
> kernels). Either malloc() returns NULL, or the p[i]=0 part causes a
> segfault. I haven't bothered to trace down why. But when it does work, the
> page pinned by get_user_pages() changes.

Which has to be a bug with get_user_pages, which has no other purpose than
to pin the pages. I cannot criticize you for working around it to get your
app working on lots of releases, but what _we_ have to do is fix
get_user_pages - and it appears that Andrea did so a year ago.

I'm surprised if it's as simple as you describe (you do say over-simplified,
maybe the critical points have fallen out), since GUP users would have
complained long ago if it wasn't doing the job in normal cases of memory
pressure. Andrea's case does involve the process independently trying to
touch a page it has pinned for I/O with get_user_pages. Or (and I've only
just thought of this, and suspect it might be exactly your case) not touch,
but apply get_user_pages again to a page already so pinned (while memory
pressure has caused try_to_unmap_one temporarily to detach it from the user
address space - the aspect of the problem that Andrea's fix attacks).

> My understanding is that mlock() could in theory allow the page to be
> moved, but that currently nothing in the kernel would actually move it.
> However, that could change in the future to allow hot-swapping of RAM.

That's my understanding too, that nothing currently does so. Aside from
hot-swapping RAM, there's also a need to be able to migrate pages around
RAM, either to unfragment memory allowing higher-order allocations to
succeed more often, or to get around extreme dmamem/normal-mem/highmem
imbalances without dedicating huge reserves. Those would more often succeed
if uninhibited by mlock.

> So I need to take into account distro vendors that use an earlier kernel,
> like 2.6.5, and back-port the patch from 2.6.7. The distro vendor will
> keep the 2.6.5 version number, which is why I can't rely on it.
>
> I need to know exactly what the fix is, so that when I scan mm/rmap.c, I
> know what to look for. Currently, I look for this regex:
>
> try_to_unmap_one.*vm_area_struct
>
> which seems to work. However, now I think it's just a coincidence.

Perhaps any release based on 2.6.7 or above, or any release which mentions
"get_user_pages" in its mm/rmap.c or mm/objrmap.c?

> > By the way, please don't be worried when soon the try_to_unmap_one
> > comment and code that you identified above disappear. When I'm
> > back in patch submission mode, I'll be sending Andrew a patch which
> > removes it, instead reworking can_share_swap_page to rely on the
> > page_mapcount instead of page_count, which avoids the ironical
> > behaviour my comment refers to, and allows an awkward page migration
> > case to proceed (once unpinned). Andrea and I now both prefer this
> > page_mapcount approach.
>
> Ugh, that means my regex is probably going to break. Not only that, but I
> don't understand what you're saying either. Trying to understand the VM is
> really hard.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Hugh Dickins wrote:
> Oh, well, maybe, but what is the real problem?
> Are you sure that copy-on-write doesn't come into it?

No, but I do know that my test case doesn't call fork(), so it's
reproducible without involving COW. Of course, I'm sure someone's going to
tell me now that COW comes into effect even without fork(). If so, please
explain.

> I haven't reread through the whole thread, but my recollection is that
> you never quite said what the real problem is: you'd found some time ago
> that get_user_pages sometimes failed to pin the pages for your complex
> app, so were forced to mlock too; but couldn't provide any simple test
> case for the failure (which can indeed be a lot of work to devise), so we
> were all in the dark as to what went wrong.

The short answer: under "extreme" memory pressure, the data inside a page
pinned by get_user_pages() is swapped out, moved, or deleted (I'm not sure
which). Some other data is placed into that physical location.

By extreme memory pressure, I mean having the process allocate and touch as
much memory as possible. Something like this:

num_bytes = get_amount_of_physical_ram();
char *p = malloc(num_bytes);
for (i = 0; i < num_bytes; i++)
    p[i] = 0;

The above over-simplified code fails on earlier 2.6 kernels (or earlier
versions of glibc that accompany most distros that use the earlier 2.6
kernels). Either malloc() returns NULL, or the p[i]=0 part causes a
segfault. I haven't bothered to trace down why. But when it does work, the
page pinned by get_user_pages() changes.

> But you've now found that 2.6.7 and later kernels allow your app to work
> correctly without mlock, good. get_user_pages is certainly the right tool
> to use for such pinning. (On the question of whether mlock guarantees
> that user virtual addresses map to the same physical addresses, I prefer
> Arjan's view that it does not; but accept that there might prove to be
> difficulties in holding that position.)

My understanding is that mlock() could in theory allow the page to be
moved, but that currently nothing in the kernel would actually move it.
However, that could change in the future to allow hot-swapping of RAM.

> So, it works now, you've exonerated today's get_user_pages, and you've
> identified at least one get_user_pages fix which went in at that time: do
> we really need to chase this further?

My driver needs to support all 2.4 and 2.6 kernel versions. My makefile
scans the kernel source tree with 'grep' to identify various
characteristics, and I use #ifdefs to conditionally compile code depending
on what features are present in the kernel. I can't use the kernel version
number, because that's not reliable - distros will incorporate patches from
future kernels without changing the version ID. So I need to take into
account distro vendors that use an earlier kernel, like 2.6.5, and back-port
the patch from 2.6.7. The distro vendor will keep the 2.6.5 version number,
which is why I can't rely on it.

I need to know exactly what the fix is, so that when I scan mm/rmap.c, I
know what to look for. Currently, I look for this regex:

try_to_unmap_one.*vm_area_struct

which seems to work. However, now I think it's just a coincidence.

> By the way, please don't be worried when soon the try_to_unmap_one
> comment and code that you identified above disappear. When I'm back in
> patch submission mode, I'll be sending Andrew a patch which removes it,
> instead reworking can_share_swap_page to rely on the page_mapcount
> instead of page_count, which avoids the ironical behaviour my comment
> refers to, and allows an awkward page migration case to proceed (once
> unpinned). Andrea and I now both prefer this page_mapcount approach.

Ugh, that means my regex is probably going to break. Not only that, but I
don't understand what you're saying either. Trying to understand the VM is
really hard. I guess in this specific case, it doesn't really matter,
because calling mlock() when I should be calling get_user_pages() is not a
bad thing.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Sorry for not replying earlier (indeed, sorry for not joining in the wider
RDMA pinning discussion); I'm concentrating on other stuff at present.

On Fri, 6 May 2005, Timur Tabi wrote:
> Timur Tabi wrote:
> > I haven't gotten a reply to this question, but I've done my own
> > research, and I think I found the answer. Using my own test of
> > get_user_pages(), it appears that the fix was placed in 2.6.7. However,
> > I would like to know specifically what the fix is. Unfortunately,
> > tracking this stuff down is beyond my understanding of the Linux VM.
>
> I'm also still waiting for a reply to this question. Anyone?
>
> Upon doing some more research, I think the fix might be this code instead:

I believe you're right this time - I was rather puzzled by your earlier
choice, then unhelpfully forgot to reply and point you a few lines further
down to this comment, which does shout "get_user_pages fix" quite loudly.

> /*
>  * Don't pull an anonymous page out from under get_user_pages.
>  * GUP carefully breaks COW and raises page count (while holding
>  * page_table_lock, as we have here) to make sure that the page
>  * cannot be freed. If we unmap that page here, a user write
>  * access to the virtual address will bring back the page, but
>  * its raised count will (ironically) be taken to mean it's not
>  * an exclusive swap page, do_wp_page will replace it by a copy
>  * page, and the user never get to see the data GUP was holding
>  * the original page for.
>  */
> if (PageSwapCache(page) &&
>     page_count(page) != page->mapcount + 2) {
>     ret = SWAP_FAIL;
>     goto out_unmap;
> }
>
> Both this change and the other one I mentioned are new to 2.6.7. I
> suppose I could try applying these patches to the 2.6.6 kernel and see if
> anything improves, but that won't help me understand what's really going
> on.

There's a lot of change in the rmap area between 2.6.6 and 2.6.7, but you're
right that this is an isolated fix, which could in principle be applied to
earlier releases. Though I don't see it's worth doing now.

> The above comment almost sounds like it's a fix,

Almost? Sorry if my comment doesn't make it obvious it's a fix for a
get_user_pages issue - I rewrote Andrea Arcangeli's original comment. The
analysis and fix are his.

> but it talks about copy-on-write, which has nothing to do with the real
> problem.

Oh, well, maybe, but what is the real problem? Are you sure that
copy-on-write doesn't come into it?

I haven't reread through the whole thread, but my recollection is that you
never quite said what the real problem is: you'd found some time ago that
get_user_pages sometimes failed to pin the pages for your complex app, so
were forced to mlock too; but couldn't provide any simple test case for the
failure (which can indeed be a lot of work to devise), so we were all in the
dark as to what went wrong.

But you've now found that 2.6.7 and later kernels allow your app to work
correctly without mlock, good. get_user_pages is certainly the right tool to
use for such pinning. (On the question of whether mlock guarantees that user
virtual addresses map to the same physical addresses, I prefer Arjan's view
that it does not; but accept that there might prove to be difficulties in
holding that position.)

So, it works now, you've exonerated today's get_user_pages, and you've
identified at least one get_user_pages fix which went in at that time: do we
really need to chase this further?

Oh, in writing of copy-on-write, I've just remembered another fix for
get_user_pages which I made in 2.6.7 (though I've not heard of anyone seeing
the problem it fixed): the call to do_wp_page in do_swap_page.
get_user_pages assumes that the write fault it generates will break
copy-on-write, i.e. will make a private copy page when necessary, before
returning to the caller; but that wasn't happening in the do_swap_page case.

By the way, please don't be worried when soon the try_to_unmap_one comment
and code that you identified above disappear. When I'm back in patch
submission mode, I'll be sending Andrew a patch which removes it, instead
reworking can_share_swap_page to rely on the page_mapcount instead of
page_count, which avoids the ironical behaviour my comment refers to, and
allows an awkward page migration case to proceed (once unpinned). Andrea and
I now both prefer this page_mapcount approach.

Hugh
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur Tabi wrote:
> I haven't gotten a reply to this question, but I've done my own research,
> and I think I found the answer. Using my own test of get_user_pages(), it
> appears that the fix was placed in 2.6.7. However, I would like to know
> specifically what the fix is. Unfortunately, tracking this stuff down is
> beyond my understanding of the Linux VM.

I'm also still waiting for a reply to this question. Anyone?

Upon doing some more research, I think the fix might be this code instead:

/*
 * Don't pull an anonymous page out from under get_user_pages.
 * GUP carefully breaks COW and raises page count (while holding
 * page_table_lock, as we have here) to make sure that the page
 * cannot be freed. If we unmap that page here, a user write
 * access to the virtual address will bring back the page, but
 * its raised count will (ironically) be taken to mean it's not
 * an exclusive swap page, do_wp_page will replace it by a copy
 * page, and the user never get to see the data GUP was holding
 * the original page for.
 */
if (PageSwapCache(page) &&
    page_count(page) != page->mapcount + 2) {
    ret = SWAP_FAIL;
    goto out_unmap;
}

Both this change and the other one I mentioned are new to 2.6.7. I suppose I
could try applying these patches to the 2.6.6 kernel and see if anything
improves, but that won't help me understand what's really going on.

The above comment almost sounds like it's a fix, but it talks about
copy-on-write, which has nothing to do with the real problem.

--
Timur Tabi
Staff Software Engineer
[EMAIL PROTECTED]
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Wed, May 04, 2005 at 01:27:54PM -0500, Timur Tabi wrote:
> Libor Michalek wrote:
> > The program opens the character device file descriptor, pins the pages,
> > and waits for a signal (sent to the process after running some other
> > program which exercises the VM) before checking the pages. On older
> > kernels the check fails; on my 2.6.11 kernel the check succeeds. So
> > mlock is not needed on top of get_user_pages() as it was before.
>
> When you say "older", what exactly do you mean? I have a different test
> that normally fails with just get_user_pages(), but it works with 2.6.9
> and above. I haven't been able to get any kernel earlier than 2.6.9 to
> compile or boot properly, so I'm having a hard time narrowing down the
> actual point when get_user_pages() started working.

The older kernel I tried was one of the 2.4.21 RHEL 3 kernels. I hadn't
spent much time investigating the issue since this was a new kernel, so it
was a natural one for me to try.

-Libor
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur Tabi wrote:
> When you say "older", what exactly do you mean? I have a different test
> that normally fails with just get_user_pages(), but it works with 2.6.9
> and above. I haven't been able to get any kernel earlier than 2.6.9 to
> compile or boot properly, so I'm having a hard time narrowing down the
> actual point when get_user_pages() started working.

I haven't gotten a reply to this question, but I've done my own research,
and I think I found the answer. Using my own test of get_user_pages(), it
appears that the fix was placed in 2.6.7. However, I would like to know
specifically what the fix is. Unfortunately, tracking this stuff down is
beyond my understanding of the Linux VM.

Assuming that the fix is in try_to_unmap_one(), the only significant change
I see between 2.6.6 and 2.6.7 is the addition of this code:

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
    goto out_unlock;

pmd = pmd_offset(pgd, address);
if (!pmd_present(*pmd))
    goto out_unlock;

pte = pte_offset_map(pmd, address);
if (!pte_present(*pte))
    goto out_unmap;

if (page_to_pfn(page) != pte_pfn(*pte))
    goto out_unmap;

Can anyone tell me if this is the actual fix, or at least a major part of
the actual fix?

--
Timur Tabi
Staff Software Engineer
[EMAIL PROTECTED]
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Wed, May 04, 2005 at 09:27:21PM -0400, Rik van Riel wrote:
> On Wed, 4 May 2005, William Jordan wrote:
> > On 5/3/05, Andy Isaacson <[EMAIL PROTECTED]> wrote:
> > > Rather than replacing the fully-registered pages with pages of zeros,
> > > you could simply unmap them.
> >
> > I don't like this option. It is nearly free to map all of the pages to
> > the zero-page. You never have to allocate a page if the user never
> > writes to it.
>
> Unmapping should work fine, as long as the VMA flags are
> set appropriately. The page fault handler can take care
> of the rest...

I think there may be a difference in terminology here. What I originally
proposed (and what I think Bill was reacting to) is the equivalent of
sys_munmap() on the range of registered pages. That has the downsides that
he mentioned; an address that was valid in the parent will now result in
SIGSEGV or SIGBUS in the child, and it's explicitly endorsed by the userland
APIs (such as MPI2) that it's valid to register stack addresses (for
example).

What I think you're proposing, Rik, is that the VMA get destroyed (or split,
if only part of it had been registered) and replaced with an anonymous one.
That's a very low-overhead way of going about it, I think. Then, as you say,
the page fault handler will automatically give a zero page to the process
when it faults on those addresses. Did I understand your suggestion
correctly?

I think I agree with Bill that having the child fault on pages which
happened to have been registered by the parent would be a bad thing.

This would, if I understand correctly, be visible in /proc/$$/maps. Which is
OK, if a little bit surprising; but the alternatives are worse.

-andy
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Wed, 4 May 2005, William Jordan wrote:
> On 5/3/05, Andy Isaacson <[EMAIL PROTECTED]> wrote:
> > Rather than replacing the fully-registered pages with pages of zeros,
> > you could simply unmap them.
>
> I don't like this option. It is nearly free to map all of the pages to
> the zero-page. You never have to allocate a page if the user never
> writes to it.

Unmapping should work fine, as long as the VMA flags are set appropriately.
The page fault handler can take care of the rest...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it." - Brian W. Kernighan
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 5/3/05, Andy Isaacson <[EMAIL PROTECTED]> wrote:
> Rather than replacing the fully-registered pages with pages of zeros,
> you could simply unmap them.

I don't like this option. It is nearly free to map all of the pages to the
zero-page. You never have to allocate a page if the user never writes to it.
But if you unmap the page, there could be issues. The memory region could be
on the stack, or malloc'ed. In these cases, the child should be able to
return from the function, or free the memory, without setting a timebomb.

--
Bill Jordan
InfiniCon Systems
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Libor Michalek wrote:
> The program opens the character device file descriptor, pins the pages,
> and waits for a signal (sent to the process after running some other
> program which exercises the VM) before checking the pages. On older
> kernels the check fails; on my 2.6.11 kernel the check succeeds. So mlock
> is not needed on top of get_user_pages() as it was before.

Libor,

When you say "older", what exactly do you mean? I have a different test that
normally fails with just get_user_pages(), but it works with 2.6.9 and
above. I haven't been able to get any kernel earlier than 2.6.9 to compile
or boot properly, so I'm having a hard time narrowing down the actual point
when get_user_pages() started working.

--
Timur Tabi
Staff Software Engineer
[EMAIL PROTECTED]
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Tue, May 03, 2005 at 11:43:25AM -0700, Andy Isaacson wrote:
> [1] You might want to allow the child to start a completely new RDMA
> context, but I don't see that as necessary.

Some people use a hybrid OpenMP+MPI model, so some MPI implementations want
to fork and have the child open a new communications context. This is fairly
rare, but unfortunately it is a checklist item for most MPI implementations,
and this will get worse over time.

-- greg
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 5/3/05, Andy Isaacson <[EMAIL PROTECTED]> wrote: > > A consistent statement would be > > After fork(2), any regions which were registered are UNDEFINED. > Region boundaries are byte-accurate; a registration can cover just > part of a page, in which case the non-registered part of the page > has normal fork COW semantics. > That is a reasonable approach. > > Obviously, calling *any* RDMA-userland-stuff in the child is completely > undefined [1]. One place where I can see a potential problem is in > atexit()-type handlers registered by the RDMA library. Since those > aren't performance-critical they can and should do sanity checks with > getpid() and/or checking with the kernel driver. > That is also reasonable. None of the RDMA libraries I have worked on bothered to use an atexit()-type hook because the user was theoretically *required* to close the RNIC, and driver code was already required to clean up in case of a total process failure. Adding an intermediate safety net for applications that exited cleanly but forgot to close just didn't seem worthwhile. If the application wants the cleanup performed optimally then it can close the RNIC; otherwise it can't complain about forcing the RNIC vendor to clean up in the driver code. > [1] You might want to allow the child to start a completely new RDMA > context, but I don't see that as necessary. > That should be allowed. It is actually more normal to use the parent as a dispatcher and to actually manage the connection in a child process.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Fri, Apr 29, 2005 at 05:31:44PM -0700, Caitlin Bestler wrote: > Attempting to provide *any* support for applications that fork children > after doing RDMA registrations is a ratshole best avoided. The general > rule that application developers should follow is to do RDMA *only* > in the child processes. I think it's unreasonable to *prohibit* fork-after-registration; for one thing, there's lots of code that forks under the covers. Setuid helpers like getpty just assume that they're going to be able to fork. Even stuff like get*by*(3) can potentially fork. And with site-configured stuff like PAM, you end up with things that work on the developer's system but break in deployment. I think it's exceedingly reasonable to say "RDMA doesn't work in children". But the child should get a sane memory image: at least zeros in fully-registered pages, and preferably copies of partially-registered pages. Differentiating between fully-registered and partially-registered pages avoids (I think) the pathological case of having to copy a GB of data just to system("/bin/ls > /tmp/tmpfile"). You can still go pathological if you've partially-registered gigabytes of address space (for example a linked list where each node is allocated with malloc and then registered) but that's a case of "Well, don't do that then". Rather than replacing the fully-registered pages with pages of zeros, you could simply unmap them. A consistent statement would be After fork(2), any regions which were registered are UNDEFINED. Region boundaries are byte-accurate; a registration can cover just part of a page, in which case the non-registered part of the page has normal fork COW semantics. Probably the most sane solution is to simply unmap the fully-registered pages at fork time, and copy any partially-registered pages. But the statement above does not require this. > Keep in mind that it is not only the memory regions that must be dealt > with, but control data invisible to the user (the QP context, etc.). 
This > data frequently is interlinked between kernel resident and user resident > data (such as a QP context has the PD ID somewhere on-chip or in > kernel, which the Send Queue ring needs to be in user memory). Having > two different user processes that both think they have the user half of > this type of split data structure is just asking for trouble, even if you > manage to get the copy on write bit timing problems all solved. Obviously, calling *any* RDMA-userland-stuff in the child is completely undefined [1]. One place where I can see a potential problem is in atexit()-type handlers registered by the RDMA library. Since those aren't performance-critical they can and should do sanity checks with getpid() and/or checking with the kernel driver. [1] You might want to allow the child to start a completely new RDMA context, but I don't see that as necessary. -andy
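Andy's byte-accurate boundary rule means fork handling has to distinguish pages a registration covers entirely from pages it only partially covers. A minimal sketch of that classification (the helper name is invented for illustration; 4 KB pages assumed):

```python
PAGE_SIZE = 4096

def classify_pages(start, length):
    """Split a byte-accurate registration [start, start + length) into
    page frame numbers it covers fully vs. only partially.

    Fully-covered pages could be unmapped (or zero-filled) in the child;
    partially-covered pages keep normal fork COW semantics for their
    non-registered bytes, so they must be copied."""
    end = start + length  # exclusive
    full, partial = [], []
    for pfn in range(start // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1):
        page_start = pfn * PAGE_SIZE
        if start <= page_start and end >= page_start + PAGE_SIZE:
            full.append(pfn)
        else:
            partial.append(pfn)
    return full, partial

# A malloc'ed buffer from 0x1800 to 0x2fff: page 1 is shared with other
# data (partial), page 2 belongs entirely to the registration (full).
print(classify_pages(0x1800, 0x1800))  # -> ([2], [1])
```

The pathological copy case Andy mentions is simply a registration set whose partial list is large.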
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 4/29/05, Libor Michalek <[EMAIL PROTECTED]> wrote: > > However, you have a potential problem with registered buffers that > do not begin or end on a page boundary, which is common with malloc. > If the buffer resides on a portion of a page, and you mark the vma > which contains that entire page VM_DONTCOPY, to ensure that the parent > has access to the exact physical page after the fork, the child will > not be able to access anything on that entire page. So if the child > expects to access data on the same page that happens to contain the > registered buffer it will get a segment violation. > > The four situations we've discussed are: > > 1) Physical page does not get used for anything else. > 2) Process's virtual to physical mapping remains fixed. > 3) Same virtual to physical mapping after forking a child. > 4) Forked child has access to all non-registered memory of > the parent. > > The first two are now taken care of with get_user_pages (we used to > use VM_LOCKED for the second case), the third case is handled by setting > the vma to VM_DONTCOPY, and on the fourth case we've always punted, > but the real answer is to break partial pages into separate vmas and > mark them ALWAYS_COPY. > > -Libor > > Attempting to provide *any* support for applications that fork children after doing RDMA registrations is a ratshole best avoided. The general rule that application developers should follow is to do RDMA *only* in the child processes. Keep in mind that it is not only the memory regions that must be dealt with, but control data invisible to the user (the QP context, etc.). This data frequently is interlinked between kernel resident and user resident data (such as a QP context has the PD ID somewhere on-chip or in kernel, which the Send Queue ring needs to be in user memory). 
Having two different user processes that both think they have the user half of this type of split data structure is just asking for trouble, even if you manage to get the copy on write bit timing problems all solved. All of this can be avoided by a simple rule: don't fork after opening an RDMA device.
Re: RDMA memory registration (was: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation)
On 4/29/05, Roland Dreier <[EMAIL PROTECTED]> wrote: > b) (maybe someday?) Add a VM_ALWAYSCOPY flag and extend mprotect() > with PROT_ALWAYSCOPY so processes can mark pages to be > pre-copied into child processes, to handle the case where only > half a page is registered. Are you suggesting making the partial pages their own VMA, or marking the entire buffer with this flag? I originally thought the entire buffer should be copy-on-fork (instead of copy-on-write), and I believe this is the path Mellanox was pursuing with the VM_NO_COW flag. However, if applications are registering gigabytes of RAM, it would be very bad to have the entire area copied on fork. On the other hand, I've always wondered about the choice to leave holes in the child process's address space. I would have chosen to map the zero page instead. -- Bill Jordan InfiniCon Systems
Re: RDMA memory registration (was: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation)
On Fri, Apr 29, 2005 at 09:45:50AM -0700, Roland Dreier wrote: > Is there anything wrong with the following plan? > > 1) For memory registration, use get_user_pages() in the kernel. Use >locked_vm and RLIMIT_MEMLOCK to limit the amount of memory pinned >by a given process. One disadvantage of this is that the >accounting will overestimate the amount of pinned memory if a >process pins the same page twice, but this doesn't seem that bad to >me -- it errs on the side of safety. I think the overestimate will be fine in practice. If a process is locking a lot of memory it will most likely be in big chunks, so not much page overlap there. If the process is locking lots of tiny buffers with lots of page overlap, the total locked amount will most likely be small. Although it is odd that you could end up with a total locked amount larger than the number of physical pages in the system... > 2) For fork() support: > >a) Extend mprotect() with PROT_DONTCOPY so processes can avoid > copy-on-write problems. > >b) (maybe someday?) Add a VM_ALWAYSCOPY flag and extend mprotect() > with PROT_ALWAYSCOPY so processes can mark pages to be > pre-copied into child processes, to handle the case where only > half a page is registered. > > I believe this puts the code that must be trusted into the kernel and > gives userspace primitives that let apps handle the rest. I'm assuming that for libibverbs memory registration you plan on hiding the mprotect in the library? Without reference counting at the kernel level this could yield unexpected results in a perfectly legitimate app. For example, the app may be managing a buffer it will pass to another device but also want to move data in/out with RDMA hardware: the user marks the buffer DONTCOPY themselves, registers it with libibverbs, performs I/O, and unregisters with libibverbs. At this point the user expects the buffer to still have DONTCOPY set, but it does not, because of the unregister... Not that it's likely, but it's a valid thing to do. 
However, since I don't have a better suggestion, I'm in favour of using mprotect as you outlined. -Libor
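The surprise Libor describes is easy to reproduce in a toy model where the library toggles a DONTCOPY bit with no reference counting (a sketch only; the names are invented and this is not the actual libibverbs API):

```python
class Vma:
    """Stand-in for a process VMA carrying a single DONTCOPY flag."""
    def __init__(self):
        self.dontcopy = False

def register_mr(vma):
    vma.dontcopy = True     # library hides the mprotect(PROT_DONTCOPY)

def unregister_mr(vma):
    vma.dontcopy = False    # ...and blindly clears it on unregister

vma = Vma()
vma.dontcopy = True         # the user set DONTCOPY themselves first
register_mr(vma)
unregister_mr(vma)
print(vma.dontcopy)         # -> False: the user's own setting was lost
```

With kernel-level reference counting, the unregister would decrement a count and only clear the flag when it reached zero.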
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Fri, Apr 29, 2005 at 08:56:20AM -0700, Caitlin Bestler wrote: > On 4/29/05, Bill Jordan <[EMAIL PROTECTED]> wrote: > > On 4/26/05, Andrew Morton <[EMAIL PROTECTED]> wrote: > > > > > Our point is that contemporary microprocessors cannot electrically > > > do what you want them to do! > > > > > > Now, conceivably the kernel could keep track of the state of the > > > pages down to the byte level, and could keep track of all COWed pages and > > > could look at faulting addresses at the byte level and could copy sub-page > > > ranges by hand from one process's address space into another process's > > > after I/O completion. I don't think we want to do that. > > > > > > Methinks your specification is busted. > > > > I agree in principle. However, I expect this issue will come up with > > more and more new specifications, and if it isn't addressed once in > > the linux kernel, it will be kludged and broken many times in many > > drivers. > > > > I believe we need a kernel-level interface that will pin user pages, > > and lock the user vma in a single step. The interface should be used > > by drivers when the hardware mappings are done. If the process is > > split into a user operation to lock the memory, and a driver operation > > to map the hardware, there will always be opportunity for abuse. > > > > Reference counting needs to be done by this interface to allow > > different hardware to interoperate. > > > > The interface can't overload the VM_LOCKED flag, or rely on any other > > attributes that the user can tinker with via any other interface. > > > > And as much as I hate to admit it, I think on a fork, we will need to > > copy parts of pages at the beginning or end of user I/O buffers. > > > > I agree with all but the last part, in my opinion there is no need to deal > with fork issues as long as solutions do not result in failures. There is > *no* basis for a child process to expect that it will inherit RDMA resources. 
> A child process that uses such resources will get undefined results, nothing > further needs to be stated, and no heroic efforts are required to avoid them. However, you have a potential problem with registered buffers that do not begin or end on a page boundary, which is common with malloc. If the buffer resides on a portion of a page, and you mark the vma which contains that entire page VM_DONTCOPY, to ensure that the parent has access to the exact physical page after the fork, the child will not be able to access anything on that entire page. So if the child expects to access data on the same page that happens to contain the registered buffer it will get a segment violation. The four situations we've discussed are: 1) Physical page does not get used for anything else. 2) Process's virtual to physical mapping remains fixed. 3) Same virtual to physical mapping after forking a child. 4) Forked child has access to all non-registered memory of the parent. The first two are now taken care of with get_user_pages (we used to use VM_LOCKED for the second case), the third case is handled by setting the vma to VM_DONTCOPY, and on the fourth case we've always punted, but the real answer is to break partial pages into separate vmas and mark them ALWAYS_COPY. -Libor
RDMA memory registration (was: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation)
Is there anything wrong with the following plan? 1) For memory registration, use get_user_pages() in the kernel. Use locked_vm and RLIMIT_MEMLOCK to limit the amount of memory pinned by a given process. One disadvantage of this is that the accounting will overestimate the amount of pinned memory if a process pins the same page twice, but this doesn't seem that bad to me -- it errs on the side of safety. 2) For fork() support: a) Extend mprotect() with PROT_DONTCOPY so processes can avoid copy-on-write problems. b) (maybe someday?) Add a VM_ALWAYSCOPY flag and extend mprotect() with PROT_ALWAYSCOPY so processes can mark pages to be pre-copied into child processes, to handle the case where only half a page is registered. I believe this puts the code that must be trusted into the kernel and gives userspace primitives that let apps handle the rest. - R.
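The double-count overestimate in step 1 can be seen with a toy model of the accounting (a sketch only; the class name is made up, and a page-denominated RLIMIT_MEMLOCK with 4 KB pages is assumed):

```python
PAGE_SIZE = 4096

class PinAccounting:
    """Toy model of locked_vm accounting against RLIMIT_MEMLOCK: every
    registration is charged for each page it touches, so pinning the
    same page twice is charged twice -- erring on the side of safety."""
    def __init__(self, rlimit_pages):
        self.rlimit_pages = rlimit_pages
        self.locked_vm = 0

    def register(self, start, length):
        first = start // PAGE_SIZE
        last = (start + length - 1) // PAGE_SIZE
        npages = last - first + 1
        if self.locked_vm + npages > self.rlimit_pages:
            raise MemoryError("would exceed RLIMIT_MEMLOCK")
        self.locked_vm += npages
        return npages

    def unregister(self, npages):
        self.locked_vm -= npages

acct = PinAccounting(rlimit_pages=4)
acct.register(0x0000, PAGE_SIZE)   # pins page 0
acct.register(0x0000, PAGE_SIZE)   # same physical page, charged again
print(acct.locked_vm)              # -> 2, though only 1 page is pinned
```

This is how the total "locked" amount can exceed physical memory, as Libor points out in his reply.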
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 4/29/05, Bill Jordan <[EMAIL PROTECTED]> wrote: > On 4/26/05, Andrew Morton <[EMAIL PROTECTED]> wrote: > > > Our point is that contemporary microprocessors cannot electrically do what > > you want them to do! > > > > Now, conceivably the kernel could keep track of the state of the > > pages down to the byte level, and could keep track of all COWed pages and > > could look at faulting addresses at the byte level and could copy sub-page > > ranges by hand from one process's address space into another process's > > after I/O completion. I don't think we want to do that. > > > > Methinks your specification is busted. > > I agree in principle. However, I expect this issue will come up with > more and more new specifications, and if it isn't addressed once in > the linux kernel, it will be kludged and broken many times in many > drivers. > > I believe we need a kernel-level interface that will pin user pages, > and lock the user vma in a single step. The interface should be used > by drivers when the hardware mappings are done. If the process is > split into a user operation to lock the memory, and a driver operation > to map the hardware, there will always be opportunity for abuse. > > Reference counting needs to be done by this interface to allow > different hardware to interoperate. > > The interface can't overload the VM_LOCKED flag, or rely on any other > attributes that the user can tinker with via any other interface. > > And as much as I hate to admit it, I think on a fork, we will need to > copy parts of pages at the beginning or end of user I/O buffers. > I agree with all but the last part, in my opinion there is no need to deal with fork issues as long as solutions do not result in failures. There is *no* basis for a child process to expect that it will inherit RDMA resources. A child process that uses such resources will get undefined results, nothing further needs to be stated, and no heroic efforts are required to avoid them. 
What is definitely needed is kernel counting of locks on user pages. Finer granularity is not expected; it is the RDMA hardware that works at finer granularity. All it needs is to know what bus address a given virtual page maps to -- and it needs to know that said mapping will not change without advance notice. Further, any revocation of an existing mapping (to deal with hot page swapping or whatever) cannot expect the RDMA hardware to respond any faster than it would to invalidating a memory region. The RDMA hardware has an inherent need to cache translations. That is why it cannot guarantee that it will cease updating a memory region the nanosecond that a request is made to invalidate an STag. Instead it is allowed to block on such a request and only guarantees to have ceased access when the invalidate request completes. The same need for a delay exists for any interface that moves memory around, or requests to reclaim memory from the application. This also applies on process death. The hardware cannot stop on a dime. The best it can do is stop promptly, and give an unambiguous indication to the OS as to when it has stopped.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 4/26/05, Andrew Morton <[EMAIL PROTECTED]> wrote: > Our point is that contemporary microprocessors cannot electrically do what > you want them to do! > > Now, conceivably the kernel could keep track of the state of the > pages down to the byte level, and could keep track of all COWed pages and > could look at faulting addresses at the byte level and could copy sub-page > ranges by hand from one process's address space into another process's > after I/O completion. I don't think we want to do that. > > Methinks your specification is busted. I agree in principle. However, I expect this issue will come up with more and more new specifications, and if it isn't addressed once in the linux kernel, it will be kludged and broken many times in many drivers. I believe we need a kernel-level interface that will pin user pages, and lock the user vma in a single step. The interface should be used by drivers when the hardware mappings are done. If the process is split into a user operation to lock the memory, and a driver operation to map the hardware, there will always be opportunity for abuse. Reference counting needs to be done by this interface to allow different hardware to interoperate. The interface can't overload the VM_LOCKED flag, or rely on any other attributes that the user can tinker with via any other interface. And as much as I hate to admit it, I think on a fork, we will need to copy parts of pages at the beginning or end of user I/O buffers. -- Bill Jordan InfiniCon Systems
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Arjan van de Ven <[EMAIL PROTECTED]> wrote: > > > Why do you call mlock() and get_user_pages()? In our code, we only call > > mlock(), and the > > memory is pinned. > > this is a myth; linux is free to move the page about in physical memory > even if it's mlock()ed!! eh? I guess the kernel _is_ free to move the page about, but it doesn't. We might do so at some time in the future for memory hotplug, I guess.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew> The kernel can simply register and unregister ranges for Andrew> RDMA. So effectively a particular page is in either the Andrew> registered or unregistered state. Kernel accounting Andrew> counts the number of registered pages and compares this Andrew> with rlimits. Andrew> On top of all that, your userspace library needs to keep Andrew> track of when pages should really be registered and Andrew> unregistered with the kernel. Using overlap logic and Andrew> per-page refcounting or whatever. This is OK as long as userspace is trusted. However I don't see how this works when we don't trust userspace. The problem is that for an RDMA device (IB HCA or iWARP RNIC), a process can create many memory regions, each of which is a separate virtual-to-physical translation map. For example, an app can do: a) register 0x through 0x and get memory handle 1 b) register 0x through 0x and get memory handle 2 c) use memory handle 1 for communication with remote app A d) use memory handle 2 for communication with remote app B Even though memory handles 1 and 2 both refer to exactly the same memory, they may have different lifetimes, might be attached to different connections, and so on. Clearly the memory at 0x must stay pinned as long as the RDMA device thinks either memory handle 1 or memory handle 2 is valid. Furthermore, the kernel must be the one keeping track of how many regions refer to a given page because we can't allow userspace to be able to tell a device to go DMA to memory it doesn't own any more. Creation and destruction of these memory handles will always go through the kernel driver, so this isn't so bad. And get_user_pages() is almost exactly what we need: it stacks perfectly, since it operates on the page_count rather than just setting a bit in vm_flags. The main problem is that it doesn't check against RLIMIT_MEMLOCK. 
The most reasonable thing to do would seem to be having the IB kernel memory region code update current->mm->locked_vm and check it against RLIMIT_MEMLOCK. I guess it would be good to figure out an appropriate abstraction to export rather than monkeying with current->mm directly. We could also put this directly in get_user_pages(), but I'd be worried about messing with current users. I just don't see a way to make VM_KERNEL_LOCKED work. It would also be nice to have a way for apps to set VM_DONTCOPY appropriately. Christoph's suggestion of extending mmap() and mprotect() with PROT_DONTCOPY seems good to me, especially since it means we don't have to export do_mlock() functionality to modules. - R.
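Roland's two-handle example can be modeled with per-page reference counts, in the style of get_user_pages() stacking on page_count (an illustrative sketch; the class name and addresses are invented):

```python
from collections import Counter

PAGE_SIZE = 4096

class PinnedPages:
    """Per-page pin counts: a page stays pinned while *any* memory
    handle still covers it, mirroring how get_user_pages() stacks by
    raising page_count instead of setting a vm_flags bit."""
    def __init__(self):
        self.refs = Counter()

    def register(self, start, length):
        pages = list(range(start // PAGE_SIZE,
                           (start + length - 1) // PAGE_SIZE + 1))
        for pfn in pages:
            self.refs[pfn] += 1      # get_user_pages()-style get
        return pages

    def unregister(self, pages):
        for pfn in pages:
            self.refs[pfn] -= 1      # put_page()-style put

    def pinned(self, pfn):
        return self.refs[pfn] > 0

pins = PinnedPages()
h1 = pins.register(0x0000, 0x3000)   # memory handle 1, pages 0-2
h2 = pins.register(0x1000, 0x1000)   # memory handle 2, page 1 only
pins.unregister(h1)                  # destroy handle 1
print(pins.pinned(1))                # -> True: handle 2 still covers it
```

Only once the last handle covering a page is destroyed does its count drop to zero and the page become unpinnable, which is exactly the property a VM_KERNEL_LOCKED flag cannot express.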
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 4/26/05, Andrew Morton <[EMAIL PROTECTED]> wrote: > > > > However I don't see how to make it work if I put the reference > > counting for overlapping regions in userspace but when I want mlock() > > accounting in the kernel. If a buggy/malicious app does: > > > > a) register from 0x to 0x2fff > > b) register from 0x1000 to 0x1fff > > c) unregister from 0x to 0x2fff > > As far as the kernel is concerned, step b) should be a no-op. (The kernel > might choose to split the vma, but that's not significant). > If "register" and "unregister" are meant in the RDMA sense then the above sequence is totally reasonable. The b) registration could be for a different protection domain that did not require access to all of the larger region. Unless a full counting lock is available from the kernel, the responsibility of the collective RDMA components would be to a) pin 0x to 0x2fff, b) nothing, c) unpin 0x000 to 0x0fff and 0x2000 to 0x2fff
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 4/26/05, Andrew Morton <[EMAIL PROTECTED]> wrote: > Roland Dreier <[EMAIL PROTECTED]> wrote: > > > > Libor> Do you mean that the set/clear parameters to do_mlock() > > Libor> are the actual flags which are set/cleared by the caller? > > Libor> Also, the issue remains that the flags are not reference > > Libor> counted which is a problem if you are dealing with > > Libor> overlapping memory region, or even if one region ends and > > Libor> another begins on the same page. Since the desire is to be > > Libor> able to pin any memory that a user can malloc this is a > > Libor> likely scenario. > > > > Good point... we need to figure out how to handle: > > > > a) app registers 0x through 0x17ff > > b) app registers 0x1800 through 0x2fff > > c) app unregisters 0x through 0x17ff > > d) the page at 0x1000 must stay pinned > > The userspace library should be able to track the tree and the overlaps, > etc. Things might become interesting when the memory is MAP_SHARED > pagecache and multiple independent processes are involved, although I guess > that'd work OK. > > But afaict the problem wherein part of a page needs VM_DONTCOPY and the > other part does not cannot be solved. > Which portion of the userspace library? HCA-dependent code, or common code? The HCA-dependent code would fail to count when the same memory was registered to different HCAs (for example to the internal network device and the external network device). The vendor-independent code *could* do it, but only by maintaining a complete list of all registrations that had been issued but not cancelled. That data would be redundant with data kept at the verb layer, and by the kernel. It *would* work, but maintaining highly redundant data at multiple layers is something that I generally try to avoid.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Roland Dreier <[EMAIL PROTECTED]> wrote: > > Andrew> Well I was vaguely proposing that the userspace library > Andrew> keep track of the byteranges and the underlying page > Andrew> states. So in the above scenario userspace would leave > Andrew> the page at 0x1000 registered until all registrations > Andrew> against that page have been undone. > > OK, I already have code in userspace that keeps reference counts for > overlapping regions, etc. However I'm not sure how to tie this in > with reliable accounting of pinned memory -- we don't want malicious > userspace code to be able fool the accounting, right? > > So I'm still trying to puzzle out what to do. I don't want to keep a > complicated data structure in the kernel keeping track of what memory > has been registered. Right now, I just keep a list of structs, one > for each region, and when a process dies, I just go through region by > region and do a put_page() to balance off the get_user_pages(). > > However I don't see how to make it work if I put the reference > counting for overlapping regions in userspace but when I want mlock() > accounting in the kernel. If a buggy/malicious app does: > > a) register from 0x to 0x2fff > b) register from 0x1000 to 0x1fff > c) unregister from 0x to 0x2fff As far as the kernel is concerned, step b) should be a no-op. (The kernel might choose to split the vma, but that's not significant). > then it seems the kernel is screwed unless it counts how many times a > vma has been pinned. And adding a pin_count member to vm_struct seems > like a pretty damn major step. > > We definitely have to make sure that userspace is never able to either > unpin a page that is still registered with RDMA hardware, because that > can lead to DMA into memory that someone else owns. 
On the other > hand, we don't want userspace to be able to defeat resource accounting > by tricking the kernel into keeping page_count elevated after it > credits the memory back to a process's limit on locked pages. The kernel can simply register and unregister ranges for RDMA. So effectively a particular page is in either the registered or unregistered state. Kernel accounting counts the number of registered pages and compares this with rlimits. On top of all that, your userspace library needs to keep track of when pages should really be registered and unregistered with the kernel. Using overlap logic and per-page refcounting or whatever. No?
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew> Well I was vaguely proposing that the userspace library Andrew> keep track of the byteranges and the underlying page Andrew> states. So in the above scenario userspace would leave Andrew> the page at 0x1000 registered until all registrations Andrew> against that page have been undone. OK, I already have code in userspace that keeps reference counts for overlapping regions, etc. However I'm not sure how to tie this in with reliable accounting of pinned memory -- we don't want malicious userspace code to be able fool the accounting, right? So I'm still trying to puzzle out what to do. I don't want to keep a complicated data structure in the kernel keeping track of what memory has been registered. Right now, I just keep a list of structs, one for each region, and when a process dies, I just go through region by region and do a put_page() to balance off the get_user_pages(). However I don't see how to make it work if I put the reference counting for overlapping regions in userspace but when I want mlock() accounting in the kernel. If a buggy/malicious app does: a) register from 0x to 0x2fff b) register from 0x1000 to 0x1fff c) unregister from 0x to 0x2fff then it seems the kernel is screwed unless it counts how many times a vma has been pinned. And adding a pin_count member to vm_struct seems like a pretty damn major step. We definitely have to make sure that userspace is never able to either unpin a page that is still registered with RDMA hardware, because that can lead to DMA into memory that someone else owns. On the other hand, we don't want userspace to be able to defeat resource accounting by tricking the kernel into keeping page_count elevated after it credits the memory back to a process's limit on locked pages. The limit on the number of locked pages seems like a natural thing to check against, but perhaps we need a different limit for the number of pages pinned for use by RDMA hardware. 
Sort of the same way that there's a separate limit on the number of in-flight aios. - R. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
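The userspace reference counting Roland describes can be modeled at page granularity: keep a per-page pin count, and only cross into the kernel (get_user_pages()/put_page()) on the transitions between zero and nonzero. This is a minimal illustrative sketch, not code from any posted patch; the function name, the 16-page table, and the example addresses are all invented for the example:

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define NPAGES     16

/* Hypothetical per-page pin counts covering a small address range. */
static unsigned int pin_count[NPAGES];

/*
 * Adjust the pin count for every page touched by [start, start+len).
 * Returns the number of pages whose pin state actually changed;
 * only those would be forwarded to get_user_pages()/put_page().
 */
static int adjust_range(unsigned long start, unsigned long len, int pin)
{
    unsigned long first = start >> PAGE_SHIFT;
    unsigned long last  = (start + len - 1) >> PAGE_SHIFT;
    int transitions = 0;
    unsigned long p;

    for (p = first; p <= last; p++) {
        if (pin) {
            if (pin_count[p]++ == 0)
                transitions++;          /* unpinned -> pinned */
        } else {
            assert(pin_count[p] > 0);   /* unbalanced unregister */
            if (--pin_count[p] == 0)
                transitions++;          /* pinned -> unpinned */
        }
    }
    return transitions;
}
```

With this bookkeeping, registering 0x0000-0x17ff, then 0x1800-0x2fff, then unregistering the first range leaves the shared page at 0x1000 pinned, which is exactly the overlap case from the a)/b)/c)/d) scenario in this thread.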
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur Tabi <[EMAIL PROTECTED]> wrote: > > Roland Dreier wrote: > > > Yes, I agree. If an app wants to register half a page and pass the > > other half to a child process, I think the only answer is "don't do > > that then." > > How can the app know that, though? It would have to allocate I/O buffers > with knowledge > of page boundaries. Today, the apps just malloc() a bunch of memory and pay > no attention > to whether the beginning or the end of the buffer shares a page with some > other, unrelated > object. We may as well tell the app that it needs to page-align all I/O > buffers. > > My point is that we can't just simply say, "Don't do that". Some entity > (the kernel, > libraries, whatever) should be able to tell the app that its usage of memory > is going to > break in some unpredictable way. Our point is that contemporary microprocessors cannot electrically do what you want them to do! Now, conceivably the kernel could keep track of the state of the pages down to the byte level, and could keep track of all COWed pages and could look at faulting addresses at the byte level and could copy sub-page ranges by hand from one process's address space into another process's after I/O completion. I don't think we want to do that. Methinks your specification is busted. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Roland Dreier <[EMAIL PROTECTED]> wrote: > > Roland> a) app registers 0x through 0x17ff > Roland> b) app registers 0x1800 through 0x2fff > Roland> c) app unregisters 0x through 0x17ff > Roland> d) the page at 0x1000 must stay pinned > > Andrew> The userspace library should be able to track the tree and > Andrew> the overlaps, etc. Things might become interesting when > Andrew> the memory is MAP_SHARED pagecache and multiple > Andrew> independent processes are involved, although I guess > Andrew> that'd work OK. > > I used to think I knew how to handle this, but in your scheme where > the kernel is doing accounting for pinned memory by marking vmas with > VM_KERNEL_LOCKED, at step c), I don't see why the kernel won't unlock > vmas covering 0x through 0x1fff and credit 8K back to the > process's pinning count. > > Sorry to be so dense but can you spell out what you think should > happen at steps a), b) and c) above? Well I was vaguely proposing that the userspace library keep track of the byteranges and the underlying page states. So in the above scenario userspace would leave the page at 0x1000 registered until all registrations against that page have been undone. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Roland Dreier wrote: Yes, I agree. If an app wants to register half a page and pass the other half to a child process, I think the only answer is "don't do that then." How can the app know that, though? It would have to allocate I/O buffers with knowledge of page boundaries. Today, the apps just malloc() a bunch of memory and pay no attention to whether the beginning or the end of the buffer shares a page with some other, unrelated object. We may as well tell the app that it needs to page-align all I/O buffers. My point is that we can't just simply say, "Don't do that". Some entity (the kernel, libraries, whatever) should be able to tell the app that its usage of memory is going to break in some unpredictable way. -- Timur Tabi Staff Software Engineer [EMAIL PROTECTED] One thing a Southern boy will never say is, "I don't think duct tape will fix it." -- Ed Smylie, NASA engineer for Apollo 13 ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Roland> a) app registers 0x through 0x17ff Roland> b) app registers 0x1800 through 0x2fff Roland> c) app unregisters 0x through 0x17ff Roland> d) the page at 0x1000 must stay pinned Andrew> The userspace library should be able to track the tree and Andrew> the overlaps, etc. Things might become interesting when Andrew> the memory is MAP_SHARED pagecache and multiple Andrew> independent processes are involved, although I guess Andrew> that'd work OK. I used to think I knew how to handle this, but in your scheme where the kernel is doing accounting for pinned memory by marking vmas with VM_KERNEL_LOCKED, at step c), I don't see why the kernel won't unlock vmas covering 0x through 0x1fff and credit 8K back to the process's pinning count. Sorry to be so dense but can you spell out what you think should happen at steps a), b) and c) above? Andrew> But afaict the problem wherein part of a page needs Andrew> VM_DONTCOPY and the other part does not cannot be solved. Yes, I agree. If an app wants to register half a page and pass the other half to a child process, I think the only answer is "don't do that then." - R. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 4/26/05, Andrew Morton <[EMAIL PROTECTED]> wrote: > But afaict the problem wherein part of a page needs VM_DONTCOPY and the > other part does not cannot be solved. There may be an opportunity to create a solution where we can mark the page as "copy on fork" so the child has a page with a copy of the contents (at the time of the fork) instead of marking the page copy-on-write. -- Bill Jordan InfiniCon Systems ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Roland Dreier <[EMAIL PROTECTED]> wrote: > > Libor> Do you mean that the set/clear parameters to do_mlock() > Libor> are the actual flags which are set/cleared by the caller? > Libor> Also, the issue remains that the flags are not reference > Libor> counted which is a problem if you are dealing with > Libor> overlapping memory region, or even if one region ends and > Libor> another begins on the same page. Since the desire is to be > Libor> able to pin any memory that a user can malloc this is a > Libor> likely scenario. > > Good point... we need to figure out how to handle: > > a) app registers 0x through 0x17ff > b) app registers 0x1800 through 0x2fff > c) app unregisters 0x through 0x17ff > d) the page at 0x1000 must stay pinned The userspace library should be able to track the tree and the overlaps, etc. Things might become interesting when the memory is MAP_SHARED pagecache and multiple independent processes are involved, although I guess that'd work OK. But afaict the problem wherein part of a page needs VM_DONTCOPY and the other part does not cannot be solved. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Libor> Do you mean that the set/clear parameters to do_mlock() Libor> are the actual flags which are set/cleared by the caller? Libor> Also, the issue remains that the flags are not reference Libor> counted which is a problem if you are dealing with Libor> overlapping memory region, or even if one region ends and Libor> another begins on the same page. Since the desire is to be Libor> able to pin any memory that a user can malloc this is a Libor> likely scenario. Good point... we need to figure out how to handle: a) app registers 0x through 0x17ff b) app registers 0x1800 through 0x2fff c) app unregisters 0x through 0x17ff d) the page at 0x1000 must stay pinned hmm... - R. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Tue, Apr 26, 2005 at 08:31:32AM -0700, Roland Dreier wrote: > Andrew> umm, how about we > > Andrew> - force the special pages into a separate vma > > Andrew> - run get_user_pages() against it all > > Andrew> - use RLIMIT_MEMLOCK accounting to check whether the user > Andrew> is allowed to do this thing > > Andrew> - undo the RLIMIT_MEMLOCK accounting in ->release > > Andrew> This will all interact with user-initiated mlock/munlock > Andrew> in messy ways. Maybe a new kernel-internal vma->vm_flag > Andrew> which works like VM_LOCKED but is unaffected by > Andrew> mlock/munlock activity is needed. > > Andrew> A bit of generalisation in do_mlock() should suit? > > Yes, it seems that modifying do_mlock() to something like > > int do_mlock(unsigned long start, size_t len, >unsigned int set, unsigned int clear) > > and then exporting a function along the lines of > > int do_mem_pin(unsigned long start, size_t len, int on) > > that sets/clears (VM_LOCKED_KERNEL | VM_DONTCOPY) according to the on > flag. Do you mean that the set/clear parameters to do_mlock() are the actual flags which are set/cleared by the caller? Also, the issue remains that the flags are not reference counted, which is a problem if you are dealing with overlapping memory regions, or even if one region ends and another begins on the same page. Since the desire is to be able to pin any memory that a user can malloc this is a likely scenario. -Libor ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew> umm, how about we Andrew> - force the special pages into a separate vma Andrew> - run get_user_pages() against it all Andrew> - use RLIMIT_MEMLOCK accounting to check whether the user Andrew> is allowed to do this thing Andrew> - undo the RLIMIT_MEMLOCK accounting in ->release Andrew> This will all interact with user-initiated mlock/munlock Andrew> in messy ways. Maybe a new kernel-internal vma->vm_flag Andrew> which works like VM_LOCKED but is unaffected by Andrew> mlock/munlock activity is needed. Andrew> A bit of generalisation in do_mlock() should suit? Yes, it seems that modifying do_mlock() to something like int do_mlock(unsigned long start, size_t len, unsigned int set, unsigned int clear) and then exporting a function along the lines of int do_mem_pin(unsigned long start, size_t len, int on) that sets/clears (VM_LOCKED_KERNEL | VM_DONTCOPY) according to the on flag. Seem reasonable? If so I can code this up. - R. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
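The set/clear generalization Roland proposes can be modeled in userspace as plain bit-mask arithmetic over a vma's flag word. This is only an illustrative sketch of the semantics: VM_LOCKED and VM_DONTCOPY are real vm_flags, but the bit values below and the VM_LOCKED_KERNEL flag itself are assumptions from this thread, not merged kernel code:

```c
#include <assert.h>

/* Flag bits; the numeric values are invented for illustration. */
#define VM_LOCKED        0x0001UL   /* user-visible mlock() bit */
#define VM_LOCKED_KERNEL 0x0002UL   /* proposed kernel-only lock bit */
#define VM_DONTCOPY      0x0004UL   /* don't duplicate vma on fork */

/* Core of the generalized do_mlock(): apply set and clear masks. */
static unsigned long apply_vm_flags(unsigned long vm_flags,
                                    unsigned long set, unsigned long clear)
{
    return (vm_flags & ~clear) | set;
}

/*
 * do_mem_pin() analogue: pinning sets the kernel-lock bits, unpinning
 * clears them, and neither touches the user-controlled VM_LOCKED bit,
 * so a malicious munlock() cannot undo a kernel-side pin.
 */
static unsigned long mem_pin(unsigned long vm_flags, int on)
{
    unsigned long mask = VM_LOCKED_KERNEL | VM_DONTCOPY;

    return on ? apply_vm_flags(vm_flags, mask, 0)
              : apply_vm_flags(vm_flags, 0, mask);
}
```

The point of the separate mask is visible in the model: mlock()/munlock() would only ever pass VM_LOCKED in set/clear, while the driver path passes VM_LOCKED_KERNEL | VM_DONTCOPY, so the two users cannot clobber each other's state.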
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Christoph Hellwig wrote: What doesn't work with that design are the braindead designed-by-committee APIs in the RDMA world - but I don't think we should care about them too much. I think you should. The whole point behind RDMA is that these APIs exist and are being used by real-world applications. You can't just ignore them because they're inconvenient. If you're not willing to cater to these APIs' needs, then you may as well tell all the RDMA developers to forget about Linux and port everything to Windows instead. The APIs are here to stay, and the whole point behind this thread is to discuss how Linux can support them. -- Timur Tabi Staff Software Engineer [EMAIL PROTECTED] One thing a Southern boy will never say is, "I don't think duct tape will fix it." -- Ed Smylie, NASA engineer for Apollo 13 ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew Morton wrote: Precisely. That's why I suggested that we have an alternative vma->vm_flag bit which behaves in a similar manner to VM_LOCKED (say, VM_LOCKED_KERNEL), only userspace cannot alter it. How about calling it VM_PINNED? That way, we can define Locked - won't be swapped to disk, but can be moved around in memory Pinned - won't be swapped to disk or moved around in memory ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 4/25/05, Christoph Hellwig <[EMAIL PROTECTED]> wrote: > On Mon, Apr 25, 2005 at 05:02:36PM -0700, Roland Dreier wrote: > > The idea is that applications manage the lifetime of pinned memory > > regions. They can do things like post multiple I/O operations without > > any page-walking overhead, or pass a buffer descriptor to a remote > > host who will send data at some indeterminate time in the future. In > > addition, InfiniBand has the notion of atomic operations, so a cluster > > application may be using some memory region to implement a global lock. > > > > This might not be the most kernel-friendly design but it is pretty > > deeply ingrained in the design of RDMA transports like InfiniBand and > > iWARP (RDMA over IP). > > Actually, no it isn't. All these transports would work just fine with > the mmap a character device to hand out memory from the kernel approach > I told you to use multiple times and Andrew mentioned in this thread as well. > What doesn't work with that design are the braindead designed-by-committee > APIs in the RDMA world - but I don't think we should care about them too > much. > RDMA registers and uses the memory the user specifies. That is why byte granularity and multiple redundant registrations are explicitly specified. The mechanism by which this requirement is implemented is of course OS dependent. But the requirements are that the application specifies what portion of their memory they want registered (or what set of physical pages if they have sufficient privilege) and that request is either honored or refused by a resource manager (one preferably as integrated with general OS resource management as possible). The other aspect is that remotely enabled memory regions and memory windows must be enabled for hardware access for the duration of the region or window -- indefinitely until process death or explicit termination by the application layer.
Theoretically there is nothing in the wire protocols that requires source buffers to be pinned indefinitely, but that is the only way any RDMA interface has ever worked -- so "brain death" must be pretty widespread. The fact that this problem must be solved for remotely accessible buffers, and that for cluster applications like MPI there is no distinction between buffers used for inbound messages and outbound messages, might have something to do with this. User verbs needs to deal with these actual Memory Registration requirements, including the very real application need for Memory Windows. The solution should map to existing OS controls as much as possible. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Mon, Apr 25, 2005 at 05:02:36PM -0700, Roland Dreier wrote: > The idea is that applications manage the lifetime of pinned memory > regions. They can do things like post multiple I/O operations without > any page-walking overhead, or pass a buffer descriptor to a remote > host who will send data at some indeterminate time in the future. In > addition, InfiniBand has the notion of atomic operations, so a cluster > application may be using some memory region to implement a global lock. > > This might not be the most kernel-friendly design but it is pretty > deeply ingrained in the design of RDMA transports like InfiniBand and > iWARP (RDMA over IP). Actually, no it isn't. All these transports would work just fine with the mmap a character device to hand out memory from the kernel approach I told you to use multiple times and Andrew mentioned in this thread as well. What doesn't work with that design are the braindead designed-by-committee APIs in the RDMA world - but I don't think we should care about them too much. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur Tabi <[EMAIL PROTECTED]> wrote: > > Andrew Morton wrote: > > > I'm referring to an application which uses your syscalls to obtain pinned > > memory and uses munlock() so that it may then use your syscalls to obtain > > even more pinned memory. With the objective of taking the machine down. > > I'm in favor of having drivers call do_mlock() and do_munlock() on behalf of > the > application. All we need to do is export those functions, and my driver can > call them. > However, that still doesn't prevent an app from calling munlock(). Precisely. That's why I suggested that we have an alternative vma->vm_flag bit which behaves in a similar manner to VM_LOCKED (say, VM_LOCKED_KERNEL), only userspace cannot alter it. > But I don't understand the distinction between having the driver call > do_mlock() vs. the > application calling mlock(). Won't we still have the same problems? A > malicious app can > just call our driver instead of calling mlock() or munlock(). The driver > won't know the > difference between an authorized app and an unauthorized one. The driver will set VM_LOCKED_KERNEL, not VM_LOCKED. > Besides, isn't the whole point behind RLIMIT_MEMLOCK to limit how much one > process can lock? Sure. The internal setting of VM_LOCKED_KERNEL should still use RLIMIT_MEMLOCK accounting. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
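The RLIMIT_MEMLOCK accounting being discussed boils down to a charge/uncharge check against the per-process limit. The sketch below is a userspace model of that bookkeeping, not the real mm/mlock.c code; the struct and function names are invented for the example:

```c
#include <assert.h>

#define PAGE_SHIFT 12

/* Hypothetical per-process accounting state. */
struct pin_account {
    unsigned long locked_pages;   /* pages currently charged */
    unsigned long rlimit_bytes;   /* RLIMIT_MEMLOCK current value */
};

/*
 * Charge npages against the limit when the driver pins memory with
 * VM_LOCKED_KERNEL semantics. Returns 0 on success, -1 (standing in
 * for an errno like -ENOMEM) when the request would exceed the limit.
 */
static int charge_locked(struct pin_account *acct, unsigned long npages)
{
    unsigned long limit_pages = acct->rlimit_bytes >> PAGE_SHIFT;

    if (acct->locked_pages + npages > limit_pages)
        return -1;
    acct->locked_pages += npages;
    return 0;
}

/* Undone in ->release, so an unclean exit still returns the charge. */
static void uncharge_locked(struct pin_account *acct, unsigned long npages)
{
    acct->locked_pages -= npages;
}
```

Because user-initiated munlock() would never call uncharge_locked() for VM_LOCKED_KERNEL pages in this scheme, the hostile register/munlock/register-again pattern Andrew describes stays bounded by the limit.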
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Mon, Apr 25, 2005 at 04:24:05PM -0700, Andrew Morton wrote: > Libor Michalek <[EMAIL PROTECTED]> wrote: > > On Mon, Apr 25, 2005 at 03:35:42PM -0700, Andrew Morton wrote: > > > > > Yes, we expect that all the pages which get_user_pages() pinned > > > will become unpinned within the context of the syscall which pinned > > > the pages. Or shortly after, in the case of async I/O. > > > > When a network protocol is making use of async I/O the amount of time > > between posting the read request and getting the completion for that > > request is unbounded since it depends on the other half of the connection > > sending some data. In this case the buffer that was pinned during the > > io_submit() may be pinned, and holding the pages, for a long time. > > Sure. > > > During > > this time the process might fork, at this point any data received will be > > placed into the wrong spot. > > Well the data is placed in _a_ spot. That's only the "wrong" spot because > you've defined it to be wrong! > > IOW: what behaviour are you actually looking for here, and why, and does it > matter? For example a network server app has an open connection on which it uses async IO to submit two buffers for a read operation. Both buffers are pinned using get_user_pages() and the connection waits for data to arrive. The connection receives data, it is written into the first buffer, the app is notified using async IO, and it retrieves the async IO completion. The app reads the buffer which happens to contain a command to spawn a child, the app forks a child. Now there is still a buffer posted for read and if more data arrives on the connection that data is copied to the pages which were saved when the buffer was pinned. The app is notified, retrieves the async IO completion, but when it goes to read that buffer it will not have the new data.
> > > This is because there is no file descriptor or anything else associated > > > with the pages which permits the kernel to clean stuff up on unclean > > > application exit. Also there are the obvious issues with permitting > > > pinning of unbounded amounts of memory. > > > > Correct, the driver must be able to determine that the process has died > > and clean up after it, so the pinned region in most implementations is > > associated with an open file descriptor. > > How is that association created? The kernel module which pinned the memory is responsible for unpinning it if the file descriptor, which was used to deliver the command that resulted in the pinning, is closed. -Libor ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
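The cleanup-on-close association Libor describes can be sketched as a per-open-file list of pinned regions torn down in the fops ->release path. This is a userspace model of that lifetime rule, not driver code from the thread; the struct and function names are invented, and the counter decrement stands in for the put_page() loop:

```c
#include <assert.h>
#include <stdlib.h>

/* One pinned region, as in Roland's per-region list of structs. */
struct region {
    struct region *next;
    unsigned long npages;    /* pages pinned via get_user_pages() */
};

/* Per-open-file state; in a driver this would be file->private_data. */
struct pin_file {
    struct region *regions;
    unsigned long total_pinned;
};

static void pin_region(struct pin_file *pf, unsigned long npages)
{
    struct region *r = malloc(sizeof(*r));

    if (!r)
        abort();
    r->npages = npages;
    r->next = pf->regions;
    pf->regions = r;
    pf->total_pinned += npages;
}

/*
 * ->release analogue: when the descriptor that delivered the pinning
 * commands is closed (including on unclean process death), walk the
 * list and balance every get_user_pages() with put_page().
 */
static void pin_file_release(struct pin_file *pf)
{
    while (pf->regions) {
        struct region *r = pf->regions;

        pf->regions = r->next;
        pf->total_pinned -= r->npages;  /* stands in for put_page() calls */
        free(r);
    }
}
```

Tying the pins to the file descriptor is what makes the accounting robust against applications that exit without unregistering: close() or process exit always runs ->release.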
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew Morton wrote: I'm referring to an application which uses your syscalls to obtain pinned memory and uses munlock() so that it may then use your syscalls to obtain evem more pinned memory. With the objective of taking the machine down. I'm in favor of having drivers call do_mlock() and do_munlock() on behalf of the application. All we need to do is export those functions, and my driver can call them. However, that still doesn't prevent an app from calling munlock(). But I don't understand the distinction between having the driver call do_mlock() vs. the application calling mlock(). Won't we still have the same problems? A malicious app can just call our driver instead of calling mlock() or munlock(). The driver won't know the difference between an authorized app and an unauthorized one. Besides, isn't the whole point behind RLIMIT_MEMLOCK to limit how much one process can lock? I haven't even thought about memory hotswap. Surely it'll fail if the pages are pinned by get_user_pages()? Any memory registered for RDMA devices obviously can't be swapped out. Technically, the driver could detect that, and reject any attempt to transfer data to those regions until everything is remapped to other RAM. But that's opening a whole new can of worms. I don't know how the memory hotswap mechanism works, so I can't guess what recovery mechanisms can be implemented in the driver. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Tue, Apr 12, 2005 at 06:04:47PM -0700, Libor Michalek wrote: > On Mon, Apr 11, 2005 at 05:13:47PM -0700, Andrew Morton wrote: > > Roland Dreier <[EMAIL PROTECTED]> wrote: > > > > > > Troy> Do we even need the mlock in userspace then? > > > > > > Yes, because the kernel may go through and unmap pages from userspace > > > while trying to swap. Since we have the page locked in the kernel, > > > the physical page won't go anywhere, but userspace might end up with a > > > different page mapped at the same virtual address. > > With the last few kernels I haven't had a chance to retest the problem > that pushed us in the direction of using mlock. I will go back and do > so with the latest kernel. Below I've given a quick description of the > issue. > > > That shouldn't happen. If get_user_pages() has elevated the refcount on a > > page then the following can happen: > > > > - The VM may decide to add the page to swapcache (if it's not mmapped > > from a file). > > > > - Once the page is backed by either swapcache or a (mmapped) file, the VM > > may decide to unmap the application's pte's. A later minor fault by the > > app will cause the same physical page to be remapped. > > The driver did use get_user_pages() to elevate the refcount on all the > pages it was going to use for IO, as well as call set_page_dirty() since > the pages were going to have data written to them from the device. > > The problem we were seeing is that the minor fault by the app resulted > in a new physical page getting mapped for the application. The page that > had the elevated refcount was still waiting for the data to be written > to by the driver at the time that the app accessed the page causing the > minor fault. Obviously since the app had a new mapping the data written > by the driver was lost. > > It looks like code was added to try_to_unmap_one() to address this, so > hopefully it's no longer an issue...
I wrote a quick test module and program to confirm that the problem we saw in older kernels with get_user_pages() no longer exists. The module creates a character device with three different ioctl commands: - Pin the pages of a buffer using get_user_pages() - Check the pages by calling get_user_pages() a second time and comparing the new and original page list. - Release the pages using put_page() The program opens the character device file descriptor, pins the pages, and waits for a signal (sent to the process after running some other program that exercises the VM) before checking the pages. On older kernels the check fails, on my 2.6.11 kernel the check succeeds. So mlock is not needed on top of get_user_pages() as it was before. Thanks for the heads up. Module and program attached. -Libor /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * * $Id: $ */ #include #include #include #include #include #include #include #include #include #include #include #include MODULE_AUTHOR("Libor Michalek"); MODULE_DESCRIPTION("Get pages test"); MODULE_LICENSE("GPL"); enum { TEST_MAJOR = 232, TEST_MINOR = 255 }; #define TEST_DEV MKDEV(TEST_MAJOR, TEST_MINOR) enum { TEST_CMD_REGISTER = 1, TEST_CMD_UNREGISTER = 2, TEST_CMD_CHECK = 3 }; struct ioctl_arg { __u64 addr; __u64 size; }; struct region_root { struct semaphore mutex; struct list_head regions; /* list of pending events. */ struct file *filp; int nr_region; }; struct test_region { unsigned long user; unsigned long addr; unsigned long size;
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur Tabi <[EMAIL PROTECTED]> wrote: > > Andrew Morton wrote: > > > RLIMIT_MEMLOCK sounds like the appropriate mechanism. We cannot rely upon > > userspace running mlock(), so perhaps it is appropriate to run sys_mlock() > > in-kernel because that gives us the appropriate RLIMIT_MEMLOCK checking. > > I don't see what's wrong with relying on userspace to call mlock(). First > of all, all RDMA > apps call a third-party API, like DAPL or MPI, to register memory. The > memory needs to be > registered in order for the driver and adapter to know where it is. During > this > registration, the memory is also pinned. That's when we call mlock(). All the above refers to well-behaved applications. Now think about how the syscalls which you provide may be used by applications which are *designed* to cripple or to compromise the machine. > > > > However an hostile app can just go and run munlock() and then allocate > > some more pinned-by-get_user_pages() memory. > > Isn't mlock() on a per-process basis anyway? How can one process call > munlock() on > another process' memory? I'm referring to an application which uses your syscalls to obtain pinned memory and uses munlock() so that it may then use your syscalls to obtain even more pinned memory. With the objective of taking the machine down. > > umm, how about we > > > > - force the special pages into a separate vma > > > > - run get_user_pages() against it all > > > > - use RLIMIT_MEMLOCK accounting to check whether the user is allowed to > > do this thing > > > > - undo the RLIMIT_MEMLOCK accounting in ->release > > Isn't this kinda what mlock() does already? Create a new VMA and then > VM_LOCK it? kinda. But applications can undo the mlock which the kernel did. > > This will all interact with user-initiated mlock/munlock in messy ways. > > Maybe a new kernel-internal vma->vm_flag which works like VM_LOCKED but is > > unaffected by mlock/munlock activity is needed. > > > > A bit of generalisation in do_mlock() should suit?
> > Yes, but do_mlock() needs to prevent pages from being moved during memory > hotswap. I haven't even thought about memory hotswap. Surely it'll fail if the pages are pinned by get_user_pages()? ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
I don't think that we should jump to the conclusion that in the long term HPC users cannot benefit from support of mechanisms such as hot-removal of memory or other forms of page migration in physical memory. In an earlier exchange on the openib-general list Mike Krause sent the message quoted below on very much the same topic. On the other hand I am willing to accept that there is practical value to implementations which are not (yet) sophisticated enough to support the migration functions. Steve Langdon Michael Krause wrote: At 05:35 PM 3/14/2005, Caitlin Bestler wrote: > -Original Message- > From: Troy Benjegerdes [ mailto:[EMAIL PROTECTED] > Sent: Monday, March 14, 2005 5:06 PM > To: Caitlin Bestler > Cc: openib-general@openib.org > Subject: Re: [openib-general] Getting rid of pinned memory requirement > > > > > The key is that the entire operation either has to be fast > > enough so that no connection or application session layer > > time-outs occur, or an end-to-end agreement to suspend the > > connection is a requirement. The first option seems more > > plausible to me, the second essentially > > requires extending the CM protocol. That's a tall order even for > > InfiniBand, and it's even worse for iWARP where the CM > > functionality typically ends when the connection is established. > > I'll buy the good network design argument. I and others designed InfiniBand RNR (Receiver not ready) operations to allow one to adjust V-to-P mappings (not change the address that was advertised) in order to allow an OS to safely play some games with memory and not drop a connection. The time values associated with RNR allow a solution to tolerate up to an infinite amount of time to perform such operations, but the envisioned goal was to do this on the order of a handful of milliseconds in the worst case.
For iWARP, there was no support for defining RNR functionality, as indeed many people claimed one could just drop in-bound segments and allow the retransmission protocol to deal with the delay (even if this has performance implications due to back-off algorithms, though some claim SACK would minimize this to a large extent). Again, the idea was to minimize the worst case to milliseconds of down time. BTW, all of this assumed that the OS would not perform these types of changes that often, so the long-term impact on an application would be minimal. > > I suppose if the kernel wants to revoke a card's pinned > memory, we should be able to guarantee that it gets new > pinned memory within a bounded time. What sort of timing do > we need? Milliseconds? > Microseconds? > > In the case of iWarp, isn't this just TCP underneath? If so, > can't we just drop any packets in the pipe on the floor and > let them get retransmitted? (I suppose the same argument goes > for infiniband.. > what sort of a time window do we have for retransmission?) > > What are the limits on end-to-end flow control in IB and iWarp? > >From the RDMA Provider's perspective, the short answer is "quick enough so that I don't have to do anything heroic to keep the connection alive." It should not require anything heroic. What it does require is a local method to suspend the local QP(s) so that it cannot place or read memory in the affected area. That can take some time depending upon the implementation. There is then the time to overwrite the mappings, which again, depending upon the implementation and the number of mappings, could be milliseconds in length. With TCP you also have to add "and healthy". If you've ever had a long download that got effectively stalled by a burst of noise and you just hit the 'reload' button on your browser then you know what I'm talking about. 
But in transport neutral terms I would think that one RTT is definitely safe -- that much data could have been dropped by one switch failure or one nasty spike in inbound noise. > > > > Yes, there are limits on how much memory you can mlock, or even > > allocate. Applications are required to register memory precisely > > because the required guarantees are not there by default. > Eliminating > > those guarantees *is* effectively rewriting every RDMA application > > without even letting them know. > > Some of this argument is a policy issue, which I would argue > shouldn't be hard-coded in the code or in the network hardware. > > At least in my view, the guarantees are only there to make > applications go fast. We are getting low latency and high > performance with infiniband by making memory registration go > really really slow. If, to make big HPC simulation > applications work, we wind up doing memcpy() to put the data > into a registered buffer because we can't register half of > physical memory, the application isn't going very fast. > What you are looking for is a distinction between registering memory to *enable* the RNIC to optimize local access and registering memory to enable its being advertised to the remote end. Early im
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew Morton wrote: RLIMIT_MEMLOCK sounds like the appropriate mechanism. We cannot rely upon userspace running mlock(), so perhaps it is appropriate to run sys_mlock() in-kernel because that gives us the appropriate RLIMIT_MEMLOCK checking. I don't see what's wrong with relying on userspace to call mlock(). First of all, all RDMA apps call a third-party API, like DAPL or MPI, to register memory. The memory needs to be registered in order for the driver and adapter to know where it is. During this registration, the memory is also pinned. That's when we call mlock(). However a hostile app can just go and run munlock() and then allocate some more pinned-by-get_user_pages() memory. Isn't mlock() on a per-process basis anyway? How can one process call munlock() on another process' memory? umm, how about we - force the special pages into a separate vma - run get_user_pages() against it all - use RLIMIT_MEMLOCK accounting to check whether the user is allowed to do this thing - undo the RLIMIT_MEMLOCK accounting in ->release Isn't this kinda what mlock() does already? Create a new VMA and then VM_LOCK it? This will all interact with user-initiated mlock/munlock in messy ways. Maybe a new kernel-internal vma->vm_flag which works like VM_LOCKED but is unaffected by mlock/munlock activity is needed. A bit of generalisation in do_mlock() should suit? Yes, but do_mlock() needs to prevent pages from being moved during memory hotswap.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
IWAMOTO Toshihiro wrote: If such memory were allocated by a driver, the memory could be placed in non-hotremovable areas to avoid the above problems. How can the driver allocate 3GB of pinned memory on a system with 3.5GB of RAM? Can vmalloc() or get_free_pages() allocate that much memory?
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
At Mon, 25 Apr 2005 16:58:03 -0700, Roland Dreier wrote: > Andrew> It would be better to obtain this memory via a mmap() of > Andrew> some special device node, so we can perform appropriate > Andrew> permission checking and clean everything up on unclean > Andrew> application exit. > > This seems to interact poorly with how applications want to use RDMA, > ie typically through a library interface such as MPI. People doing > HPC don't want to recode their apps to use a new allocator, they just > want to link to a new MPI library and have the app go fast. Such HPC users cannot use the memory hotremoval feature, and something needs to be implemented so that the NUMA migration can handle such memory properly, but I see your point. If such memory were allocated by a driver, the memory could be placed in non-hotremovable areas to avoid the above problems. -- IWAMOTO Toshihiro
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Roland Dreier <[EMAIL PROTECTED]> wrote: > > Andrew> How does the driver detect process exit? > > I already answered earlier but just to be clear: registration goes > through a character device, and all regions are cleaned up in the > ->release() of that device. yup. > I don't currently have any code accounting against RLIMIT_MEMLOCK or > testing CAP_FOO, but I have no problem adding whatever is thought > appropriate. Userspace also has control over the permissions and > owner/group of the /dev node. I guess device node permissions won't be appropriate here, if only because it sounds like everyone will go and set them to 0666. RLIMIT_MEMLOCK sounds like the appropriate mechanism. We cannot rely upon userspace running mlock(), so perhaps it is appropriate to run sys_mlock() in-kernel because that gives us the appropriate RLIMIT_MEMLOCK checking. However a hostile app can just go and run munlock() and then allocate some more pinned-by-get_user_pages() memory. umm, how about we - force the special pages into a separate vma - run get_user_pages() against it all - use RLIMIT_MEMLOCK accounting to check whether the user is allowed to do this thing - undo the RLIMIT_MEMLOCK accounting in ->release This will all interact with user-initiated mlock/munlock in messy ways. Maybe a new kernel-internal vma->vm_flag which works like VM_LOCKED but is unaffected by mlock/munlock activity is needed. A bit of generalisation in do_mlock() should suit?
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew> How does the driver detect process exit? I already answered earlier but just to be clear: registration goes through a character device, and all regions are cleaned up in the ->release() of that device. I don't currently have any code accounting against RLIMIT_MEMLOCK or testing CAP_FOO, but I have no problem adding whatever is thought appropriate. Userspace also has control over the permissions and owner/group of the /dev node. - R.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Roland Dreier <[EMAIL PROTECTED]> wrote: > > Andrew> ug. What stops the memory from leaking if the process > Andrew> exits? > > Andrew> I hope this is a privileged operation? > > I don't think it has to be privileged. In my implementation, the > driver keeps a per-process list of registered memory regions and > unpins/cleans up on process exit. How does the driver detect process exit? > Andrew> It would be better to obtain this memory via a mmap() of > Andrew> some special device node, so we can perform appropriate > Andrew> permission checking and clean everything up on unclean > Andrew> application exit. > > This seems to interact poorly with how applications want to use RDMA, > ie typically through a library interface such as MPI. People doing > HPC don't want to recode their apps to use a new allocator, they just > want to link to a new MPI library and have the app go fast. Fair enough.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Caitlin Bestler <[EMAIL PROTECTED]> wrote: > > > > > > > This is because there is no file descriptor or anything else associated > > > > with the pages which permits the kernel to clean stuff up on unclean > > > > application exit. Also there are the obvious issues with permitting > > > > pinning of unbounded amounts of memory. > > > > > > Correct, the driver must be able to determine that the process has died > > > and clean up after it, so the pinned region in most implementations is > > > associated with an open file descriptor. > > > > How is that association created? > > > There is not a file descriptor, but there is an rnic handle. Both DAPL > and IT-API require that process death will result in the handle and all > of its dependent objects being released. What's an "rnic handle", in Linux terms? > The rnic handle can always be declared to be a "file descriptor" if > that makes it follow normal OS conventions more precisely. Does that mean that the code has not yet been implemented? Yes, a Linux fd is appropriate. But we don't have any sane way right now of saying "you need to run put_page() against all these pages in the ->release() handler". That'll need to be coded by yourselves. > There is also a need for some form of resource manager to approve > creation of Memory Regions. Obviously you cannot have multiple > applications claiming half of physical memory. The kernel already has considerable resource management capabilities. Please consider using/extending/generalising those before inventing anything new. RLIMIT_MEMLOCK would be a starting point. > But if you merely require the user to have root privileges in order > to create a Memory Region, and then take a first-come first-served > attitude, I don't think you end up with something that is truly a > general purpose capability. We don't want code in the kernel which will permit hostile unprivileged users to trivially cause the box to lock up. 
RLIMIT_MEMLOCK and, if necessary, CAP_IPC_LOCK sound appropriate here. > A general purpose RDMA capability requires the ability to indefinitely > pin large portions of user memory. It makes sense to integrate that > with OS policy control over resource utilization and to integrate it with > memory suspend/resume capabilities so that hotplug memory can > be supported. What you can't do is downgrade a Memory Region so > that it is no longer a memory region. Doing that means that you are > not truly supporting RDMA.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur> If you look at the Infiniband code that was recently Timur> submitted, I think you'll see it does exactly that: after Timur> calling mlock(), the driver calls get_user_pages(), and it Timur> stores the page mappings for future use. Andrew> Where? The code isn't merged yet. I sent a version to lkml for review -- in fact it was this very thread that we're in now. The code in question is in http://lkml.org/lkml/2005/4/4/266 This implements a "userspace verbs" character device that memory registration goes through. This means the kernel has a device node that will be closed when a process dies, and so the memory can be cleaned up. - R.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Caitlin> Every RDMA related interface specification that I know of Caitlin> specifically excludes support of RDMA resources being Caitlin> inherited by child processes, with the warning that Caitlin> excellent implementations will give the child process an Caitlin> error for attempting to use the parent's RDMA resources. Caitlin> More streamlined implementations will simply be Caitlin> unpredictable. Caitlin> As for forking while the parent has a pending read: since Caitlin> the parent has not reaped the completion at the time of Caitlin> the fork the buffers in question are undefined. The Caitlin> child's buffers will be consistent, that is they are Caitlin> undefined. I think you've missed the point: unless a process sets VM_DONTCOPY on its RDMA memory regions, then incorrect memory mappings may be used if the app does something as simple as calling system("ls"). - R.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew> Whoa, hang on. Andrew> The way we expect get_user_pages() to be used is that the Andrew> kernel will use get_user_pages() once per application I/O Andrew> request. Andrew> Are you saying that RDMA clients will semi-permanently own Andrew> pages which were pinned by get_user_pages()? That those Andrew> pages will be used for multiple separate I/O operations? Andrew> If so, then that's a significant design departure and it Andrew> would be good to hear why it is necessary. The idea is that applications manage the lifetime of pinned memory regions. They can do things like post multiple I/O operations without any page-walking overhead, or pass a buffer descriptor to a remote host who will send data at some indeterminate time in the future. In addition, InfiniBand has the notion of atomic operations, so a cluster application may be using some memory region to implement a global lock. This might not be the most kernel-friendly design but it is pretty deeply ingrained in the design of RDMA transports like InfiniBand and iWARP (RDMA over IP). I'm also not opposed to implementing some other mechanism to make this work, but the combination of get_user_pages() in the kernel and extending mprotect() to allow setting VM_DONTCOPY seems to work fine. - R.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew> ug. What stops the memory from leaking if the process Andrew> exits? Andrew> I hope this is a privileged operation? I don't think it has to be privileged. In my implementation, the driver keeps a per-process list of registered memory regions and unpins/cleans up on process exit. Andrew> It would be better to obtain this memory via a mmap() of Andrew> some special device node, so we can perform appropriate Andrew> permission checking and clean everything up on unclean Andrew> application exit. This seems to interact poorly with how applications want to use RDMA, ie typically through a library interface such as MPI. People doing HPC don't want to recode their apps to use a new allocator, they just want to link to a new MPI library and have the app go fast. - R.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 4/25/05, Andrew Morton <[EMAIL PROTECTED]> wrote: > > > > This is because there is no file descriptor or anything else associated > > > with the pages which permits the kernel to clean stuff up on unclean > > > application exit. Also there are the obvious issues with permitting > > > pinning of unbounded amounts of memory. > > > > Correct, the driver must be able to determine that the process has died > > and clean up after it, so the pinned region in most implementations is > > associated with an open file descriptor. > > How is that association created? There is not a file descriptor, but there is an rnic handle. Both DAPL and IT-API require that process death will result in the handle and all of its dependent objects being released. The rnic handle can always be declared to be a "file descriptor" if that makes it follow normal OS conventions more precisely. There is also a need for some form of resource manager to approve creation of Memory Regions. Obviously you cannot have multiple applications claiming half of physical memory. But if you merely require the user to have root privileges in order to create a Memory Region, and then take a first-come first-served attitude, I don't think you end up with something that is truly a general purpose capability. A general purpose RDMA capability requires the ability to indefinitely pin large portions of user memory. It makes sense to integrate that with OS policy control over resource utilization and to integrate it with memory suspend/resume capabilities so that hotplug memory can be supported. What you can't do is downgrade a Memory Region so that it is no longer a memory region. Doing that means that you are not truly supporting RDMA.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur Tabi <[EMAIL PROTECTED]> wrote: > > FYI, our driver detects the process termination and cleans up everything > itself. How is this implemented?
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Libor Michalek <[EMAIL PROTECTED]> wrote: > > On Mon, Apr 25, 2005 at 03:35:42PM -0700, Andrew Morton wrote: > > Timur Tabi <[EMAIL PROTECTED]> wrote: > > > > > > Andrew Morton wrote: > > > > > > > The way we expect get_user_pages() to be used is that the kernel will > > > > use > > > > get_user_pages() once per application I/O request. > > > > > > Are you saying that the mapping obtained by get_user_pages() is valid > > > only within the > > > context of the IOCtl call? That once the driver returns from the IOCtl, > > > the mapping > > > should no longer be used? > > > > Yes, we expect that all the pages which get_user_pages() pinned will become > > unpinned within the context of the syscall which pinned the pages. Or > > shortly after, in the case of async I/O. > > When a network protocol is making use of async I/O the amount of time > between posting the read request and getting the completion for that > request is unbounded since it depends on the other half of the connection > sending some data. In this case the buffer that was pinned during the > io_submit() may be pinned, and holding the pages, for a long time. Sure. > During > this time the process might fork, at this point any data received will be > placed into the wrong spot. Well the data is placed in _a_ spot. That's only the "wrong" spot because you've defined it to be wrong! IOW: what behaviour are you actually looking for here, and why, and does it matter? > > This is because there is no file descriptor or anything else associated > > with the pages which permits the kernel to clean stuff up on unclean > > application exit. Also there are the obvious issues with permitting > > pinning of unbounded amounts of memory. > > Correct, the driver must be able to determine that the process has died > and clean up after it, so the pinned region in most implementations is > associated with an open file descriptor. How is that association created? 
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew Morton wrote: They are permanent until someone runs put_page() against all the pages. What I'm saying is that all current callers of get_user_pages() _do_ run put_page() within the same syscall or upon I/O termination. Oh, okay then. I guess I'll get back to work! Actually, with RDMA, "I/O termination" technically doesn't happen until the memory is deregistered. When the memory is registered, all that means is that it should be pinned and the virtual-to-physical mapping should be stored. No actual I/O occurs at that point. If you look at the Infiniband code that was recently submitted, I think you'll see it does exactly that: after calling mlock(), the driver calls get_user_pages(), and it stores the page mappings for future use. > Where? I was talking about the code that Roland mentioned in the first message of this thread - the user-space verbs support. He said the code calls mlock() and get_user_pages(). FYI, our driver detects the process termination and cleans up everything itself. -- Timur Tabi Staff Software Engineer [EMAIL PROTECTED] One thing a Southern boy will never say is, "I don't think duct tape will fix it." -- Ed Smylie, NASA engineer for Apollo 13
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Mon, Apr 25, 2005 at 03:35:42PM -0700, Andrew Morton wrote: > Timur Tabi <[EMAIL PROTECTED]> wrote: > > > > Andrew Morton wrote: > > > > > The way we expect get_user_pages() to be used is that the kernel will use > > > get_user_pages() once per application I/O request. > > > > Are you saying that the mapping obtained by get_user_pages() is valid only > > within the > > context of the IOCtl call? That once the driver returns from the IOCtl, > > the mapping > > should no longer be used? > > Yes, we expect that all the pages which get_user_pages() pinned will become > unpinned within the context of the syscall which pinned the pages. Or > shortly after, in the case of async I/O. When a network protocol is making use of async I/O the amount of time between posting the read request and getting the completion for that request is unbounded since it depends on the other half of the connection sending some data. In this case the buffer that was pinned during the io_submit() may be pinned, and holding the pages, for a long time. During this time the process might fork, at this point any data received will be placed into the wrong spot. > This is because there is no file descriptor or anything else associated > with the pages which permits the kernel to clean stuff up on unclean > application exit. Also there are the obvious issues with permitting > pinning of unbounded amounts of memory. Correct, the driver must be able to determine that the process has died and clean up after it, so the pinned region in most implementations is associated with an open file descriptor. -Libor
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur Tabi <[EMAIL PROTECTED]> wrote: > > Andrew Morton wrote: > > > This is because there is no file descriptor or anything else associated > > with the pages which permits the kernel to clean stuff up on unclean > > application exit. Also there are the obvious issues with permitting > > pinning of unbounded amounts of memory. > > Then that might explain the "bug" that we're seeing with get_user_pages(). > We've been > assuming that get_user_pages() mappings are permanent. They are permanent until someone runs put_page() against all the pages. What I'm saying is that all current callers of get_user_pages() _do_ run put_page() within the same syscall or upon I/O termination. > Well, I was just about to re-implement get_user_pages() support in our driver > to > demonstrate the bug. I guess I'll hold off on that. > > If you look at the Infiniband code that was recently submitted, I think > you'll see it does > exactly that: after calling mlock(), the driver calls get_user_pages(), and > it stores the > page mappings for future use. Where? bix:/usr/src/linux-2.6.12-rc3> grep -rl get_user_pages . ./arch/i386/lib/usercopy.c ./arch/sparc64/kernel/ptrace.c ./drivers/video/pvr2fb.c ./drivers/media/video/video-buf.c ./drivers/scsi/sg.c ./drivers/scsi/st.c ./include/asm-ia64/pgtable.h ./include/linux/mm.h ./include/asm-um/archparam-i386.h ./include/asm-i386/fixmap.h ./fs/nfs/direct.c ./fs/aio.c ./fs/binfmt_elf.c ./fs/bio.c ./fs/direct-io.c ./kernel/futex.c ./kernel/ptrace.c ./mm/memory.c ./mm/nommu.c ./mm/rmap.c ./mm/mempolicy.c
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew Morton wrote: This is because there is no file descriptor or anything else associated with the pages which permits the kernel to clean stuff up on unclean application exit. Also there are the obvious issues with permitting pinning of unbounded amounts of memory. Then that might explain the "bug" that we're seeing with get_user_pages(). We've been assuming that get_user_pages() mappings are permanent. Well, I was just about to re-implement get_user_pages() support in our driver to demonstrate the bug. I guess I'll hold off on that. If you look at the Infiniband code that was recently submitted, I think you'll see it does exactly that: after calling mlock(), the driver calls get_user_pages(), and it stores the page mappings for future use.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur Tabi <[EMAIL PROTECTED]> wrote: > > Andrew Morton wrote: > > > The way we expect get_user_pages() to be used is that the kernel will use > > get_user_pages() once per application I/O request. > > Are you saying that the mapping obtained by get_user_pages() is valid only > within the > context of the IOCtl call? That once the driver returns from the IOCtl, the > mapping > should no longer be used? Yes, we expect that all the pages which get_user_pages() pinned will become unpinned within the context of the syscall which pinned the pages. Or shortly after, in the case of async I/O. This is because there is no file descriptor or anything else associated with the pages which permits the kernel to clean stuff up on unclean application exit. Also there are the obvious issues with permitting pinning of unbounded amounts of memory.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur Tabi <[EMAIL PROTECTED]> wrote: > > Andrew Morton wrote: > > > The way we expect get_user_pages() to be used is that the kernel will use > > get_user_pages() once per application I/O request. > > > > Are you saying that RDMA clients will semi-permanently own pages which were > > pinned by get_user_pages()? That those pages will be used for multiple > > separate I/O operations? > > Yes, absolutely! > > The memory buffer is allocated by the process (usually just via malloc) and > registered/pinned by the driver. It then stays pinned for the life of the > process (typically). ug. What stops the memory from leaking if the process exits? I hope this is a privileged operation? > > If so, then that's a significant design departure and it would be good to > > hear why it is necessary. > > That's just how RDMA works. Once the memory is pinned, if the app wants to > send data to > another node, it does two things: > > 1) Puts the data into its buffer > 2) Sends a "work request" to the driver with (among other things) the offset > and length of > the data. > > This is a time-critical operation. It must occur as fast as possible, which > means the > memory must have already been pinned. It would be better to obtain this memory via a mmap() of some special device node, so we can perform appropriate permission checking and clean everything up on unclean application exit.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew Morton wrote: The way we expect get_user_pages() to be used is that the kernel will use get_user_pages() once per application I/O request. Are you saying that the mapping obtained by get_user_pages() is valid only within the context of the IOCtl call? That once the driver returns from the IOCtl, the mapping should no longer be used?
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew Morton wrote: The way we expect get_user_pages() to be used is that the kernel will use get_user_pages() once per application I/O request. Are you saying that RDMA clients will semi-permanently own pages which were pinned by get_user_pages()? That those pages will be used for multiple separate I/O operations? Yes, absolutely! The memory buffer is allocated by the process (usually just via malloc) and registered/pinned by the driver. It then stays pinned for the life of the process (typically). If so, then that's a significant design departure and it would be good to hear why it is necessary. That's just how RDMA works. Once the memory is pinned, if the app wants to send data to another node, it does two things: 1) Puts the data into its buffer 2) Sends a "work request" to the driver with (among other things) the offset and length of the data. This is a time-critical operation. It must occur as fast as possible, which means the memory must have already been pinned.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Roland Dreier <[EMAIL PROTECTED]> wrote: > > Andrew> Do we care about that? A straightforward scenario under > Andrew> which this can happen is: > > Andrew> a) app starts some read I/O in an asynchronous manner > Andrew> b) app forks > Andrew> c) child writes to one of the pages which is still under read I/O > Andrew> d) the read I/O completes > Andrew> e) the child is left with the old data plus the child's > modification instead > Andrew> of the new data > > Andrew> which is a very silly application which is giving itself > Andrew> unpredictable memory contents anyway. > > Andrew> I assume there's a more sensible scenario? > > You're right, that is a silly scenario ;) In fact if we mark vmas > with VM_DONTCOPY, then the child just crashes with a seg fault. > > The type of thing I'm worried about is something like, for example: > > a) app registers memory region with RDMA hardware -- in other words, > loads the device's translation table for future I/O Whoa, hang on. The way we expect get_user_pages() to be used is that the kernel will use get_user_pages() once per application I/O request. Are you saying that RDMA clients will semi-permanently own pages which were pinned by get_user_pages()? That those pages will be used for multiple separate I/O operations? If so, then that's a significant design departure and it would be good to hear why it is necessary. > b) app forks > c) app writes to the registered memory region, and the kernel breaks > the COW for the (now read-only) page by mapping a new page > d) app starts an I/O that will do a DMA read from the region > e) device reads using the wrong, old mapping Sure. But such an app could be declared to be buggy... > This can be pretty insidious because for example fork() + immediate > exec() or just using system() still leaves the parent with PTEs marked > read-only. If an application does overlapping memory registrations so > get_user_pages() is called a lot, then as far as I can see > can_share_swap_page() will always return 0 and the COW will happen > even if the child process has thrown out its original vmas. > > Or if the counts are in the correct range, then there's a small window > between fork() and exec() where the parent process can screw itself > up, so most of the time the app works, until it doesn't. > > - R.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 4/25/05, Roland Dreier <[EMAIL PROTECTED]> wrote: > Andrew> Do we care about that? A straightforward scenario under > Andrew> which this can happen is: > > Andrew> a) app starts some read I/O in an asynchronous manner > Andrew> b) app forks > Andrew> c) child writes to one of the pages which is still under read I/O > Andrew> d) the read I/O completes > Andrew> e) the child is left with the old data plus the child's > modification instead > Andrew> of the new data > > Andrew> which is a very silly application which is giving itself > Andrew> unpredictable memory contents anyway. > > Andrew> I assume there's a more sensible scenario? > > You're right, that is a silly scenario ;) In fact if we mark vmas > with VM_DONTCOPY, then the child just crashes with a seg fault. > > The type of thing I'm worried about is something like, for example: > > a) app registers memory region with RDMA hardware -- in other words, > loads the device's translation table for future I/O > b) app forks > c) app writes to the registered memory region, and the kernel breaks > the COW for the (now read-only) page by mapping a new page > d) app starts an I/O that will do a DMA read from the region > e) device reads using the wrong, old mapping > > This can be pretty insidious because for example fork() + immediate > exec() or just using system() still leaves the parent with PTEs marked > read-only. If an application does overlapping memory registrations so > get_user_pages() is called a lot, then as far as I can see > can_share_swap_page() will always return 0 and the COW will happen > even if the child process has thrown out its original vmas. > > Or if the counts are in the correct range, then there's a small window > between fork() and exec() where the parent process can screw itself > up, so most of the time the app works, until it doesn't. > Every RDMA-related interface specification that I know of specifically excludes support of RDMA resources being inherited by child processes, with the warning that excellent implementations will give the child process an error for attempting to use the parent's RDMA resources. More streamlined implementations will simply be unpredictable. As for forking while the parent has a pending read: since the parent has not reaped the completion at the time of the fork, the buffers in question are undefined. The child's buffers will be consistent, that is, they are undefined.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Roland Dreier <[EMAIL PROTECTED]> wrote: > > Timur> With mlock(), we don't need to use get_user_pages() at all. > Timur> Arjan tells me the only time an mlocked page can move is > Timur> with hot (un)plug of memory, but that isn't supported on > Timur> the systems that we support. We actually prefer mlock() > Timur> over get_user_pages(), because if the process dies, the > Timur> locks automatically go away too. > > There actually is another way pages can move, with both > get_user_pages() and mlock(): copy-on-write after a fork(). If > userspace does a fork(), then all PTEs are marked read-only, and if > the original process touches the page after the fork(), a new page > will be allocated and mapped at the original virtual address. Do we care about that? A straightforward scenario under which this can happen is: a) app starts some read I/O in an asynchronous manner b) app forks c) child writes to one of the pages which is still under read I/O d) the read I/O completes e) the child is left with the old data plus the child's modification instead of the new data which is a very silly application which is giving itself unpredictable memory contents anyway. I assume there's a more sensible scenario? ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew> Do we care about that? A straightforward scenario under Andrew> which this can happen is: Andrew> a) app starts some read I/O in an asynchronous manner Andrew> b) app forks Andrew> c) child writes to one of the pages which is still under read I/O Andrew> d) the read I/O completes Andrew> e) the child is left with the old data plus the child's modification instead Andrew> of the new data Andrew> which is a very silly application which is giving itself Andrew> unpredictable memory contents anyway. Andrew> I assume there's a more sensible scenario? You're right, that is a silly scenario ;) In fact if we mark vmas with VM_DONTCOPY, then the child just crashes with a seg fault. The type of thing I'm worried about is something like, for example: a) app registers memory region with RDMA hardware -- in other words, loads the device's translation table for future I/O b) app forks c) app writes to the registered memory region, and the kernel breaks the COW for the (now read-only) page by mapping a new page d) app starts an I/O that will do a DMA read from the region e) device reads using the wrong, old mapping This can be pretty insidious because for example fork() + immediate exec() or just using system() still leaves the parent with PTEs marked read-only. If an application does overlapping memory registrations so get_user_pages() is called a lot, then as far as I can see can_share_swap_page() will always return 0 and the COW will happen even if the child process has thrown out its original vmas. Or if the counts are in the correct range, then there's a small window between fork() and exec() where the parent process can screw itself up, so most of the time the app works, until it doesn't. - R.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Sat, Apr 23, 2005 at 07:44:21PM -0700, Andrew Morton wrote: > Timur Tabi <[EMAIL PROTECTED]> wrote: > > As I said, the testcase only works with our hardware, and it's also > > very large. It's one small test that's part of a huge test suite. > > It takes a couple hours just to install the damn thing. > > > > We want to produce a simpler test case that demonstrates the problem in an > > easy-to-understand manner, but we don't have time to do that now. > > If your theory is correct then it should be able to demonstrate this > problem without any special hardware at all: pin some user memory, then > generate memory pressure then check the contents of those pinned pages. > > But if, for the DMA transfer, you're using the array of page*'s which were > originally obtained from get_user_pages() then it's rather hard to see how > the kernel could alter the page's contents. > > Then again, if mlock() fixes it then something's up. Very odd. Andrew, Libor Michalek posted a much more reasonable (to my limited understanding) bug description in <[EMAIL PROTECTED]>. (And I'd love to provide a URL, but damned if I can figure out how to find that message on gmane. Clue-bat applications gladly accepted.) Libor Michalek wrote: # The driver did use get_user_pages() to elevate the refcount on all the # pages it was going to use for IO, as well as call set_page_dirty() since # the pages were going to have data written to them from the device. # # The problem we were seeing is that the minor fault by the app resulted # in a new physical page getting mapped for the application. The page that # had the elevated refcount was still waiting for the data to be written # to by the driver at the time that the app accessed the page causing the # minor fault. Obviously since the app had a new mapping the data written # by the driver was lost. # # It looks like code was added to try_to_unmap_one() to address this, so # hopefully it's no longer an issue... Which makes me think that Timur's bug is just an insufficiently-understood version of Libor's. -andy
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
That leaves a problem when the same memory region is registered with different vendors. Verbs A marks the area, Verbs B sees that it is already marked, Verbs A unmarks the area when it is done, not knowing that B is relying on the memory staying pinned. I do not believe there is a solution to this problem when working at arm's length from Linux, other than documenting the problem and informing applications of workarounds required when using multiple vendors concurrently with the same memory (i.e., destroy the most recently created memory region first, or pin the memory yourself before creating the first memory region). The only other alternative is to make the pinning some sort of shared service that would apply across multiple vendors. That is doable, but might not be worthwhile given that a single process using multiple vendor devices concurrently is decidedly the exception. But those users deserve at least a warning. On 4/25/05, Roland Dreier <[EMAIL PROTECTED]> wrote: > Caitlin> Who is responsible for counting within a process, and > Caitlin> then between processes (in case shared memory is being > Caitlin> registered)? The application? Middleware? Driver? > > The verbs code doing the registration should do it as part of the > registration. Shared memory does not cause any additional issues > because it is mapped into the virtual memory map of each process and > must be marked VM_DONTCOPY in each process separately. > > - R.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Caitlin> Who is responsible for counting within a process, and Caitlin> then between processes (in case shared memory is being Caitlin> registered)? The application? Middleware? Driver? The verbs code doing the registration should do it as part of the registration. Shared memory does not cause any additional issues because it is mapped into the virtual memory map of each process and must be marked VM_DONTCOPY in each process separately. - R.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On 4/25/05, Roland Dreier <[EMAIL PROTECTED]> wrote: > Timur> With mlock(), we don't need to use get_user_pages() at all. > Timur> Arjan tells me the only time an mlocked page can move is > Timur> with hot (un)plug of memory, but that isn't supported on > Timur> the systems that we support. We actually prefer mlock() > Timur> over get_user_pages(), because if the process dies, the > Timur> locks automatically go away too. > > There actually is another way pages can move, with both > get_user_pages() and mlock(): copy-on-write after a fork(). If > userspace does a fork(), then all PTEs are marked read-only, and if > the original process touches the page after the fork(), a new page > will be allocated and mapped at the original virtual address. > > This is actually a pretty big pain, because the only good solution > seems to be for the kernel to mark these registered regions as > VM_DONTCOPY. Right now this means that driver code ends up monkeying > with vm_flags for user vmas. > > Does it seem reasonable to add a new system call to let userspace mark > memory it doesn't want copied into forked processes? Something like > > long sys_mark_nocopy(unsigned long addr, size_t len, int mark) > > which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0. > A better name would be gratefully accepted... > > Then to register memory for RDMA, userspace would call > sys_mark_nocopy() (with appropriate accounting to handle possibly > overlapping regions) and the kernel would call get_user_pages(). The > get_user_pages() is of course required because the kernel can't trust > userspace to keep the pages locked. mlock() would no longer be > necessary. We can trust userspace to call sys_mark_nocopy() as > needed, because a process can only hurt itself and its children by > misusing the sys_mark_nocopy() call. > > If this seems reasonable then I can code a patch. 
> Who is responsible for counting within a process, and then between processes (in case shared memory is being registered)? The application? Middleware? Driver? My concern here is that the application layer may not be fully aware when middleware is registering memory, and middleware may not be fully aware when the memory it receives from the application is shared with another process.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Roland> Does it seem reasonable to add a new system call to let Roland> userspace mark memory it doesn't want copied into forked Roland> processes? Something like Roland> long sys_mark_nocopy(unsigned long addr, size_t len, int Roland> mark) Roland> which would set VM_DONTCOPY if mark != 0, and clear it if Roland> mark == 0. A better name would be gratefully accepted... Christoph> add a new MAP_DONTCOPY flag and accept it in mmap and Christoph> mprotect? That is much better, thanks. But I think it would need to be PROT_DONTCOPY to work with mprotect(), right? - R.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Sun, 2005-04-24 at 23:12 -0500, Timur Tabi wrote: > Greg KH wrote: > > I know of at least 1 x86-32 box from a three-letter-named company with > > this feature that has been shipping for a few _years_ now. That box is > > pretty much everywhere now, and I know that other versions of it are > > also quite popular (despite the high cost...) > > Hmm... Well, I think we were already planning on telling our customers that > we don't > support hot-swap RAM. Is there a CONFIG option for that feature? The driver to do the ACPI portion of both add and remove is in the kernel today, so it's certainly a feature that's coming relatively soon. There is a large variety of x86_64, ppc64 and ia64 hardware that will be doing memory hotplug. I believe that every POWER5 system is capable of supporting it, at least virtually. I don't think your concerns end with memory hotplug. The same approaches to moving memory around will be used for NUMA memory balancing and for memory defragmentation. Can you say that your cards will never be used on a system which has memory which becomes fragmented? -- Dave
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Mon, Apr 25, 2005 at 06:15:10AM -0700, Roland Dreier wrote: > Does it seem reasonable to add a new system call to let userspace mark > memory it doesn't want copied into forked processes? Something like > > long sys_mark_nocopy(unsigned long addr, size_t len, int mark) > > which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0. > A better name would be gratefully accepted... add a new MAP_DONTCOPY flag and accept it in mmap and mprotect?
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur> With mlock(), we don't need to use get_user_pages() at all. Timur> Arjan tells me the only time an mlocked page can move is Timur> with hot (un)plug of memory, but that isn't supported on Timur> the systems that we support. We actually prefer mlock() Timur> over get_user_pages(), because if the process dies, the Timur> locks automatically go away too. There actually is another way pages can move, with both get_user_pages() and mlock(): copy-on-write after a fork(). If userspace does a fork(), then all PTEs are marked read-only, and if the original process touches the page after the fork(), a new page will be allocated and mapped at the original virtual address. This is actually a pretty big pain, because the only good solution seems to be for the kernel to mark these registered regions as VM_DONTCOPY. Right now this means that driver code ends up monkeying with vm_flags for user vmas. Does it seem reasonable to add a new system call to let userspace mark memory it doesn't want copied into forked processes? Something like long sys_mark_nocopy(unsigned long addr, size_t len, int mark) which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0. A better name would be gratefully accepted... Then to register memory for RDMA, userspace would call sys_mark_nocopy() (with appropriate accounting to handle possibly overlapping regions) and the kernel would call get_user_pages(). The get_user_pages() is of course required because the kernel can't trust userspace to keep the pages locked. mlock() would no longer be necessary. We can trust userspace to call sys_mark_nocopy() as needed, because a process can only hurt itself and its children by misusing the sys_mark_nocopy() call. If this seems reasonable then I can code a patch. - R.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Greg KH wrote: I know of at least 1 x86-32 box from a three-letter-named company with this feature that has been shipping for a few _years_ now. That box is pretty much everywhere now, and I know that other versions of it are also quite popular (despite the high cost...) Hmm... Well, I think we were already planning on telling our customers that we don't support hot-swap RAM. Is there a CONFIG option for that feature? Your hardware is just a pci card, right? Why wouldn't it work on ppc64 and ia64 then? It's PCI-X, actually, and I don't think we've ever actually plugged it into a PPC box. Isn't Open Firmware support required for all PPC boxes, anyway? Our PCI card is not OF compatible, AFAIK. As for IA64, well, we could support it, but it's not a high enough priority. We do have some CPU-specific code in our driver that we would need to port to IA-64. Wait, what _is_ "your stuff"? The open-ib code? No, if anything, it's the competition to IB. It's called iWARP (RDMA over TCP/IP), and it's similar to IB except it uses gigabit ethernet instead of whatever hardware IB uses. Because we also support RDMA, we have the same problems as OpenIB; however, we would prefer that the kernel support OpenRDMA instead, since it's more generic. > Or some other, private fork? Any pointers to this stuff? http://ammasso.com/support.html The current version of the code calls sys_mlock() directly from the driver. We haven't yet released the version that calls mlock().
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Sun, Apr 24, 2005 at 04:52:31PM -0500, Timur Tabi wrote: > Greg KH wrote: > > >You don't "support" i386 or ia64 or x86-64 or ppc64 systems? What > >hardware do you support? > > I've never seen or heard of any x86-32 or x86-64 system that supports > hot-swap RAM. I know of at least 1 x86-32 box from a three-letter-named company with this feature that has been shipping for a few _years_ now. That box is pretty much everywhere now, and I know that other versions of it are also quite popular (despite the high cost...) > Our hardware does not support PPC, and our software doesn't support > ia-64. Your hardware is just a pci card, right? Why wouldn't it work on ppc64 and ia64 then? > > And what about the fact that you are aiming to > >get this code into mainline, right? If not, why are you asking here? > >:) > > Well, our primary concern is getting our stuff to work. Since > get_user_pages() doesn't work, but mlock() does, that's what we use. I > don't know how to fix get_user_pages(), and I don't have the time right now > to figure it out. I know that technically mlock() is not the right way to > do it, and so we're not going to be submitting our code for the mainline > until get_user_pages() works and our code uses it instead of mlock(). Wait, what _is_ "your stuff"? The open-ib code? Or some other, private fork? Any pointers to this stuff? thanks, greg k-h
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Greg KH wrote: You don't "support" i386 or ia64 or x86-64 or ppc64 systems? What hardware do you support? I've never seen or heard of any x86-32 or x86-64 system that supports hot-swap RAM. Our hardware does not support PPC, and our software doesn't support ia-64. > And what about the fact that you are aiming to get this code into mainline, right? If not, why are you asking here? :) Well, our primary concern is getting our stuff to work. Since get_user_pages() doesn't work, but mlock() does, that's what we use. I don't know how to fix get_user_pages(), and I don't have the time right now to figure it out. I know that technically mlock() is not the right way to do it, and so we're not going to be submitting our code for the mainline until get_user_pages() works and our code uses it instead of mlock().
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Sun, Apr 24, 2005 at 09:23:48AM -0500, Timur Tabi wrote: > Andrew Morton wrote: > > >If your theory is correct then it should be able to demonstrate this > >problem without any special hardware at all: pin some user memory, then > >generate memory pressure then check the contents of those pinned pages. > > I tried that, but I couldn't get it to fail. But that was a while ago, and > I've learned a few things since then, so I'll try again. > > >But if, for the DMA transfer, you're using the array of page*'s which were > >originally obtained from get_user_pages() then it's rather hard to see how > >the kernel could alter the page's contents. > > > >Then again, if mlock() fixes it then something's up. Very odd. > > With mlock(), we don't need to use get_user_pages() at all. Arjan tells me > the only time an mlocked page can move is with hot (un)plug of memory, but > that isn't supported on the systems that we support. You don't "support" i386 or ia64 or x86-64 or ppc64 systems? What hardware do you support? And what about the fact that you are aiming to get this code into mainline, right? If not, why are you asking here? :) thanks, greg k-h
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andrew Morton wrote: If your theory is correct then it should be able to demonstrate this problem without any special hardware at all: pin some user memory, then generate memory pressure then check the contents of those pinned pages. I tried that, but I couldn't get it to fail. But that was a while ago, and I've learned a few things since then, so I'll try again. But if, for the DMA transfer, you're using the array of page*'s which were originally obtained from get_user_pages() then it's rather hard to see how the kernel could alter the page's contents. Then again, if mlock() fixes it then something's up. Very odd. With mlock(), we don't need to use get_user_pages() at all. Arjan tells me the only time an mlocked page can move is with hot (un)plug of memory, but that isn't supported on the systems that we support. We actually prefer mlock() over get_user_pages(), because if the process dies, the locks automatically go away too.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Timur Tabi <[EMAIL PROTECTED]> wrote: > > Christoph Hellwig wrote: > > On Mon, Apr 18, 2005 at 11:22:29AM -0500, Timur Tabi wrote: > > > >>That's not what we're seeing. We have hardware that does DMA over the > >>network (much like the Infiniband stuff), and we have a testcase that fails > >>if get_user_pages() is used, but not if mlock() is used. > > > > > > If you don't share your testcase it's unlikely to be fixed. > > As I said, the testcase only works with our hardware, and it's also very > large. It's one > small test that's part of a huge test suite. It takes a couple hours just to > install the > damn thing. > > We want to produce a simpler test case that demonstrates the problem in an > easy-to-understand manner, but we don't have time to do that now. If your theory is correct then it should be able to demonstrate this problem without any special hardware at all: pin some user memory, then generate memory pressure then check the contents of those pinned pages. But if, for the DMA transfer, you're using the array of page*'s which were originally obtained from get_user_pages() then it's rather hard to see how the kernel could alter the page's contents. Then again, if mlock() fixes it then something's up. Very odd.
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Arjan van de Ven wrote: On Mon, 2005-04-18 at 11:09 -0500, Timur Tabi wrote: Roland Dreier wrote: Troy> How is memory pinning handled? (I haven't had time to read Troy> all the code, so please excuse my ignorance of something Troy> obvious). The userspace library calls mlock() and then the kernel does get_user_pages(). Why do you call mlock() and get_user_pages()? In our code, we only call mlock(), and the memory is pinned. this is a myth; linux is free to move the page about in physical memory even if it's mlock()ed!! Can you tell me when Linux actually does this? I know in theory it can happen, but I've never seen it. Does the code to implement moving of data from one physical page to another even exist in any version of Linux? Also, what would be the point? What reason would there be to move some data from one physical page to another, while keeping the same virtual address?
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Fri, 2005-04-22 at 12:55 -0500, Timur Tabi wrote:
> Arjan van de Ven wrote:
> > On Mon, 2005-04-18 at 11:09 -0500, Timur Tabi wrote:
> >
> >> Roland Dreier wrote:
> >>
> >>> Troy> How is memory pinning handled? (I haven't had time to read
> >>> Troy> all the code, so please excuse my ignorance of something
> >>> Troy> obvious).
> >>>
> >>> The userspace library calls mlock() and then the kernel does
> >>> get_user_pages().
> >>
> >> Why do you call mlock() and get_user_pages()? In our code, we only
> >> call mlock(), and the memory is pinned.
> >
> > this is a myth; linux is free to move the page about in physical
> > memory even if it's mlock()ed!!
>
> Can you tell me when Linux actually does this? I know in theory it can
> happen, but I've never seen it. Does the code to implement moving of
> data from one physical page to another even exist in any version of
> Linux?

hot(un)plug memory.

> Also, what would be the point? What reason would there be to move some
> data from one physical page to another, while keeping the same virtual
> address?

so that you can hot unplug the dimm in question. I guess that's a bit of a
high end thing though... so maybe you don't care about it.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Andy Isaacson <[EMAIL PROTECTED]> wrote:
> On Wed, Apr 20, 2005 at 10:07:45PM -0500, Timur Tabi wrote:
>> I don't know if VM_REGISTERED is a good idea or not, but it should be
>> absolutely impossible for the kernel to reclaim "registered" (aka
>> pinned) memory, no matter what. For RDMA services (such as Infiniband,
>> iWARP, etc), it's normal for non-root processes to pin hundreds of
>> megabytes of memory, and that memory better be locked to those physical
>> pages until the application deregisters them.
>
> If you take the hardline position that "the app is the only thing that
> matters", your code is unlikely to get merged. Linux is a
> general-purpose OS.

All userspace hardware drivers with DMA will require pinned pages (and some
of them will require contiguous memory). Since this memory may be scheduled
to be accessed by DMA, reclaiming those pages may (read: will) result in
"random" memory corruption unless done by the driver itself. You can't even
set a time limit: the driver may have allocated all DMA memory to queued
transfers, and some media needs to get plugged in by the lazy robot. As
soon as the robot arrives - boom. (For the same reason, this memory MUST
NOT be freed if the application terminates abnormally, e.g. killed by OOM.)

In other words, you need to make this memory as inaccessible as the
framebuffer on a graphics card. If that would cause a lockup, you had
better prevent it at allocation time.

> In a Linux context, I doubt that fullblown SA is necessary or
> appropriate. Rather, I'd suggest two new signals, SIGMEMLOW and
> SIGMEMCRIT. The userland comms library registers handlers for both.
> When the kernel decides that it needs to reclaim some memory from the
> app, it sends SIGMEMLOW. The comms library then has the responsibility
> to un-reserve some memory in an orderly fashion. If a reasonable [1]
> time has expired since SIGMEMLOW and the kernel is still hungry, the
> kernel sends SIGMEMCRIT. At this point, the comms lib *must* unregister
> some memory [2] even if it has to drop state to do so; if it returns
> from the signal handler without having unregistered the memory, the
> kernel will SIGKILL.

Choosing data loss over a finitely stalled system may sometimes be a bad
decision. If I designed an application that might get a "gimme memory or
die", I'd reserve an extra bunch of memory with the sole purpose of its
being released in this situation. If the kernel had done that instead, this
part of memory could have been used e.g. as a read-only disk cache in the
meantime (of course, provided somebody cared to implement that).

> [2] Is there a way for the kernel to pass down to userspace how many
> pages it wants, maybe in the sigcontext?

Then you'd need only one signal. I think this interface is useful; it would
e.g. allow a picture viewer to cache as many decoded and scaled pictures as
the RAM permits, freeing them if the RAM gets full and the swap would have
to be used.

-- 
"When the pin is pulled, Mr. Grenade is not our friend."
-- U.S. Marine Corps
[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Hi!

> > Timur> Why do you call mlock() and get_user_pages()? In our code,
> > Timur> we only call mlock(), and the memory is pinned. We have a
> > Timur> test case that fails if only get_user_pages() is called,
> > Timur> but it passes if only mlock() is called.
> >
> > What if a buggy/malicious userspace program doesn't call mlock()?
>
> Our library calls mlock() when the app requests memory to be
> "registered". We then call munlock() when the app requests the
> memory to be unregistered. All apps talk to our library for all
> services. No apps talk to the driver directly.

That does not cover the "malicious" part.

Pavel
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
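The register/unregister flow described in the quoted text can be sketched as a pair of library wrappers. The names are illustrative (this is not the real verbs API), and as the thread notes, a real implementation would also have the kernel side take its own reference with get_user_pages() rather than trusting userspace to stay mlock()ed.

```c
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative library-side register/unregister pair: the app never
 * talks to the driver directly, it asks the library, and the library
 * pins the region with mlock(). */
int mem_register(void *addr, size_t len)
{
    /* A real driver would additionally get_user_pages() this range
     * and program the HCA's translation tables. */
    return mlock(addr, len);
}

int mem_unregister(void *addr, size_t len)
{
    return munlock(addr, len);
}
```

This is exactly the scheme the reply objects to: nothing in the kernel stops a buggy or malicious app from bypassing the library and skipping the mlock().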
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Thu, Apr 21, 2005 at 03:07:42PM -0500, Timur Tabi wrote:
> > *You* need to come up with a solution that looks good to *the
> > community* if you want it merged.
>
> True, but I'm not going to waste my time adding this support if the
> consensus I get from the kernel developers is that they don't want Linux
> to behave this way.

I think we have been giving you that consensus from the very beginning :)

The very fact that you tried to trot out the "enterprise" card should have
raised a huge flag...

thanks,

greg k-h
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
On Thu, 2005-04-21 at 13:25 -0700, Chris Wright wrote:
> * Timur Tabi ([EMAIL PROTECTED]) wrote:
> > It works with every kernel I've tried. I'm sure there are plenty of
> > kernel configuration options that will break our driver. But as long
> > as all the distros our customers use work, as well as
> > reasonably-configured custom kernels, we're happy.
>
> Hey, if you're happy (and, as you said, you don't intend to merge that
> bit), I'm happy ;-)

yeah... drivers giving unprivileged processes more privs belong on bugtraq
though, not in the core kernel :)
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
* Timur Tabi ([EMAIL PROTECTED]) wrote:
> It works with every kernel I've tried. I'm sure there are plenty of
> kernel configuration options that will break our driver. But as long as
> all the distros our customers use work, as well as reasonably-configured
> custom kernels, we're happy.

Hey, if you're happy (and, as you said, you don't intend to merge that
bit), I'm happy ;-)

thanks,
-chris
-- 
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Chris Wright wrote:
> FYI, that will not work on all 2.6 kernels. Specifically anything that's
> not using capabilities.

It works with every kernel I've tried. I'm sure there are plenty of kernel
configuration options that will break our driver. But as long as all the
distros our customers use work, as well as reasonably-configured custom
kernels, we're happy.
Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
* Timur Tabi ([EMAIL PROTECTED]) wrote:
> Andy Isaacson wrote:
> > Do you guys simply raise RLIMIT_MEMLOCK to allow apps to lock their
> > pages? Or are you doing something more nasty?
>
> A little more nasty. I raise RLIMIT_MEMLOCK in the driver to "unlimited"
> and also set cap_raise(IPC_LOCK). I do this because I need to support
> all 2.4 and 2.6 kernel versions with the same driver, but only 2.6.10
> and later have any support for non-root mlock().

FYI, that will not work on all 2.6 kernels. Specifically anything that's
not using capabilities.

thanks,
-chris
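On 2.6.10 and later, the unprivileged route is to rely on RLIMIT_MEMLOCK rather than having the driver raise capabilities behind the admin's back. A minimal sketch of the userspace check (the function name is illustrative):

```c
#include <stddef.h>
#include <sys/resource.h>

/* Sketch: before pinning, ask whether RLIMIT_MEMLOCK permits locking
 * `need` bytes.  On 2.6.10+ unprivileged mlock() is allowed up to the
 * soft limit, so no driver-side cap_raise() is required. */
int can_pin(size_t need)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return 0;
    return rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= need;
}
```

The admin can then grant a larger limit per user via setrlimit() or pam_limits instead of the driver granting it to every process.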