Re: swapoff?
Thus spake Matthew Dillon [EMAIL PROTECTED]:
> :The concern was that there could be a race where the process is
> :swapped out again after I have swapped it back in but before I can
> :dirty its pages. (Perhaps I need to hold the process lock a bit
> :longer.) Under heavy swapping load, swapoff() is failing to find
> :a single page about one time out of ten, and I thought that might
> :be the cause.
>
> Have you definitively tracked down the missing page? It ought
> to be fairly easy to do from a kernel core with a gdb macro.

No, I've tested it extensively, and I haven't been able to
reproduce the problem since I updated my sources. (It was hard to
reproduce beforehand.) I did two more runs with one swap device
and two runs with two swap devices, and it worked even when the
system was thrashing.

The latest patches are at

    http://www.CSUA.Berkeley.EDU/~das/swapoff.patch4

Performance is now much better when there are multiple swap
devices. Instead of effectively having to wait for each hash
chain to become quiescent, swapoff now skips busy objects, then
does a complete rescan if it missed anything. Only a few rescans
are required, even with multiple active swap devices.
A clustering optimization might still be worthwhile, but that can
be done another day.

(Sorry for the delay in getting back to you. I've been too busy
and sick for the last week to work on this.)

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message
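For readers without the patch at hand, the skip-and-rescan strategy described in this message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the code in swapoff.patch4: `toy_object`, `scan_once()` and `drain_device()` are hypothetical names, and the `busy_passes` field merely stands in for an object whose paging-in-progress count has not yet dropped to zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a swap-backed VM object. */
struct toy_object {
    int blocks_on_dev;  /* swap blocks still on the device being removed */
    int busy_passes;    /* object stays busy for this many more passes */
};

/*
 * One pass over all objects: drain the idle ones, skip the busy ones.
 * Returns the number of objects that had to be skipped.
 */
static size_t scan_once(struct toy_object *objs, size_t n)
{
    size_t skipped = 0;

    for (size_t i = 0; i < n; i++) {
        if (objs[i].blocks_on_dev == 0)
            continue;               /* nothing on the dying device */
        if (objs[i].busy_passes > 0) {
            objs[i].busy_passes--;  /* models other paging I/O finishing */
            skipped++;
            continue;
        }
        objs[i].blocks_on_dev = 0;  /* models paging the blocks back in */
    }
    return skipped;
}

/* Rescan until a pass skips nothing; returns the number of passes. */
static int drain_device(struct toy_object *objs, size_t n)
{
    int passes = 0;
    size_t skipped;

    do {
        skipped = scan_once(objs, n);
        passes++;
        /* A kernel version would sleep here when skipped > 0,
         * so that it gives up the processor between passes. */
    } while (skipped > 0);
    return passes;
}
```

With four objects of which two are busy for two passes and one pass respectively, `drain_device()` finishes in three passes, which matches the observation that only a few rescans are needed.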
Re: swapoff?
This is looking really good. I'm going to start running it on my
-current boxes. I think it could be committed after the 5.0 release
rolls, as well as MFCd to -stable (which I would be happy to do the
work for).

-Matt
Matthew Dillon [EMAIL PROTECTED]

:No, I've tested it extensively, and I haven't been able to
:reproduce the problem since I updated my sources. (It was hard to
:reproduce beforehand.) I did two more runs with one swap device
:and two runs with two swap devices, and it worked even when the
:system was thrashing.
:
:The latest patches are at
:
: http://www.CSUA.Berkeley.EDU/~das/swapoff.patch4
:
:Performance is now much better when there are multiple swap
:devices. Instead of effectively having to wait for each hash
:chain to become quiescent, swapoff now skips busy objects, then
:does a complete rescan if it missed anything. Only a few rescans
:are required, even with multiple active swap devices.
:A clustering optimization might still be worthwhile, but that can
:be done another day.
:
:(Sorry for the delay in getting back to you. I've been too busy
:and sick for the last week to work on this.)
Re: swapoff?
Thus spake Matthew Dillon [EMAIL PROTECTED]:
> :The concern was that there could be a race where the process is
> :swapped out again after I have swapped it back in but before I can
> :dirty its pages. (Perhaps I need to hold the process lock a bit
> :longer.) Under heavy swapping load, swapoff() is failing to find
> :a single page about one time out of ten, and I thought that might
> :be the cause.
>
> Have you definitively tracked down the missing page? It ought
> to be fairly easy to do from a kernel core with a gdb macro.

I haven't figured out gdb macros yet, and I'm kinda busy
throughout the week, but I'll try to track down the page this
weekend.

> This sounds reasonable, it's certainly worth a shot. Another thing
> you may be able to do is to try to cluster the pageins on an
> object-by-object basis since our swap-scanning is probably resulting
> in having to wait for the PIP count of the same object many times
> (for large objects with lots of swap blocks).
>
> Hmm. Since we have to track down every swap block and we can (do?)
> get a count of pages that need to be swapped in, we could remove the
> PIP wait code and simply loop on the swap hash table over and over
> again until we've found all the swap blocks that we know have been
> allocated to that device. Or something like that.

The latter idea was more or less what I had in mind. The only
catch is that you have to wait at least once for each pass or
you'll never give up the processor. It solves the basic problem
with the current approach, which is that you get livelock if
there's too much swapping activity to other swap devices.
Re: swapoff?
[ Latest patches at http://csua.berkeley.edu/~das/swapoff.patch3 ]

Thus spake Matthew Dillon [EMAIL PROTECTED]:
> :I'm worried that vm_proc_swapin_all() has a similar race with the
> :swapout daemon. Presently I assume that my references to the
> :UPAGES object and the associated pages remain valid after the
> :faultin(), and that I can use swap_pager_freeswapspace() to free
> :the correct metadata, instead of calling swap_pager_unswapped() on
> :each page. Should I just hold the process lock until the metadata
> :are freed?
>
> Hmm. Well, the proc lock is not held during vm_proc_swapin() (but
> the PS_SWAPPINGIN flag is set). The proc lock is held during
> vm_proc_swapout(). In your vm_proc_swapin_all() you seem to be
> doing the right thing in regards to the mutexes and retry, and you
> have already marked the device as SW_CLOSING so if something does
> get in there and try to swap the process back in it shouldn't
> allocate swap you are trying to free. I think you may be ok.

The concern was that there could be a race where the process is
swapped out again after I have swapped it back in but before I can
dirty its pages. (Perhaps I need to hold the process lock a bit
longer.) Under heavy swapping load, swapoff() is failing to find
a single page about one time out of ten, and I thought that might
be the cause.

I have tweaked swap_pager.c as you suggested earlier. It runs
about an order of magnitude slower under load now, since it's
doing a vm_object_pip_wait() on every swap-backed object in the
system that's currently paging, even for objects that are paging
to a different swap device. Unless you have a better idea, I
think one way to improve performance might be to skip the busy
objects, and after the whole hash has been scanned, rescan
starting at the first index that was skipped. Of course, it would
have to wait for at least one object on each iteration so it
doesn't get into a tight loop.

Another important optimization is to page in the entire block at
once, rather than doing it a page at a time. I tried to do this
with the following algorithm:

  - grab SWAP_META_PAGES pages
  - note which ones are already in core using a bitmap
  - call getpages() to retrieve the entire range
  - re-lookup all of the pages at the appropriate offset within the
    object in case they've changed or gone away
  - dirty them, move them to the appropriate queue (based on the
    values in the bitmap computed earlier), and remove their backing
    store

This didn't work, and it produced all sorts of interesting panics
for reasons I haven't yet figured out. My latest patch has some
remnants of some of my attempts in swp_pager_force_pagein(), but
I'll probably leave that optimization for another day unless you
can see an obvious flaw in my approach.

BTW, thanks for all of your help!
Re: swapoff?
:The concern was that there could be a race where the process is
:swapped out again after I have swapped it back in but before I can
:dirty its pages. (Perhaps I need to hold the process lock a bit
:longer.) Under heavy swapping load, swapoff() is failing to find
:a single page about one time out of ten, and I thought that might
:be the cause.

Have you definitively tracked down the missing page? It ought
to be fairly easy to do from a kernel core with a gdb macro.

:I have tweaked swap_pager.c as you suggested earlier. It runs
:about an order of magnitude slower under load now, since it's
:doing a vm_object_pip_wait() on every swap-backed object in the
:system that's currently paging, even for objects that are paging
:to a different swap device. Unless you have a better idea, I
:think one way to improve performance might be to skip the busy
:objects, and after the whole hash has been scanned, rescan
:starting at the first index that was skipped. Of course, it would
:have to wait for at least one object on each iteration so it
:doesn't get into a tight loop.

This sounds reasonable, it's certainly worth a shot. Another thing
you may be able to do is to try to cluster the pageins on an
object-by-object basis since our swap-scanning is probably resulting
in having to wait for the PIP count of the same object many times
(for large objects with lots of swap blocks).

Hmm. Since we have to track down every swap block and we can (do?)
get a count of pages that need to be swapped in, we could remove the
PIP wait code and simply loop on the swap hash table over and over
again until we've found all the swap blocks that we know have been
allocated to that device. Or something like that.

:Another important optimization is to page in the entire block at
:once, rather than doing it a page at a time. I tried to do this
:with the following algorithm:
:
: - grab SWAP_META_PAGES pages
: - note which ones are already in core using a bitmap
: - call getpages() to retrieve the entire range
:...
:
:This didn't work, and it produced all sorts of interesting panics
:for reasons I haven't yet figured out. My latest patch has some
:remnants of some of my attempts in swp_pager_force_pagein(), but
:I'll probably leave that optimization for another day unless you
:can see an obvious flaw in my approach.

I can probably deal with that post-commit. Let's ignore it for now.
The main goal at the moment should be robustness.

:BTW, thanks for all of your help!

-Matt
Matthew Dillon [EMAIL PROTECTED]
Re: swapoff?
:
:Thanks, your solution looks pretty good. I guess as part of the
:try_to_page_it_all_in, I'll want to call swap_pager_unswapped() on
:each page. Now I really wish I had noticed swap_pager_unswapped()
:earlier; it would have made my job much easier!

As long as you properly check and dirty the page you can get rid
of the backing swap.

:I'm worried that vm_proc_swapin_all() has a similar race with the
:swapout daemon. Presently I assume that my references to the
:UPAGES object and the associated pages remain valid after the
:faultin(), and that I can use swap_pager_freeswapspace() to free
:the correct metadata, instead of calling swap_pager_unswapped() on
:each page. Should I just hold the process lock until the metadata
:are freed?

Hmm. Well, the proc lock is not held during vm_proc_swapin() (but
the PS_SWAPPINGIN flag is set). The proc lock is held during
vm_proc_swapout(). In your vm_proc_swapin_all() you seem to be
doing the right thing in regards to the mutexes and retry, and you
have already marked the device as SW_CLOSING so if something does
get in there and try to swap the process back in it shouldn't
allocate swap you are trying to free. I think you may be ok.

-Matt
Matthew Dillon [EMAIL PROTECTED]
Re: swapoff?
Thus spake Matthew Dillon [EMAIL PROTECTED]:
> This is a sticky situation because both the VM object and the
> swblocks may be manipulated by other processes when you block.
> I think what you need to try to do is this (it's a mess, if you
> can think of a better solution definitely go another route!)
>
>     while ((swap = *pswap) != NULL) {
>         if (anything_is_swapped_to_the_device) {
>             try_to_page_it_all_in (note that the swblock
>                 structure is invalid the moment you block,
>                 so swp_pager_force_pagein() should be given
>                 the whole range).
>             /* fall through to retry */
>         } else if (the_related_object_pip_count_is_not_zero) {
>             vm_object_pip_sleep(...)
>             /* fall through to retry */
>         } else if (swap->swb_count <= 0) {
>             free the swap block
>             *pswap = swap->swb_hnext;
>         }
>     }

Thanks, your solution looks pretty good. I guess as part of the
try_to_page_it_all_in, I'll want to call swap_pager_unswapped() on
each page. Now I really wish I had noticed swap_pager_unswapped()
earlier; it would have made my job much easier!

I'm worried that vm_proc_swapin_all() has a similar race with the
swapout daemon. Presently I assume that my references to the
UPAGES object and the associated pages remain valid after the
faultin(), and that I can use swap_pager_freeswapspace() to free
the correct metadata, instead of calling swap_pager_unswapped() on
each page. Should I just hold the process lock until the metadata
are freed?
Re: swapoff?
Rather than spamming the list with another 37K patch, I posted a
revised version at

    http://csua.berkeley.edu/~das/swapoff.patch2

Thus spake Matthew Dillon [EMAIL PROTECTED]:
> The swap_pager_isswapped() function may not be doing a sufficient
> test:
> [...]
> It is quite possible for a VM page to be present but invalid,
> meaning that the swap is still valid. You could incorrectly return
> that the object is not swapped when in fact it is. BUT, since you
> only appear to be making this call on the process's UPAGES object,
> there may not be a problem. Perhaps the best thing to do is to
> not do the vm_page_lookup() call and instead just unconditionally
> faultin() the uarea if it looks like there might be a problem.

I revised the patches to do as you suggested. It turns out that a
couple of extra lines are needed, because when the scan over all
processes restarts, pagein_all() will no longer automagically skip
over processes it has already swapped in. It has to immediately
free the swap metadata for the UPAGES object and dirty the
associated pages (as opposed to letting swap_pager_swapoff do it);
otherwise it will loop forever trying to swap in the same process.

> You may need a master lock to ensure that only swapon() or
> swapoff() is 'in progress' at any given moment.

Added. (This was a deficiency in the original swapon() as well.)

> The vm_page_grab() call below may block, I think:
> [...]
> I think you may want to do the pip_add before calling
> vm_page_grab().

Yep, fixed.

I also tweaked the calculation that determines whether there is
enough virtual memory to remove the device, but it doesn't seem to
detect when there is insufficient space. (I actually thought it
was right the first time.) Can you see anything obviously wrong
with my math?

The code works fine in all of my tests, except that calling
swapoff() when the system is under heavy paging load and has
multiple swap devices sometimes leads to a few pages being missed
by the scan. I think the problem is that some process allocates
some swap and starts paging out just before the device is marked
as off-limits. Am I missing a simple solution to this problem?
(For now, I kludge around the issue by rescanning if there are
still blocks remaining.)
Re: swapoff?
Thus spake Nate Lawson [EMAIL PROTECTED]:
> Nice, thanks for doing this. How about some more accurate names for
> the userland routines instead of this_is_swapoff and twiddle?

Sure, suggest something and I'll change it. I shamelessly stole
'this_is_swapoff' from w / uptime, but you can blame me for
'twiddle'. The function was originally called 'add', I think, but
now it adds or removes depending on whether it's being called as
swapon or swapoff...
Re: swapoff?
:The code works fine in all of my tests, except that calling
:swapoff() when the system is under heavy paging load and has
:multiple swap devices sometimes leads to a few pages being missed
:by the scan. I think the problem is that some process allocates
:some swap and starts paging out just before the device is marked
:as off-limits. Am I missing a simple solution to this problem?
:(For now, I kludge around the issue by rescanning if there are
:still blocks remaining.)

Hmm. Yes, I think the issue here is that you may be missing pages in
objects which are undergoing I/O. You may need to wait for other
paging on the object (the pip count) to go to zero. I will review
that code more carefully in a little bit and give you a definitive
answer.

-Matt
Matthew Dillon [EMAIL PROTECTED]
Re: swapoff?
:...
:detect when there is insufficient space. (I actually thought it
:was right the first time.) Can you see anything obviously wrong
:with my math?
:
:The code works fine in all of my tests, except that calling
:swapoff() when the system is under heavy paging load and has
:multiple swap devices sometimes leads to a few pages being missed
:by the scan. I think the problem is that some process allocates
:some swap and starts paging out just before the device is marked
:as off-limits. Am I missing a simple solution to this problem?
:(For now, I kludge around the issue by rescanning if there are
:still blocks remaining.)

Ok, I think the problem is in swap_pager_swapoff() and
swp_pager_force_pagein(). Another process may be manipulating the
swblock (or a prior swblock) while swp_pager_force_pagein() is
blocked. In fact, the swap block can be ripped out from under
swap_pager_swapoff() if swp_pager_force_pagein() blocks. i.e. the
'swap' structure may be invalid after you call
swp_pager_force_pagein().

This is a sticky situation because both the VM object and the
swblocks may be manipulated by other processes when you block.
I think what you need to try to do is this (it's a mess, if you
can think of a better solution definitely go another route!)

    while ((swap = *pswap) != NULL) {
        if (anything_is_swapped_to_the_device) {
            try_to_page_it_all_in (note that the swblock
                structure is invalid the moment you block,
                so swp_pager_force_pagein() should be given
                the whole range).
            /* fall through to retry */
        } else if (the_related_object_pip_count_is_not_zero) {
            vm_object_pip_sleep(...)
            /* fall through to retry */
        } else if (swap->swb_count <= 0) {
            free the swap block
            *pswap = swap->swb_hnext;
        }
    }

-Matt
Matthew Dillon [EMAIL PROTECTED]
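The retry loop in this message can be exercised outside the kernel with a toy hash chain. The following is a minimal sketch, not the kernel code: `toy_swblock`, `make_block()`, `page_in_range()` and `drain_chain()` are hypothetical names, decrementing `pip` stands in for vm_object_pip_sleep(), and the final `else` that advances to the next block is an addition the pseudocode leaves implicit (without it, a block that still holds swap on another device would be retried forever).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define META_PAGES 4     /* stand-in for SWAP_META_PAGES */
#define BLK_NONE   (-1)  /* stand-in for SWAPBLK_NONE */

/* Hypothetical, simplified swblock: one device index per page slot. */
struct toy_swblock {
    struct toy_swblock *hnext;
    int pip;                /* paging-in-progress count of the object */
    int count;              /* number of slots that hold swap */
    int pages[META_PAGES];  /* device index per slot, or BLK_NONE */
};

static struct toy_swblock *make_block(const int pages[META_PAGES], int pip,
                                      struct toy_swblock *next)
{
    struct toy_swblock *s = malloc(sizeof(*s));

    assert(s != NULL);
    s->hnext = next;
    s->pip = pip;
    s->count = 0;
    for (int i = 0; i < META_PAGES; i++) {
        s->pages[i] = pages[i];
        if (pages[i] != BLK_NONE)
            s->count++;
    }
    return s;
}

static int on_device(const struct toy_swblock *s, int dev)
{
    for (int i = 0; i < META_PAGES; i++)
        if (s->pages[i] == dev)
            return 1;
    return 0;
}

/* Models paging in the whole range and releasing its swap. */
static void page_in_range(struct toy_swblock *s, int dev)
{
    for (int i = 0; i < META_PAGES; i++) {
        if (s->pages[i] == dev) {
            s->pages[i] = BLK_NONE;
            s->count--;
        }
    }
}

static void drain_chain(struct toy_swblock **pswap, int dev)
{
    struct toy_swblock *swap;

    while ((swap = *pswap) != NULL) {
        if (on_device(swap, dev)) {
            page_in_range(swap, dev);   /* then retry the same slot */
        } else if (swap->pip > 0) {
            swap->pip--;                /* models vm_object_pip_sleep() */
        } else if (swap->count <= 0) {
            *pswap = swap->hnext;       /* unlink and free the empty block */
            free(swap);
        } else {
            pswap = &swap->hnext;       /* assumption: move to next block */
        }
    }
}
```

Because every branch re-reads `*pswap` at the top of the loop, nothing cached in `swap` is trusted across a step that could block, which is the point of the original suggestion.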
Re: swapoff?
:I'm resurrecting this thread because I finally got around to
:finishing up the patches to implement swapoff. I would appreciate
:some review of them, particularly to verify that I have done the
:right thing WRT synchronization. I have not optimized it to do
:read clustering, but I have ensured that such an optimization
:could be made. Other than that, I don't know of any deficiencies.

This is great, David. The code is about as clean as it's possible
to make in a swapoff implementation. There are some minor
inefficiencies, such as some shortcuts that can be taken in the
blist code, but I don't think we have to worry about them for this
initial implementation.

The SW_CLOSING test is an excellent solution to dealing with the
swap bitmap when paging in from the dying swap area.

The swap_pager_isswapped() function may not be doing a sufficient
test:

:+    pswap = swp_pager_hash(object, index);
:+
:+    if ((swap = *pswap) != NULL) {
:+        for (i = 0; i < SWAP_META_PAGES; ++i) {
:+            daddr_t v = swap->swb_pages[i];
:+            if (v != SWAPBLK_NONE &&
:+                BLK2DEVIDX(v) == devidx &&
:+                !vm_page_lookup(object, swap->swb_index+i))
:+                return 1;
:+        }
:+    }

It is quite possible for a VM page to be present but invalid,
meaning that the swap is still valid. You could incorrectly return
that the object is not swapped when in fact it is. BUT, since you
only appear to be making this call on the process's UPAGES object,
there may not be a problem. Perhaps the best thing to do is to
not do the vm_page_lookup() call and instead just unconditionally
faultin() the uarea if it looks like there might be a problem.

You may need a master lock to ensure that only swapon() or
swapoff() is 'in progress' at any given moment.

The vm_page_grab() call below may block, I think:

:+
:+    if (object->type != OBJT_SWAP)
:+        panic("swp_pager_force_pagein: object not backed by swap");
:+
:+    m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL | VM_ALLOC_RETRY);
:+    if (m->valid == VM_PAGE_BITS_ALL) {
:+        /*
:+         * The page is already in memory, but must be
:+         * dirtied, since we're taking away its backing store.
:+         */
:+        vm_page_lock_queues();
:+        vm_page_activate(m);
:+        vm_page_dirty(m);
:+        vm_page_wakeup(m);
:+        vm_page_unlock_queues();
:+        return 1;
:+    }
:+
:+    vm_object_pip_add(object, 1);

I think you may want to do the pip_add before calling
vm_page_grab().

-Matt
Matthew Dillon [EMAIL PROTECTED]
Re: swapoff?
Nice, thanks for doing this. How about some more accurate names for
the userland routines instead of this_is_swapoff and twiddle?

-Nate
Re: swapoff?
:BTW, NetBSD's new UVM code has the ability to do this. Perhaps
:it's worth looking in to how difficult it would really be in FreeBSD...

Someone got it mostly working a year or two ago if I remember right
but I don't know what happened to it finally.

Implementing swapoff is a bunch of grunt-work but not too hard in
concept. Basically the work involved is this:

* Make a calculation to be sure that it is possible to turn off
  the swap device and not run the system out of VM. If it is not
  possible do not allow the swapoff.

* Allocate all the free bitmap bits related to the swap device you
  are trying to remove to prevent pageouts to the device you are
  removing.

* Flag the swap device being removed and then scan all OBJT_SWAP
  VM Objects looking for swap blocks associated with the device,
  and force a page-in of those blocks. The getpages code for the
  swap backing store would detect the flag and not clear the swap
  bitmap bits as it pages-in the data.

  (Forcing a pagein may force pages to cycle back out to another
  swap device, so special treatment of the paged-in pages (like
  immediately placing it in the VM page cache instead of the
  active or inactive queues) is necessary to reduce load effects
  on the system.)

* The swap device being removed can now be closed and the related
  swap device index marked free.

-Matt
Matthew Dillon [EMAIL PROTECTED]
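The first step of the outline above, the sanity check before allowing the swapoff, amounts to comparing what lives on the device against the room left elsewhere. The sketch below is one possible reading of that step and not code from any patch: `toy_swapdev` and `can_swapoff()` are hypothetical names, and counting free memory plus free space on the other non-closing devices as the available room is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device accounting, in pages. */
struct toy_swapdev {
    long total_pages;
    long used_pages;
    bool closing;       /* already being removed, so not usable as room */
};

/*
 * Returns true if everything currently swapped to devs[victim] would fit
 * into free memory plus the free space of the remaining swap devices.
 */
static bool can_swapoff(const struct toy_swapdev *devs, int ndevs,
                        int victim, long free_mem_pages)
{
    long room = free_mem_pages;

    for (int i = 0; i < ndevs; i++) {
        if (i == victim || devs[i].closing)
            continue;
        room += devs[i].total_pages - devs[i].used_pages;
    }
    return devs[victim].used_pages <= room;
}
```

A check like this only guards against obvious mistakes at the moment of the call; it does not reserve the room, so the system can still fill up while the device is being drained.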
Re: swapoff?
Thus spake Matthew Dillon [EMAIL PROTECTED]:
> Implementing swapoff is a bunch of grunt-work but not too hard in
> concept. Basically the work involved is this:

Sounds like a plan, and not too tricky. Perhaps I'll see if I can
figure it out when I have some free time.

> * Make a calculation to be sure that it is possible to turn off
>   the swap device and not run the system out of VM. If it is not
>   possible do not allow the swapoff.

Can't you have a race condition here where you decide that you
have enough space, and by the time you've deallocated half of the
swapfile that's no longer the case? It seems like the correct
thing to do in that case is abort the system call (which could be
painful). Perhaps the best thing to do in this case is wait for
vm_pageout_scan to kill a few pigs.
Re: swapoff?
:Can't you have a race condition here where you decide that you
:have enough space, and by the time you've deallocated half of the
:swapfile that's no longer the case? It seems like the correct
:thing to do in that case is abort the system call (which could be
:painful). Perhaps the best thing to do in this case is wait for
:vm_pageout_scan to kill a few pigs.

I wouldn't worry about it. Nobody turns off swap on a running
system at a whim. It just needs to prevent stupid mistakes like
trying to remove a swap device without having adequate
memory + other swap to take care of the data.

-Matt
Matthew Dillon [EMAIL PROTECTED]
Re: swapoff?
David Schultz wrote:
> Thus spake Matthew Dillon [EMAIL PROTECTED]:
> > Implementing swapoff is a bunch of grunt-work but not too hard in
> > concept. Basically the work involved is this:
>
> Sounds like a plan, and not too tricky. Perhaps I'll see if I can
> figure it out when I have some free time.
>
> > * Make a calculation to be sure that it is possible to turn off
> >   the swap device and not run the system out of VM. If it is not
> >   possible do not allow the swapoff.
>
> Can't you have a race condition here where you decide that you
> have enough space, and by the time you've deallocated half of the
> swapfile that's no longer the case? It seems like the correct
> thing to do in that case is abort the system call (which could be
> painful). Perhaps the best thing to do in this case is wait for
> vm_pageout_scan to kill a few pigs.

One system I used to use years and years ago separated this process
into stages. The swap(1M) command could be used to enable, disable
and 'weight' allocation to swap areas. The add was easy. 'delete'
would cause the system to attempt to page in everything on the
device, but if the system looked like it was going to run out of
resources it would fail and stop right there. You could either turn
allocation back on, or kill processes or wait for the pager to catch
up with moving stuff out to other swap spaces. When (if) it finally
hit zero inuse, it would be deleted.

It did manage multiple swap spaces as separate entities with
different fill levels etc [rather than one giant logical swap area],
so doing it this way kinda made sense.

I did actually use it once and it even worked. :-) (I cannibalized
my /tmp file system and used it for swap for a project, and then
turned it off and re-mkfs'ed it)

Cheers,
-Peter
--
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
All of this is for nothing if we don't go to the stars - JMS/B5
Re: swapoff?
Also this would probably be useful in the situation when you need
to change the swap device on a running system. We had to do this
once or twice on a very busy commercial mail server running Solaris.
We needed to dismount the current swap device and use it for another
purpose, having switched paging/swapping to another disk.

> I wouldn't worry about it. Nobody turns off swap on a running
> system at a whim. It just needs to prevent stupid mistakes like
> trying to remove a swap device without having adequate
> memory + other swap to take care of the data.
>
> -Matt
> Matthew Dillon [EMAIL PROTECTED]

--
Andrey Alekseyev. Zenon N.S.P.
Re: swapoff?
Matthew Dillon wrote:
> * Flag the swap device being removed and then scan all OBJT_SWAP
>   VM Objects looking for swap blocks associated with the device,
>   and force a page-in of those blocks. The getpages code for the
>   swap backing store would detect the flag and not clear the swap
>   bitmap bits as it pages-in the data.
>
>   (Forcing a pagein may force pages to cycle back out to another
>   swap device, so special treatment of the paged-in pages (like
>   immediately placing it in the VM page cache instead of the
>   active or inactive queues) is necessary to reduce load effects
>   on the system.)

Uh... so you set the bit that tells you it's allocated to prevent
it being allocated?

When I swap something in and the bit is set, how do I know that
it's in, except that it's not allocated?

In other words, I do what you say... how do I know when the device
has been drained out, vs. being in use?

I think you have to disable swapping to the device some other way,
and then return from the swapoff only when the bitmap is all zero.

-- Terry
Re: swapoff?
On Sat, Jul 13, 2002 at 03:17:33AM -0700, Terry Lambert wrote:
> Matthew Dillon wrote:
> > * Flag the swap device being removed and then scan all OBJT_SWAP
> >   VM Objects looking for swap blocks associated with the device,
> >   and force a page-in of those blocks. The getpages code for the
> >   swap backing store would detect the flag and not clear the swap
> >   bitmap bits as it pages-in the data.
> >
> >   (Forcing a pagein may force pages to cycle back out to another
> >   swap device, so special treatment of the paged-in pages (like
> >   immediately placing it in the VM page cache instead of the
> >   active or inactive queues) is necessary to reduce load effects
> >   on the system.)
>
> Uh... so you set the bit that tells you it's allocated to prevent
> it being allocated?
>
> When I swap something in and the bit is set, how do I know that
> it's in, except that it's not allocated?
>
> In other words, I do what you say... how do I know when the device
> has been drained out, vs. being in use?
>
> I think you have to disable swapping to the device some other way,
> and then return from the swapoff only when the bitmap is all zero.

Well, I think that disabling the device another way would probably
be a better approach, but you don't *have* to do it another way.
You can:

 1) Flag the device as being removed.
 2) Scan through the bitmap, and for each page allocated, add it to
    a list of pages to be paged in. For each page free, set its bit
    to keep it from being used.
 3) Once you've set all of the bits to 1, force a pagein of every
    page on the list. If the size of the list is prohibitive, you
    can force the pagein every time the list becomes full (a
    low/high watermark system would probably be the most effective).
 4) When pages get paged in, check the 'device being removed' flag
    and only clear the bit in the bitmap if the flag isn't set.
    Also, decrement a counter of the total pages in use on the
    device.
 5) When the counter reaches zero, remove the device.

One would need to increment the counter when they page out,
obviously, but that is not a problem.

This works, but it has the potential problem that if the list of
pages to be paged in can't grow large enough, you might page pages
out to the device you're trying to disable, only to page them back
in basically immediately. This is rather silly, but would still
function.

Personally, I think it would be more intuitive to add a check to
the allocation algorithm that forces it to not consider devices
flagged for removal, and mark each page as free after it comes in.
When the bitmap is clean you're done. However, you're still going
to want to count the pages allocated on a device because it's a lot
easier to check the counter than to scan the whole bitmap.

--
Jonathan Mini [EMAIL PROTECTED]
http://www.freebsd.org/
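The alternative preferred at the end of this message (have the allocator skip devices flagged for removal and keep a per-device count of pages in use) can be modelled in a few lines. This is a sketch under stated assumptions rather than kernel code: `toy_dev`, `alloc_page()`, `pagein_done()` and `removable()` are hypothetical names, and the first-fit choice among devices is only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state. */
struct toy_dev {
    bool removing;   /* set once swapoff has been requested */
    long in_use;     /* pages currently swapped to this device */
    long capacity;   /* total pages on this device */
};

/*
 * Pick a device for a pageout, skipping devices flagged for removal
 * and devices that are full. Returns the device index, or -1.
 */
static int alloc_page(struct toy_dev *devs, int ndevs)
{
    for (int i = 0; i < ndevs; i++) {
        if (devs[i].removing || devs[i].in_use >= devs[i].capacity)
            continue;
        devs[i].in_use++;    /* counter goes up on every pageout */
        return i;
    }
    return -1;
}

/* Called when a page has been read back in and its swap released. */
static void pagein_done(struct toy_dev *dev)
{
    assert(dev->in_use > 0);
    dev->in_use--;
}

/* Checking the counter is cheaper than scanning the whole bitmap. */
static bool removable(const struct toy_dev *dev)
{
    return dev->removing && dev->in_use == 0;
}
```

Once `removing` is set, the counter for that device can only fall, so `removable()` eventually becomes true as long as every page on the device is found and paged in.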
Re: swapoff?
Jon Mini wrote:
> Personally, I think it would be more intuitive to add a check to
> the allocation algorithm that forces it to not consider devices
> flagged for removal, and mark each page as free after it comes in.
> When the bitmap is clean you're done. However, you're still going
> to want to count the pages allocated on a device because it's a lot
> easier to check the counter than to scan the whole bitmap.

Flagging is how you would have to do it. Counting is fairly
useless, since it assumes that, given a bit in the bitmap, you can
reverse lookup the page that points to it, and there is not a
reverse mapping for this, only a forward (swap mappings swapped out
don't write swap metadata to the disk). I guess if you did it over
and over again, you could know when you are done.

I think that what has to happen is that you are going to have to
scan all page mappings in the system to find anyone using the
device, and page it in that way.

My reasoning on this is that the page mappings exist only in the
context of the current process address space. This means that
unless you force in each process in turn so that the mappings are
relevant to its address space, and then force out what's in core
(to a new device) in order to get the pages swapped to the
disappearing device, the only really clean way to do it is to
rewrite the mappings after manually copying the page from one
device to the other using non-swappable pages in the kernel address
space as temporary buffers.

It's doable, but it's ugly... it's nearly the same problem you'd
face if you wanted to defrag kernel memory so that you could do a
contigmalloc very late in the game, only a bit easier (the kernel
does not set ELF attributes to indicate pages containing code or
data not in the paging path for you to do this).

I'm not positive if you would need to lock the pages you are
moving, or not. I think so. The problem is when you have a program
that has something like a writeable data page from a shared
library, which has been copied via copy-on-write, and then the
program that did it forks (this happens all the time in any fork
for libc and similar offset fixups). You would end up needing to
lock all processes which referenced a page that had been swapped
out, with multiple vm_object_t's.

Maybe I'm missing something, and there really is a way to get this
information back from a known bit in the bitmap... Matt? It was
your suggestion? Am I not seeing a function right in front of me?

-- Terry
Re: swapoff?
Thus spake Peter Wemm [EMAIL PROTECTED]: One system I used to use years and years ago separated this process into stages. The swap(1M) command could be used to enable, disable and 'weight' allocation to swap areas. The add was easy. 'delete' would cause the device to be attempted to be paged in, but if the system looked like it was going to run out of resources it would fail and stop right there. You could either turn allocation back on, or kill processes or wait for the pager to catch up with moving stuff out to other swap spaces. When (if) it finally hit zero inuse, it would be deleted. It did manage multiple swap spaces as separate entities with different fill levels etc [rather than one giant logical swap area], so doing it this way kinda made sense. I did actually use it once and it even worked. :-) (I cannibalized my /tmp file system and used it for swap for a project, and then turned it off and re-mkfs'ed it) The weight idea is very interesting. NetBSD does this using priorities; all the swap devices of a given priority are filled round robin before devices of lower priority, the idea being that the slower ones are a last resort (e.g. NFS). On the other hand, this design allows large and fast swap devices to start swapping to death before the `backup' devices see any action. It isn't clear to me whether priorities or fill levels are better. (Certainly a hybrid is possible, that is, weights within priority levels.) This may be a better project for me than swapoff in the immediate future because I won't have to understand how to track down the appropriate VM objects and handle them in a kosher manner. Implementing weights/priorities will also involve dynamically allocating struct swdevt's, which should be done anyway and will only be harder after swapoff() is written. BTW, I believe the comment about swfree() in vm_swap.c is outdated as of rev. 1.17, and nothing uses SW_FREED anymore.
This means that technically, swap devices don't have any flags right now, but that could change with swapoff().
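The priority scheme described in the message above (round robin within a priority level, lower levels used only once higher ones are full) can be sketched roughly as follows. This is a user-space illustration only, not NetBSD's code: `struct swap_dev`, `swap_alloc`, the `cursor` argument, and the convention that a smaller `prio` value means "preferred" are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device record; not a real kernel structure. */
struct swap_dev {
    int    prio;   /* assumption: smaller value = preferred */
    size_t nblks;  /* capacity in pages */
    size_t used;   /* pages allocated so far */
};

/*
 * Allocate one page. devs[] must be sorted by ascending prio.
 * *cursor holds the index of the device used by the previous
 * allocation (initialise it to ndevs to mean "none yet"), so that
 * devices sharing a priority level are filled round robin.
 * Returns the chosen device index, or -1 if every device is full.
 */
int swap_alloc(struct swap_dev *devs, size_t ndevs, size_t *cursor)
{
    assert(devs != NULL && cursor != NULL);

    size_t lo = 0;
    while (lo < ndevs) {
        /* Find the end of the current priority level: [lo, hi). */
        size_t hi = lo + 1;
        while (hi < ndevs && devs[hi].prio == devs[lo].prio)
            hi++;

        size_t n = hi - lo;
        /* Resume just after the last device used in this level. */
        size_t start = (*cursor >= lo && *cursor < hi)
            ? *cursor - lo + 1 : 0;

        for (size_t k = 0; k < n; k++) {
            size_t i = lo + (start + k) % n;
            if (devs[i].used < devs[i].nblks) {
                devs[i].used++;
                *cursor = i;
                return (int)i;
            }
        }
        lo = hi;  /* whole level is full: fall to the next one */
    }
    return -1;
}
```

With two two-page devices at one level and a backup device at the next, the first four allocations alternate between the fast pair and only the fifth reaches the backup, which is exactly the behaviour the message worries about.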
Re: swapoff?
David Schultz wrote: The weight idea is very interesting. NetBSD does this using priorities; all the swap devices of a given priority are filled round robin before devices of lower priority, the idea being that the slower ones are a last resort (e.g. NFS). On the other hand, this design allows large and fast swap devices to start swapping to death before the `backup' devices see any action. It isn't clear to me whether priorities or fill levels are better. (Certainly a hybrid is possible, that is, weights within priority levels.) I like the idea of a moving average on time-from-request-to-service. 8-). Works great for Server Load Balancing, too. The moving average takes load into account, without explicit load notification (i.e. no need to have a load notification protocol between NFS clients and servers, etc.). This may be a better project for me than swapoff in the immediate future because I won't have to understand how to track down the appropriate VM objects and handle them in a kosher manner. Implementing weights/priorities will also involve dynamically allocating struct swdevt's, which should be done anyway and will only be harder after swapoff() is written. 8-). Now that everyone is talking about it, better get my hacks in first, so that other people have to integrate with my changes, instead of the other way around... Actually, I think it's a nice idea for an incremental project. -- Terry
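One way to read the moving-average suggestion above is as an exponentially weighted moving average of each device's request-to-service time, with new page-outs sent to whichever device currently looks fastest. The message does not say which kind of moving average is meant, so the exponential form, the `alpha` smoothing factor, and all names below are assumptions. Floating point is used only to keep the sketch short; kernel code would presumably use integer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device record; not a real kernel structure. */
struct swap_dev {
    double avg_latency_ms;  /* smoothed request-to-service time */
};

/*
 * Fold one measured service time into the device's average.
 * alpha in (0, 1] controls how quickly old samples are forgotten.
 */
void swap_record_latency(struct swap_dev *d, double sample_ms, double alpha)
{
    assert(alpha > 0.0 && alpha <= 1.0);
    d->avg_latency_ms = alpha * sample_ms + (1.0 - alpha) * d->avg_latency_ms;
}

/* Return the index of the device with the lowest smoothed latency. */
size_t swap_pick_fastest(const struct swap_dev *devs, size_t ndevs)
{
    assert(ndevs > 0);
    size_t best = 0;
    for (size_t i = 1; i < ndevs; i++) {
        if (devs[i].avg_latency_ms < devs[best].avg_latency_ms)
            best = i;
    }
    return best;
}
```

Because a loaded device answers more slowly, its average rises and it stops being chosen, with no explicit load reporting needed.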
Re: swapoff?
We are not going to be doing any sort of weighting. It's an idea whose time has come... and gone again. It might have been useful 8 years ago but it is not useful today. Also, please note that it is not possible to reverse-lookup a swap bitmap block and get the VM object / page number. The OBJT_SWAP VM objects have to be scanned to get the swap bitmap blocks. Nor does it make much sense to try to 'record' the blocks somewhere, there could be hundreds of thousands of blocks and memory is not normally a luxury in this situation. All you need to do is prevent new blocks from being allocated from the old swap device. Since the radix tree bitmap code cannot make a distinction between devices the easiest way to do this is to simply allocate all the free bits associated with the device (which you can do), and prevent any existing allocated blocks from being freed from the bitmap (which is a simple calculation) ... and of course mark the page dirty again since its backing store is being ripped out from under it. -Matt
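The two steps in the message above can be sketched in user space as follows. This is only a toy model under stated assumptions: a flat `bool` array stands in for the radix tree bitmap, an array of `struct page` stands in for the scan over OBJT_SWAP objects, and the device is assumed to own one contiguous block range (the real mapping from logical swap blocks to a device may differ). None of the names are real kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page record; not the kernel's vm_page. */
struct page {
    long swblk;  /* swap block backing this page, or -1 if none */
    bool dirty;
};

/* Assumption: the device owns blocks [first, first + count). */
bool on_device(long blk, long first, long count)
{
    return blk >= first && blk < first + count;
}

/*
 * Step 1: allocate every free block belonging to the device, so the
 * shared allocator can never hand one out again. Returns the number
 * of blocks claimed.
 */
size_t swapoff_reserve(bool *bitmap, long first, long count)
{
    size_t claimed = 0;
    for (long b = first; b < first + count; b++) {
        if (!bitmap[b]) {
            bitmap[b] = true;
            claimed++;
        }
    }
    return claimed;
}

/*
 * Step 2: scan the pages; any page backed by a block on the device
 * is treated as paged in, detached from its block, and marked dirty
 * because its backing store is going away. The block's bit is left
 * set rather than freed. Returns the number of pages evicted.
 */
size_t swapoff_evict(struct page *pages, size_t npages,
                     const bool *bitmap, long first, long count)
{
    size_t evicted = 0;
    for (size_t i = 0; i < npages; i++) {
        long blk = pages[i].swblk;
        if (blk >= 0 && on_device(blk, first, count)) {
            assert(bitmap[blk]);  /* must still be marked allocated */
            pages[i].swblk = -1;
            pages[i].dirty = true;
            evicted++;
        }
    }
    return evicted;
}
```

After both steps every bit in the device's range is set and no page refers to it, so the range can be dropped without the allocator ever needing to know about devices.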
Re: swapoff?
Thus spake Matthew Dillon [EMAIL PROTECTED]: We are not going to be doing any sort of weighting. It's an idea whose time has come... and gone again. It might have been useful 8 years ago but it is not useful today. Also, please note that it is not possible to reverse-lookup a swap bitmap block and get the VM object / page number. The OBJT_SWAP VM objects have to be scanned to get the swap bitmap blocks. Nor does it make much sense to try to 'record' the blocks somewhere, there could be hundreds of thousands of blocks and memory is not normally a luxury in this situation. I'm aware of that. That's why swapoff is a harder project; it requires working at more levels of abstraction, not all of which I fully understand yet. At least most of the VM stuff is well-documented now. ;-) All you need to do is prevent new blocks from being allocated from the old swap device. Since the radix tree bitmap code cannot make a distinction between devices the easiest way to do this is to simply allocate all the free bits associated with the device (which you can do), and prevent any existing allocated blocks from being freed from the bitmap (which is a simple calculation) ... and of course mark the page dirty again since its backing store is being ripped out from under it. This makes sense. I was originally thinking of marking the device as off-limits to new allocations, but I realize now why that would not work. As long as the logical swap blocks that correspond to the device are still fair game for the swap pager, swapdev_strategy will still have to swap out to the device.
Re: swapoff?
Matthew Dillon wrote: We are not going to be doing any sort of weighting. It's an idea whose time has come... and gone again. It might have been useful 8 years ago but it is not useful today. Thank goodness! :-) Cheers, -Peter -- Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] All of this is for nothing if we don't go to the stars - JMS/B5
swapoff?
Not sure if this is offtopic here or not, so apologies ahead of time if so. It has been many years since I used Linux, but one thing I recall is that there was a `swapoff` command in Linux to complement the `swapon` command. Are there any patches or plans to implement such a thing in FreeBSD? My familiarity with the workings of FreeBSD is still pretty minimal. Are there certain reasons that there currently is no way to stop paging to a device/file? -- Sean Kelly | PGP KeyID: 77042C7B [EMAIL PROTECTED] | http://www.zombie.org
Re: swapoff?
Thus spake Sean Kelly [EMAIL PROTECTED]: My familiarity with the workings of FreeBSD is still pretty minimal. Are there certain reasons that there currently is no way to stop paging to a device/file? I imagine the implementation of this would be complicated, as it is in Linux. You'd have to prevent further allocations on the swap device, then figure out where to evict the pages already allocated on the device. You also have to be able to back out if you run out of space to put things in the process. Maybe someone who is familiar with the race conditions involved will implement it some day, but swapoff would only occasionally be useful... at least until everyone is using hot-swappable swap. ;-)
Re: swapoff?
BTW, NetBSD's new UVM code has the ability to do this. Perhaps it's worth looking into how difficult it would really be in FreeBSD...