Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Thu, Aug 23, 2007 at 11:26:48AM +0200, Peter Zijlstra wrote:
> On Thu, 2007-08-23 at 05:38 +0200, Nick Piggin wrote:
> > On Tue, Aug 21, 2007 at 04:07:15PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> > > >
> > > > Although interestingly, we are not guaranteed to have enough memory to
> > > > completely initialise writeout of a single page.
> > >
> > > Yes, that is due to the unbounded nature of direct reclaim, no?
> >
> > Even writing out a single page to a plain old block backed filesystem
> > can take a fair chunk of memory. I'm not really sure how problematic
> > this is with a "real" filesystem, but even with something pretty simple,
> > you might have to do block allocation, which itself might have to do
> > indirect block allocation (which itself can be 3 or 4 levels), all of
> > which have to actually update block bitmaps (which themselves may be
> > many pages big). Then you also may have to even just allocate the
> > buffer_head structure itself. And that's just to write out a single
> > buffer in the page (on a 64K page system, there might be 64 of these).
>
> Right, nikita once talked me through all that when we talked about
> clustered writeout.
>
> IIRC filesystems were supposed to keep mempools big enough to do this
> for a single writepage at a time. Not sure it's actually done though.

It isn't ;) At least I don't think so for the minix-derived ones I've seen. But no matter, this is going a bit off topic anyway.

> > But again, on the pragmatic side, the best behaviour I think is just
> > to have writeouts not allocate from reserves without first trying to
> > reclaim some clean memory, and also limit the number of users of the
> > reserve. We want this anyway, right, because we don't want regular
> > reclaim to start causing things like atomic allocation failures when
> > load goes up.
>
> My idea is to extend kswapd, run cpus_per_node instances of kswapd per
> node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
> per cpu)
>
> whenever we would hit direct reclaim, add ourselves to a special
> waitqueue corresponding to the type of GFP and kick all the
> corresponding kswapds.

I don't know what this is solving? You don't need to run all reclaim from the kswapd process in order to limit concurrency. Just explicitly limit it when a process applies for PF_MEMALLOC reserves. I had a patch to do this at one point, but it never got much testing -- I think there were other problems with a single process able to do unbounded writeout and such anyway.

But yeah, I don't think getting rid of direct reclaim will do anything magical.

> Now Linus' big objection is that all these processes would hit a wall
> and not progress until the watermarks are high again.
>
> Here is where the 'special' part of the waitqueue comes into order.
>
> Instead of freeing pages to the page allocator, these kswapds would hand
> out pages to the waiting processes in a round robin fashion. Only if
> there are no more waiting processes left, would the page go to the buddy
> system.

Directly getting back pages (and having more than 1 kswapd per node) may be things worth exploring at some point. But I don't see how much bearing they have on any deadlock problems.

> > > And then there is the deadlock in add_to_swap() that I still have to
> > > look into, I hope it can eventually be solved using reserve based
> > > allocation.
> >
> > Yes it should have a reserve. It wouldn't be hard, all you need is
> > enough memory to be able to swap out a single page I would think (ie.
> > one preload's worth).
>
> Yeah, just need to look at the locking and batching, and ensure it has
> enough preload to survive one batch, once all the locks are dropped it
> can breathe again :-)

I don't think you'd need to do anything remotely fancy ;) Just so long as it can allocate a swapcache entry for a single page to write out, that page will be written and eventually reclaimed, along with its radix tree nodes.

> > > The biggest issue is receiving the completion notification. Network
> > > needs to fall back to a state where it does not blindly consume memory
> > > or drop _all_ packets. An intermediate state is required, one where we
> > > can receive and inspect incoming packets but commit to very few.
> >
> > Yes, I understand this is the main problem. But it is not _helped_ by
> > the fact that reclaim reserves include the atomic allocation reserves.
> > I haven't run this problem for a long time, but I'd venture to guess the
> > _main_ reason the deadlock is hit is not because of networking allocating
> > a lot of other irrelevant data, but because of reclaim using up most of
> > the atomic allocation reserves.
>
> Ah, interesting notion.
>
> > And this observation is not tied to recursive reclaim: if we somehow had
> > a reserve for atomic allocations that was aside from the reclaim reserve,
> > I think such a system would be practically free of deadlock for more
> > anonymous-intensive workloads too.
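Nick's suggestion above -- explicitly limiting how many processes may draw on the PF_MEMALLOC reserves, rather than funnelling all reclaim through kswapd -- can be sketched as a user-space toy model. All names and the limit value here are invented for illustration; this is not the actual kernel interface, and a real implementation would need atomics or a lock:

```c
#include <assert.h>

/* Toy model: cap the number of concurrent users of the reserve. */
#define MAX_RESERVE_USERS 4

static int reserve_users;

/* Returns 1 if the caller may dip into the reserve, 0 if it should
 * first try to reclaim clean pages or wait for another user to exit. */
int reserve_enter(void)
{
    if (reserve_users >= MAX_RESERVE_USERS)
        return 0;
    reserve_users++;
    return 1;
}

void reserve_exit(void)
{
    assert(reserve_users > 0);
    reserve_users--;
}
```

The point of the cap is exactly what Nick describes: reclaim load going up should not be able to drain the reserve that atomic allocations also depend on.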
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Thu, 2007-08-23 at 14:11 +0400, Nikita Danilov wrote:
> Peter Zijlstra writes:
>
> [...]
>
> > My idea is to extend kswapd, run cpus_per_node instances of kswapd per
> > node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
> > per cpu)
> >
> > whenever we would hit direct reclaim, add ourselves to a special
> > waitqueue corresponding to the type of GFP and kick all the
> > corresponding kswapds.
>
> There are two standard objections to this:
>
> - direct reclaim was introduced to reduce memory allocation latency,
> and going to the scheduler kills this. But more importantly,

The part you snipped:

> > Here is where the 'special' part of the waitqueue comes into order.
> >
> > Instead of freeing pages to the page allocator, these kswapds would hand
> > out pages to the waiting processes in a round robin fashion. Only if
> > there are no more waiting processes left, would the page go to the buddy
> > system.

should deal with that, it allows processes to quickly get some memory.

> - it might so happen that _all_ per-cpu kswapd instances are
> blocked, e.g., waiting for IO on indirect blocks, or queue
> congestion. In that case the whole system stops, waiting for IO to
> complete. In the direct reclaim case, other threads can continue
> zone scanning.

By running separate GFP_KERNEL, GFP_NOFS and GFP_NOIO kswapds this should not occur. Much like it now does not occur.

This approach would make it work pretty much like it does now. But instead of letting each separate context run into reclaim we then have a fixed set of reclaim contexts which evenly distribute their resulting free pages.

The possible down sides are:
- more schedule()s, but I don't think these will matter when we're that deep into reclaim
- less concurrency - but I hope 1 set per cpu is enough, we could up this if it turns out to really help.
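The round-robin hand-out Peter defends here can be modelled in a few lines. This is a user-space toy, not kernel code, and every name in it is made up: freed pages go to waiters in turn, and only fall through to the buddy allocator when nobody is waiting.

```c
#include <assert.h>

#define MAX_WAITERS 16

/* Toy model of the 'special' waitqueue: each freed page is handed to
 * the next waiting process in round-robin order; only when no process
 * is waiting does the page reach the buddy allocator. */
struct handout_queue {
    int pages_handed[MAX_WAITERS]; /* pages received per waiter */
    int nr_waiters;
    int next;                      /* round-robin cursor */
    int buddy_pages;               /* pages that fell through */
};

void hand_out_page(struct handout_queue *q)
{
    if (q->nr_waiters == 0) {
        q->buddy_pages++;
        return;
    }
    q->pages_handed[q->next]++;
    q->next = (q->next + 1) % q->nr_waiters;
}
```

This shows why the scheme avoids Linus' "hit a wall" objection: a waiter receives pages as soon as any kswapd frees one, without the zone watermarks having to recover first.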
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
Peter Zijlstra writes:

[...]

> My idea is to extend kswapd, run cpus_per_node instances of kswapd per
> node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
> per cpu)
>
> whenever we would hit direct reclaim, add ourselves to a special
> waitqueue corresponding to the type of GFP and kick all the
> corresponding kswapds.

There are two standard objections to this:

- direct reclaim was introduced to reduce memory allocation latency, and going to the scheduler kills this. But more importantly,

- it might so happen that _all_ per-cpu kswapd instances are blocked, e.g., waiting for IO on indirect blocks, or queue congestion. In that case the whole system stops, waiting for IO to complete. In the direct reclaim case, other threads can continue zone scanning.

Nikita.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Thu, 2007-08-23 at 05:38 +0200, Nick Piggin wrote:
> On Tue, Aug 21, 2007 at 04:07:15PM +0200, Peter Zijlstra wrote:
> > On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> > >
> > > Although interestingly, we are not guaranteed to have enough memory to
> > > completely initialise writeout of a single page.
> >
> > Yes, that is due to the unbounded nature of direct reclaim, no?
>
> Even writing out a single page to a plain old block backed filesystem
> can take a fair chunk of memory. I'm not really sure how problematic
> this is with a "real" filesystem, but even with something pretty simple,
> you might have to do block allocation, which itself might have to do
> indirect block allocation (which itself can be 3 or 4 levels), all of
> which have to actually update block bitmaps (which themselves may be
> many pages big). Then you also may have to even just allocate the
> buffer_head structure itself. And that's just to write out a single
> buffer in the page (on a 64K page system, there might be 64 of these).

Right, nikita once talked me through all that when we talked about clustered writeout.

IIRC filesystems were supposed to keep mempools big enough to do this for a single writepage at a time. Not sure it's actually done though.

One advantage here is that swap writeout is very simple, so for swap_writepage() the overhead is minimal, and we can free up space to make progress with the fs writeout. And if there is little anonymous memory in the system it must have a lot of clean memory because of the dirty limit.

But yeah, there are some nasty details left here.

> > I've been meaning to write some patches to address this problem in a way
> > that does not introduce the hard wall Linus objects to. If only I had
> > this extra day in the week :-/
>
> For this problem I think the right way to go is to ensure everything
> is allocated to do writeout at page-dirty-time.
> This is what fsblock
> does (or at least _allows_ for: filesystems that do journalling or
> delayed allocation etc. themselves will have to ensure they have
> sufficient preallocations to do the manipulations they need at writeout
> time).
>
> But again, on the pragmatic side, the best behaviour I think is just
> to have writeouts not allocate from reserves without first trying to
> reclaim some clean memory, and also limit the number of users of the
> reserve. We want this anyway, right, because we don't want regular
> reclaim to start causing things like atomic allocation failures when
> load goes up.

My idea is to extend kswapd, run cpus_per_node instances of kswapd per node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds per cpu)

whenever we would hit direct reclaim, add ourselves to a special waitqueue corresponding to the type of GFP and kick all the corresponding kswapds.

Now Linus' big objection is that all these processes would hit a wall and not progress until the watermarks are high again.

Here is where the 'special' part of the waitqueue comes into order.

Instead of freeing pages to the page allocator, these kswapds would hand out pages to the waiting processes in a round robin fashion. Only if there are no more waiting processes left, would the page go to the buddy system.

> > And then there is the deadlock in add_to_swap() that I still have to
> > look into, I hope it can eventually be solved using reserve based
> > allocation.
>
> Yes it should have a reserve. It wouldn't be hard, all you need is
> enough memory to be able to swap out a single page I would think (ie.
> one preload's worth).

Yeah, just need to look at the locking and batching, and ensure it has enough preload to survive one batch, once all the locks are dropped it can breathe again :-)

> > > The buffer layer doesn't require disk blocks to be allocated at page
> > > dirty-time.
> > > Allocating disk blocks can require complex filesystem operations
> > > and readin of buffer cache pages. The buffer_head structures themselves may
> > > not even be present and must be allocated :P
> > >
> > > In _practice_, this isn't such a problem because we have dirty limits, and
> > > we're almost guaranteed to have some clean pages to be reclaimed. In this
> > > same way, networked filesystems are not a problem in practice. However
> > > network swap, because there are no dirty limits on swap, can actually see
> > > the deadlock problems.
> >
> > The main problem with networked swap is not so much sending out the
> > pages (this has similar problems to the filesystems but is all bounded
> > in its memory use).
> >
> > The biggest issue is receiving the completion notification. Network
> > needs to fall back to a state where it does not blindly consume memory
> > or drop _all_ packets. An intermediate state is required, one where we
> > can receive and inspect incoming packets but commit to very few.
>
> Yes, I understand this is the main problem. But it is not _helped_ by
> the fact that reclaim reserves include the atomic allocation reserves.
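The mempool idea the thread keeps returning to -- a reserve big enough to complete one writepage's worth of metadata -- can be sketched in user space like so. The pool size and all names here are invented for the sketch; the real kernel facility is the mempool API (mempool_create()/mempool_alloc()):

```c
#include <stdlib.h>
#include <stddef.h>

/* Toy mempool: pre-allocate POOL_MIN objects up front, dip into that
 * reserve only when the normal allocator fails, and let frees refill
 * the reserve first. POOL_MIN is an invented worst case covering
 * indirect blocks, bitmap pages and the buffer_head itself. */
#define POOL_MIN 8

struct pool {
    void *reserve[POOL_MIN];
    int nr;            /* objects currently held in reserve */
    size_t elem_size;
};

int pool_init(struct pool *p, size_t elem_size)
{
    p->elem_size = elem_size;
    for (p->nr = 0; p->nr < POOL_MIN; p->nr++)
        if (!(p->reserve[p->nr] = malloc(elem_size)))
            return -1;
    return 0;
}

void *pool_alloc(struct pool *p)
{
    void *obj = malloc(p->elem_size);
    if (!obj && p->nr > 0)
        obj = p->reserve[--p->nr];  /* fall back to the reserve */
    return obj;
}

void pool_free(struct pool *p, void *obj)
{
    if (p->nr < POOL_MIN)
        p->reserve[p->nr++] = obj;  /* refill the reserve first */
    else
        free(obj);
}
```

The guarantee this buys is the one discussed above: even with the general allocator exhausted, one writepage can always complete, and its completion eventually frees memory.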
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Tue, Aug 21, 2007 at 04:07:15PM +0200, Peter Zijlstra wrote:
> On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> >
> > Although interestingly, we are not guaranteed to have enough memory to
> > completely initialise writeout of a single page.
>
> Yes, that is due to the unbounded nature of direct reclaim, no?

Even writing out a single page to a plain old block backed filesystem can take a fair chunk of memory. I'm not really sure how problematic this is with a "real" filesystem, but even with something pretty simple, you might have to do block allocation, which itself might have to do indirect block allocation (which itself can be 3 or 4 levels), all of which have to actually update block bitmaps (which themselves may be many pages big). Then you also may have to even just allocate the buffer_head structure itself. And that's just to write out a single buffer in the page (on a 64K page system, there might be 64 of these).

Unbounded direct reclaim surely doesn't help either :P

> I've been meaning to write some patches to address this problem in a way
> that does not introduce the hard wall Linus objects to. If only I had
> this extra day in the week :-/

For this problem I think the right way to go is to ensure everything is allocated to do writeout at page-dirty-time. This is what fsblock does (or at least _allows_ for: filesystems that do journalling or delayed allocation etc. themselves will have to ensure they have sufficient preallocations to do the manipulations they need at writeout time).

But again, on the pragmatic side, the best behaviour I think is just to have writeouts not allocate from reserves without first trying to reclaim some clean memory, and also limit the number of users of the reserve. We want this anyway, right, because we don't want regular reclaim to start causing things like atomic allocation failures when load goes up.
> And then there is the deadlock in add_to_swap() that I still have to
> look into, I hope it can eventually be solved using reserve based
> allocation.

Yes it should have a reserve. It wouldn't be hard, all you need is enough memory to be able to swap out a single page I would think (ie. one preload's worth).

> > The buffer layer doesn't require disk blocks to be allocated at page
> > dirty-time. Allocating disk blocks can require complex filesystem operations
> > and readin of buffer cache pages. The buffer_head structures themselves may
> > not even be present and must be allocated :P
> >
> > In _practice_, this isn't such a problem because we have dirty limits, and
> > we're almost guaranteed to have some clean pages to be reclaimed. In this
> > same way, networked filesystems are not a problem in practice. However
> > network swap, because there are no dirty limits on swap, can actually see
> > the deadlock problems.
>
> The main problem with networked swap is not so much sending out the
> pages (this has similar problems to the filesystems but is all bounded
> in its memory use).
>
> The biggest issue is receiving the completion notification. Network
> needs to fall back to a state where it does not blindly consume memory
> or drop _all_ packets. An intermediate state is required, one where we
> can receive and inspect incoming packets but commit to very few.

Yes, I understand this is the main problem. But it is not _helped_ by the fact that reclaim reserves include the atomic allocation reserves. I haven't run this problem for a long time, but I'd venture to guess the _main_ reason the deadlock is hit is not because of networking allocating a lot of other irrelevant data, but because of reclaim using up most of the atomic allocation reserves.
And this observation is not tied to recursive reclaim: if we somehow had a reserve for atomic allocations that was aside from the reclaim reserve, I think such a system would be practically free of deadlock for more anonymous-intensive workloads too.

> In order to create such a network state and for it to be stable, a
> certain amount of memory needs to be available and an external trigger
> is needed to enter and leave this state - currently provided by there
> being more memory available than needed or not.

I do appreciate the deadlock and solution. I'm puzzled by your last line though? Currently we do not provide the required reserves in the network layer, *at all*, right?
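Nick's observation -- keep the atomic-allocation reserve separate from the reclaim reserve so that reclaim cannot starve interrupt-time allocations -- amounts to accounting along these lines. Pool sizes and names are invented for the sketch; this is not how the kernel's watermarks are actually structured:

```c
#include <assert.h>

/* Two separate reserves: PF_MEMALLOC reclaim draws only from its own
 * pool and can never eat the pages set aside for atomic allocations,
 * and vice versa. */
enum pool_id { POOL_ATOMIC, POOL_RECLAIM };

struct reserves {
    int free[2];   /* pages left in each pool */
};

/* When a class's own pool is empty the allocation fails rather than
 * stealing from the other pool -- that isolation is the whole point. */
int reserve_alloc(struct reserves *r, enum pool_id id)
{
    if (r->free[id] == 0)
        return 0;
    r->free[id]--;
    return 1;
}
```

With this split, heavy reclaim exhausts POOL_RECLAIM but networking's atomic receive allocations keep succeeding, which is exactly the deadlock-avoidance property argued for above.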
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> On Mon, Aug 20, 2007 at 11:14:08PM +0200, Peter Zijlstra wrote:
> > On Mon, 2007-08-20 at 13:27 -0700, Christoph Lameter wrote:
> > > On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> > > >
> > > > > Plus the same issue can happen today. Writes are usually not
> > > > > completed during reclaim. If the writes are sufficiently deferred
> > > > > then you have the same issue now.
> > > >
> > > > Once we have initiated (disk) writeout we do not need more memory to
> > > > complete it, all we need to do is wait for the completion interrupt.
> > >
> > > We cannot reclaim the page as long as the I/O is not complete. If you
> > > have too many anonymous pages and the rest of memory is dirty then you
> > > can get into OOM scenarios even without this patch.
> >
> > As long as the reserve is large enough to completely initialize writeout
> > of a single page we can make progress. Once writeout is initialized the
> > completion interrupt is guaranteed to happen (assuming working
> > hardware).
>
> Although interestingly, we are not guaranteed to have enough memory to
> completely initialise writeout of a single page.

Yes, that is due to the unbounded nature of direct reclaim, no?

I've been meaning to write some patches to address this problem in a way that does not introduce the hard wall Linus objects to. If only I had this extra day in the week :-/

And then there is the deadlock in add_to_swap() that I still have to look into, I hope it can eventually be solved using reserve based allocation.

> The buffer layer doesn't require disk blocks to be allocated at page
> dirty-time. Allocating disk blocks can require complex filesystem operations
> and readin of buffer cache pages.
> The buffer_head structures themselves may
> not even be present and must be allocated :P
>
> In _practice_, this isn't such a problem because we have dirty limits, and
> we're almost guaranteed to have some clean pages to be reclaimed. In this
> same way, networked filesystems are not a problem in practice. However
> network swap, because there are no dirty limits on swap, can actually see
> the deadlock problems.

The main problem with networked swap is not so much sending out the pages (this has similar problems to the filesystems but is all bounded in its memory use).

The biggest issue is receiving the completion notification. Network needs to fall back to a state where it does not blindly consume memory or drop _all_ packets. An intermediate state is required, one where we can receive and inspect incoming packets but commit to very few.

In order to create such a network state and for it to be stable, a certain amount of memory needs to be available and an external trigger is needed to enter and leave this state - currently provided by there being more memory available than needed or not.
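The intermediate network state Peter describes -- keep receiving and inspecting every packet, but commit memory only to the few that matter -- might look schematically like this. The predicate and all names are invented; this is not the actual net stack interface:

```c
#include <assert.h>

enum verdict { DROP, KEEP };

/* Toy model of the intermediate state: under memory pressure every
 * incoming packet is still received and inspected, but only packets
 * needed to complete swap writeout are committed to; the rest are
 * dropped early so receive-side memory use stays bounded. */
struct rx_state {
    int under_pressure;   /* external trigger: reserves below watermark */
    int kept, dropped;
};

enum verdict inspect_packet(struct rx_state *s, int is_swap_completion)
{
    if (s->under_pressure && !is_swap_completion) {
        s->dropped++;     /* inspected, but not committed to */
        return DROP;
    }
    s->kept++;
    return KEEP;
}
```

The external trigger corresponds to the last paragraph above: entering and leaving the state is driven by whether more memory is available than needed, not by anything the packets themselves carry.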
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Mon, 2007-08-20 at 14:17 -0700, Christoph Lameter wrote:
> On Mon, 20 Aug 2007, Peter Zijlstra wrote:
>
> > > It's not that different.
> >
> > Yes it is, disk based completion does not require memory, network based
> > completion requires unbounded memory.
>
> Disk based completion only requires no memory if it's not on a stack of
> other devices and if the interrupt handler is appropriately shaped. If
> there are multiple levels below or there is some sort of complex
> completion handling then this also may require memory.

I'm not aware of such a scenario - but it could well be. Still, if it did, it would take a _bounded_ amount of memory per page.

Network would still differ in that it requires an _unbounded_ amount of packets to receive and process in order to receive that completion.
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Mon, Aug 20, 2007 at 11:14:08PM +0200, Peter Zijlstra wrote:
> On Mon, 2007-08-20 at 13:27 -0700, Christoph Lameter wrote:
> > On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> > >
> > > > Plus the same issue can happen today. Writes are usually not completed
> > > > during reclaim. If the writes are sufficiently deferred then you have
> > > > the same issue now.
> > >
> > > Once we have initiated (disk) writeout we do not need more memory to
> > > complete it, all we need to do is wait for the completion interrupt.
> >
> > We cannot reclaim the page as long as the I/O is not complete. If you
> > have too many anonymous pages and the rest of memory is dirty then you can
> > get into OOM scenarios even without this patch.
>
> As long as the reserve is large enough to completely initialize writeout
> of a single page we can make progress. Once writeout is initialized the
> completion interrupt is guaranteed to happen (assuming working
> hardware).

Although interestingly, we are not guaranteed to have enough memory to completely initialise writeout of a single page.

The buffer layer doesn't require disk blocks to be allocated at page dirty-time. Allocating disk blocks can require complex filesystem operations and readin of buffer cache pages. The buffer_head structures themselves may not even be present and must be allocated :P

In _practice_, this isn't such a problem because we have dirty limits, and we're almost guaranteed to have some clean pages to be reclaimed. In this same way, networked filesystems are not a problem in practice. However network swap, because there are no dirty limits on swap, can actually see the deadlock problems.
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Mon, 20 Aug 2007, Peter Zijlstra wrote:

> > It's not that different.
>
> Yes it is, disk based completion does not require memory, network based
> completion requires unbounded memory.

Disk based completion only requires no memory if it's not on a stack of other devices and if the interrupt handler is appropriately shaped. If there are multiple levels below or there is some sort of complex completion handling then this also may require memory.
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Mon, 2007-08-20 at 13:27 -0700, Christoph Lameter wrote:
> On Mon, 20 Aug 2007, Peter Zijlstra wrote:
>
> > > Plus the same issue can happen today. Writes are usually not completed
> > > during reclaim. If the writes are sufficiently deferred then you have the
> > > same issue now.
> >
> > Once we have initiated (disk) writeout we do not need more memory to
> > complete it, all we need to do is wait for the completion interrupt.
>
> We cannot reclaim the page as long as the I/O is not complete. If you
> have too many anonymous pages and the rest of memory is dirty then you can
> get into OOM scenarios even without this patch.

As long as the reserve is large enough to completely initialize writeout of a single page we can make progress. Once writeout is initialized the completion interrupt is guaranteed to happen (assuming working hardware).

This means I can happily run a 256M anonymous workload on a machine with only 128M memory.

> > Networking is different here in that an unbounded amount of net traffic
> > needs to be processed in order to find the completion event.
>
> It's not that different.

Yes it is, disk based completion does not require memory, network based completion requires unbounded memory.

> Pages are pinned during writeout from reclaim and
> it is not clear when the write will complete.

For disk based writeback you do not know when it comes, but you need only passively wait for it. For networked writeback you need to receive all packets that happen to be targeted at your machine and inspect them - and toss some away because you cannot keep everything, memory is limited.

> There are no bounds that I
> know in reclaim for the writeback of dirty anonymous pages.

throttle_vm_writeout() does sort-of.

> But some throttling function like for dirty pages is likely needed for
> network traffic.

Yes, Daniel is working on writeout throttling.
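The throttle_vm_writeout() behaviour referenced here effectively bounds the number of pages in flight. As a toy model, with the constant and names invented for illustration:

```c
#include <assert.h>

/* Toy writeback throttle in the spirit of throttle_vm_writeout():
 * new writeout may only start while the count of pages under
 * writeback is below a limit; above it, callers must wait for
 * completion interrupts to drain the count. */
#define WRITEBACK_LIMIT 64

static int nr_writeback;

int try_start_writeout(void)
{
    if (nr_writeback >= WRITEBACK_LIMIT)
        return 0;          /* throttled: wait for completions */
    nr_writeback++;
    return 1;
}

void writeout_complete(void)
{
    assert(nr_writeback > 0);
    nr_writeback--;
}
```

A bound like this is what keeps reclaim's writeback memory footprint finite for disk; the point of the thread is that network receive has no analogous bound today.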
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Mon, 20 Aug 2007, Peter Zijlstra wrote:

> > Plus the same issue can happen today. Writes are usually not completed
> > during reclaim. If the writes are sufficiently deferred then you have the
> > same issue now.
>
> Once we have initiated (disk) writeout we do not need more memory to
> complete it, all we need to do is wait for the completion interrupt.

We cannot reclaim the page as long as the I/O is not complete. If you have too many anonymous pages and the rest of memory is dirty then you can get into OOM scenarios even without this patch.

> Networking is different here in that an unbounded amount of net traffic
> needs to be processed in order to find the completion event.

It's not that different. Pages are pinned during writeout from reclaim and it is not clear when the write will complete. There are no bounds that I know in reclaim for the writeback of dirty anonymous pages.

But some throttling function like for dirty pages is likely needed for network traffic.
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Mon, 2007-08-20 at 12:00 -0700, Christoph Lameter wrote:
> On Sat, 18 Aug 2007, Pavel Machek wrote:
>
> > > The reclaim is of particular importance to stacked filesystems that may
> > > do a lot of allocations in the write path. Reclaim will be working
> > > as long as there are clean file backed pages to reclaim.
> >
> > I don't get it. Let's say that we have a stacked filesystem that needs
> > it. That filesystem is broken today.
> >
> > Now you give it a second chance by reclaiming clean pages, but there are
> > no guarantees that we have any so that filesystem is still broken
> > with your patch...?
>
> There is a guarantee that we have some because the user space program is
> executing. Meaning the executable pages can be retrieved. The amount of
> dirty memory in the system is limited by the dirty_ratio. So the VM can
> only get into trouble if there is a sufficient amount of anonymous pages
> and all executables have been reclaimed. That is pretty rare.
>
> Plus the same issue can happen today. Writes are usually not completed
> during reclaim. If the writes are sufficiently deferred then you have the
> same issue now.

Once we have initiated (disk) writeout we do not need more memory to complete it, all we need to do is wait for the completion interrupt.

Networking is different here in that an unbounded amount of net traffic needs to be processed in order to find the completion event.
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
On Sat, 18 Aug 2007, Pavel Machek wrote:

> > The reclaim is of particular importance to stacked filesystems that may
> > do a lot of allocations in the write path. Reclaim will be working
> > as long as there are clean file backed pages to reclaim.
>
> I don't get it. Let's say that we have a stacked filesystem that needs
> it. That filesystem is broken today.
>
> Now you give it a second chance by reclaiming clean pages, but there are
> no guarantees that we have any so that filesystem is still broken
> with your patch...?

There is a guarantee that we have some because the user space program is executing. Meaning the executable pages can be retrieved. The amount of dirty memory in the system is limited by the dirty_ratio. So the VM can only get into trouble if there is a sufficient amount of anonymous pages and all executables have been reclaimed. That is pretty rare.

Plus the same issue can happen today. Writes are usually not completed during reclaim. If the writes are sufficiently deferred then you have the same issue now.
Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
Hi!

> If we exhaust the reserves in the page allocator when PF_MEMALLOC is set
> then no longer give up but call into reclaim with PF_MEMALLOC set.
>
> This is in essence a recursive call back into page reclaim with another
> page flag (__GFP_NOMEMALLOC) set. The recursion is bounded since potential
> allocations with __PF_NOMEMALLOC set will not enter that branch again.
>
> This means that allocation under PF_MEMALLOC will no longer run out of
> memory. Allocations under PF_MEMALLOC will do a limited form of reclaim
> instead.
>
> The reclaim is of particular importance to stacked filesystems that may
> do a lot of allocations in the write path. Reclaim will be working
> as long as there are clean file backed pages to reclaim.

I don't get it. Let's say that we have a stacked filesystem that needs it. That filesystem is broken today.

Now you give it a second chance by reclaiming clean pages, but there are no guarantees that we have any so that filesystem is still broken with your patch...?

Should we fix the filesystem instead?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html