Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
Hi!

>> If it is only one place, why not pre-allocate one "I'm sick now" skb and hold onto it. Any bigger solution seems to snowball into a huge mess.
>
> But the problem is that even sending/receiving a single packet can cause multiple dynamic allocations in the networking path, all the way from the sockets layer through transport and IP to the driver. To successfully send a packet, we may have to do ARP, send ACKs, create cached routes, etc. So my patch tried to identify the allocations that are needed to successfully send/receive packets over a pre-established socket and adds a new flag, GFP_CRITICAL, to those calls. This doesn't make any difference when we are not in emergency. But when we go into emergency, the VM will try to satisfy these allocations from a critical pool if the normal path leads to failure.
>
> We go into emergency when some management app detects that a swap device is about to fail (we are not yet in OOM, but will enter OOM soon). In order to avoid entering OOM, we need to send a message over a critical socket to a remote server that can initiate failover and switch to a different swap device. The switchover will happen within 2 minutes after it is initiated. In a cluster environment, the remote server also sends a message to the other nodes, which are also running the management app, so that they also enter emergency. Once we successfully switch to a different swap device, the remote server sends a message to all the nodes and they come out of emergency.
>
> During the period of emergency, all other communications can block. But guaranteeing the successful delivery of the critical messages will help in making sure that we do not enter an OOM situation.

Why not do it the other way? If you don't hear from me for 2 minutes, do a switchover. Then all you have to do is _not_ to send a packet -- easier to do. Anything else seems overkill.

Pavel
--
Thanks, Sharp!
- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
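Pavel's "other way" above is a dead-man switch: the remote server treats silence as the signal to fail over. A minimal sketch of the server-side check follows, assuming a per-node record and a 120-second timeout taken from the "2 minutes" figure in the message. The names node_watch, node_heartbeat and node_needs_switchover are hypothetical and do not come from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed timeout, taken from the "2 minutes" mentioned in the thread. */
#define HEARTBEAT_TIMEOUT_SEC 120u

/* Hypothetical per-node state kept by the remote management server. */
struct node_watch {
    uint64_t last_heartbeat_sec; /* when the last "I'm fine" message arrived */
};

/* Record a heartbeat received at time now_sec. */
void node_heartbeat(struct node_watch *w, uint64_t now_sec)
{
    w->last_heartbeat_sec = now_sec;
}

/* True once the node has been silent for at least the timeout. */
bool node_needs_switchover(const struct node_watch *w, uint64_t now_sec)
{
    /* Guard against unsigned wrap-around if the clock is misused. */
    assert(now_sec >= w->last_heartbeat_sec);
    return now_sec - w->last_heartbeat_sec >= HEARTBEAT_TIMEOUT_SEC;
}
```

With this scheme the node in trouble never has to allocate memory to send anything; it only has to stop sending. It covers the case where the failover target is known in advance, which is the limitation raised in the reply that follows.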
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
> Why not do it the other way? If you don't hear from me for 2 minutes, do a switchover. Then all you have to do is _not_ to send a packet -- easier to do. Anything else seems overkill.
>
> Pavel

Because in some of the scenarios, including ours, it isn't a simple failover to a known alternate device or configuration -- it is reconfiguring dynamically with information received on a socket from a remote machine (while the swap device is unavailable). Limited socket communication without allocating new memory that may not be available is the problem definition.

Avoiding the problem in the first place (your solution) is effective if you can do it, of course. The trick is to solve the problem when you can't avoid it. :-)

+-DLS
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
David S. Miller [EMAIL PROTECTED] wrote:

> The idea to mark, for example, IPSEC key management daemon's sockets as critical is flawed, because the key management daemon could hit a swap page over the iSCSI device. Don't even start with the idea to lock the IPSEC key management daemon into ram with mlock().

How are you going to swap in the key manager if you need the key manager for doing this?

However, I'd prefer a system where you can't dirty more than (e.g.) 80 % of RAM unless you need this to maintain vital system activity, and not more than 95 % unless it will help to get more clean RAM. (Like the priority inheritance suggestion from this thread.) I suppose this would at least significantly reduce thrashing and give a very good chance of recovering from memory pressure.

Of course the implementation won't be easy, especially if userspace applications need to inherit priority from different code paths, but in theory, it can be done.

--
I thank GMX for sabotaging the use of my addresses by means of lies spread via SPF.
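The two-threshold idea in the message above can be written down as a small admission check. This is only a sketch of the policy as stated (80 % and 95 % are the example figures from the message). The enum dirty_class and the function may_dirty_page are hypothetical names, and nothing here reflects how the kernel's writeback code actually works.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of who wants to dirty a page. */
enum dirty_class {
    DIRTY_ORDINARY, /* normal activity */
    DIRTY_VITAL,    /* needed to maintain vital system activity */
    DIRTY_RECLAIM   /* dirtying this page helps produce more clean RAM */
};

/* dirty_pct is the share of RAM that is already dirty, 0..100. */
bool may_dirty_page(unsigned dirty_pct, enum dirty_class cls)
{
    assert(dirty_pct <= 100);
    if (dirty_pct < 80)
        return true;                  /* below the first limit: anyone */
    if (dirty_pct < 95)
        return cls != DIRTY_ORDINARY; /* 80..94 %: vital or reclaim only */
    return cls == DIRTY_RECLAIM;      /* 95 % and up: reclaim only */
}
```

The hard part named in the message, deciding which class a given request belongs to when priority has to be inherited across code paths, is exactly what this sketch leaves out.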
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: Sridhar Samudrala [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST)

> Instead, you seem to be suggesting in_emergency to be set dynamically when we are about to run out of ATOMIC memory. Is this right?

Not when we run out, but rather when we reach some low water mark, the critical sockets would still use GFP_ATOMIC memory but only critical sockets would be allowed to do so.

But even this has faults: consider the IPSEC scenario I mentioned, and this applies to any kind of encapsulation actually; even simple tunneling examples can be concocted which make the critical socket idea fail.

The knee jerk reaction is mark IPSEC's sockets critical, and mark the tunneling allocations critical, and... and... well you have GFP_ATOMIC then my friend.

In short, these separate page pool and critical socket ideas do not work and we need a different solution. I'm sorry folks spent so much time on them, but they are heavily flawed.
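The low-water-mark variant described in the message above (one pool, with non-critical users cut off once free memory gets low) can be sketched as a toy allocator. The struct atomic_pool and the function pool_alloc_page are hypothetical and stand in for the real GFP_ATOMIC reserves only loosely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the atomic memory reserve. */
struct atomic_pool {
    size_t free_pages;    /* pages currently available */
    size_t low_watermark; /* at or below this, only critical sockets allocate */
};

/* Takes one page and returns true if the request is admitted. */
bool pool_alloc_page(struct atomic_pool *p, bool sock_is_critical)
{
    assert(p != NULL);
    if (p->free_pages == 0)
        return false; /* nothing left for anyone */
    if (p->free_pages <= p->low_watermark && !sock_is_critical)
        return false; /* low on memory: non-critical users are refused */
    p->free_pages--;
    return true;
}
```

The objection in the message still applies to this sketch: once IPSEC, tunneling and every other dependency have to pass sock_is_critical as true, the check no longer filters anything.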
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote:
> From: Sridhar Samudrala [EMAIL PROTECTED]
> Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST)
>
>> Instead, you seem to be suggesting in_emergency to be set dynamically when we are about to run out of ATOMIC memory. Is this right?
>
> Not when we run out, but rather when we reach some low water mark, the critical sockets would still use GFP_ATOMIC memory but only critical sockets would be allowed to do so.
>
> But even this has faults: consider the IPSEC scenario I mentioned, and this applies to any kind of encapsulation actually; even simple tunneling examples can be concocted which make the critical socket idea fail.
>
> The knee jerk reaction is mark IPSEC's sockets critical, and mark the tunneling allocations critical, and... and... well you have GFP_ATOMIC then my friend.
>
> In short, these separate page pool and critical socket ideas do not work and we need a different solution. I'm sorry folks spent so much time on them, but they are heavily flawed.

maybe it should be approached from the other side; having a way to mark connections as low priority (say incoming http connections to your webserver) or as non-critical/expendable would give the normal GFP_ATOMIC ones a better chance in case of overload/DDOS etc. It's not going to solve the VM deadlock issue wrt iscsi/nfs; however it might be useful in the "survive slashdot" sense...
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
> Also, all this stuff is just a band aid because linux OOM behavior is so fucked up.

In our internal discussions, characterizing this as OOM came up a lot, and I don't think of it as that at all. OOM is exactly what the scheme is trying to avoid!

The actual situation we have in mind is a swap device management system in a cluster where a remote system tells you (via socket communication to a user-land management app) that a swap device is going to fail over and it'd be a good idea not to do anything that requires paging out or swapping for a short period of time. The socket communication must work, but the system is not at all out of memory, and the important point is that it never will be if you limit allocations to those things that are required for the critical socket to work (and nothing/little else).

Receiver side allocations are unavoidable, because you don't know if you can drop the packet or not until you look at it. Some infrastructure must work. But everything else can fail or succeed based on ordinary churn in ordinary memory pools, until the in_emergency condition has passed. The critical socket(s) simply have to be out of the zero-sum game for the rest of the allocations, because those are the (only) path to getting a working swap device again.

If you're out of memory without a network mechanism to get you more, this doesn't do anything for you (and it isn't intended to). And if you mark any socket that isn't going to get you failed over or otherwise get you more swap, it isn't going to help you, either. It isn't a priority scheme for low-memory, it's a failover mechanism that relies on networking. There are exactly 2 priorities: critical (as in you might as well crash if these aren't satisfied) and everything else.

Doing other, more general things that handle low memory, or OOM, or identified priorities are great, but the problem we're interested in solving here is really just about making socket communication work when the alternative is a completely dead system. I think these patches do that in a reasonable way. A better solution would be great, too, if there is one. :-)

+-DLS
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
David S. Miller [EMAIL PROTECTED] wrote on 12/15/2005 12:58:05 AM:

> From: David Stevens [EMAIL PROTECTED]
> Date: Thu, 15 Dec 2005 00:44:52 -0800
>
>> In our internal discussions
>
> I really wish this hadn't been discussed internally before being implemented. Any such internal discussions are lost completely upon the community that ends up reviewing such a core and invasive patch such as this one.

I think those were more informal and less extensive than the impression I gave you. I mean simply bouncing around incomplete ideas and discussing some of the potential issues before coming up with a prototype solution, which is intended to be the starting point for community discussions (and the KS discussions, too). OOM came up immediately (even when naming the problem), and it isn't how I ever saw it.

The patches, of course, are intended to NOT be invasive, or any more than they need to be, and they are not the solution, but a solution. A completely different one that solves the problem is just as good to me.

+-DLS
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
Mitchell Blank Jr wrote:
> James Courtier-Dutton wrote:
>> When I had the conversation with Matt at KS, the problem we were trying to solve was "Memory pressure with network attached swap space".
>
> s/swap space/writable filesystems/
>
> You can hit these problems even if you have no swap. Too much of the memory becomes filled with dirty pages needing writeback -- then you lose your NFS server's ARP entry at the wrong moment. If you have a local disk to swap to, the machine will recover after a little bit of grinding; otherwise it's all pretty much over.
>
> The big problem is that as long as there's network I/O coming in it's likely that pages you free (as the VM gets more and more desperate about dropping the few remaining non-dirty pages) will get used for sockets that AREN'T helping you recover RAM. You really need to be able to tell the whole network stack "we're in really rough shape here; ignore all RX work unless it's going to help me get write ACKs back from my {NFS,iSCSI} server". My understanding is that is what this patchset is trying to accomplish.
>
> -Mitch

You are using the wrong hammer to crack your nut. You should instead approach your problem of why the ARP entry gets lost. For example, you could give critical priority to your TCP session, but that still won't cure your ARP problem. I would suggest that the best way to cure your ARP problem is to increase the time between ARP cache refreshes.

James
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
> You are using the wrong hammer to crack your nut. You should instead approach your problem of why the ARP entry gets lost. For example, you could give critical priority to your TCP session, but that still won't cure your ARP problem. I would suggest that the best way to cure your ARP problem is to increase the time between ARP cache refreshes.

or turn it around entirely: all traffic is considered important unless... and have a bunch of non-critical sockets (like http requests) be marked non-critical.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Thu, 2005-15-12 at 12:47 +0100, Arjan van de Ven wrote:
>> You are using the wrong hammer to crack your nut. You should instead approach your problem of why the ARP entry gets lost. For example, you could give critical priority to your TCP session, but that still won't cure your ARP problem. I would suggest that the best way to cure your ARP problem is to increase the time between ARP cache refreshes.
>
> or turn it around entirely: all traffic is considered important unless... and have a bunch of non-critical sockets (like http requests) be marked non-critical.

The big hole punched by DaveM is that of dependencies: an http tcp connection is tied to ICMP or the IPSEC example given; so you need a lot more intelligence than just what your app is knowledgeable about at its level.

You can't really do this shit at the socket level. You need to do it much earlier. At runtime, when lower memory thresholds get crossed, you kick in classification of what packets need to be dropped using something along the lines of stateful/connection tracking. When things get better you undo.

cheers,
jamal
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Thu, 2005-12-15 at 08:00 -0500, jamal wrote:
> On Thu, 2005-15-12 at 12:47 +0100, Arjan van de Ven wrote:
>>> You are using the wrong hammer to crack your nut. You should instead approach your problem of why the ARP entry gets lost. For example, you could give critical priority to your TCP session, but that still won't cure your ARP problem. I would suggest that the best way to cure your ARP problem is to increase the time between ARP cache refreshes.
>>
>> or turn it around entirely: all traffic is considered important unless... and have a bunch of non-critical sockets (like http requests) be marked non-critical.
>
> The big hole punched by DaveM is that of dependencies: an http tcp connection is tied to ICMP or the IPSEC example given; so you need a lot more intelligence than just what your app is knowledgeable about at its level.

yeah well sort of. You're right of course, but that also doesn't mean you can't give hints from the other side. Like "data for this socket is NOT critically important". It gets tricky if you only do it for OOM stuff; because then that one ACK packet could cause a LOT of memory to be freed, and as such can be important for the system even if the socket isn't.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote:
> From: Sridhar Samudrala [EMAIL PROTECTED]
> Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST)
>
>> Instead, you seem to be suggesting in_emergency to be set dynamically when we are about to run out of ATOMIC memory. Is this right?
>
> Not when we run out, but rather when we reach some low water mark, the critical sockets would still use GFP_ATOMIC memory but only critical sockets would be allowed to do so.
>
> But even this has faults: consider the IPSEC scenario I mentioned, and this applies to any kind of encapsulation actually; even simple tunneling examples can be concocted which make the critical socket idea fail.
>
> The knee jerk reaction is mark IPSEC's sockets critical, and mark the tunneling allocations critical, and... and... well you have GFP_ATOMIC then my friend.

I would like to mention another reason why we need to have a new GFP_CRITICAL flag for an allocation request. When we are in emergency, even the GFP_KERNEL allocations for a critical socket should not sleep. This is because the swap device may have failed and we would like to communicate this event to a management server over the critical socket so that it can initiate the failover.

We are not trying to solve the swapping-over-network problem. It is much simpler. The critical sockets are to be used only to send/receive a few critical messages reliably during a short period of emergency.

Thanks
Sridhar
[RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
This set of patches provides a TCP/IP emergency communication mechanism that could be used to guarantee that high priority communications over a critical socket succeed even under very low memory conditions that last for a couple of minutes. It uses the critical page pool facility provided by Matt's patches that he posted recently on lkml.

http://lkml.org/lkml/2005/12/14/34/index.html

This mechanism provides a new socket option SO_CRITICAL that can be used to mark a socket as critical. A critical connection used for emergency communications has to be established and marked as critical before we enter the emergency condition.

It uses the __GFP_CRITICAL flag introduced in the critical page pool patches to indicate that an allocation request is critical and should be satisfied from the critical page pool if required. In the send path, this flag is passed with all allocation requests that are made for a critical socket. But in the receive path we do not know if a packet is critical or not until we receive it and find the socket that it is destined to. So we treat all the allocation requests in the receive path as critical.

The critical page pool patches also introduce a global flag 'system_in_emergency' that is used to indicate an emergency situation (could be a low memory condition). When this flag is set, any incoming packets that belong to non-critical sockets are dropped as soon as possible in the receive path. This is necessary to prevent incoming non-critical packets from consuming memory from the critical page pool.

I would appreciate any feedback or comments on this approach.

Thanks
Sridhar
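The receive-side rule in the cover letter (allocate first, look up the socket, then drop early if the socket is not critical) can be modelled in a few lines. This is a userspace toy, not the patch itself: struct toy_sock, enum rx_verdict and rx_classify are hypothetical names, and the in_emergency parameter stands in for the global 'system_in_emergency' flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy socket: 'critical' models a socket marked with the proposed SO_CRITICAL. */
struct toy_sock {
    bool critical;
};

enum rx_verdict { RX_DELIVER, RX_DROP };

/*
 * Called after the packet has been received and its destination socket
 * found, which is why the receive buffer itself had to be allocated as
 * critical: the verdict is not known any earlier.
 */
enum rx_verdict rx_classify(bool in_emergency, const struct toy_sock *sk)
{
    assert(sk != NULL);
    if (in_emergency && !sk->critical)
        return RX_DROP; /* free the buffer quickly so the pool is not drained */
    return RX_DELIVER;
}
```

Outside an emergency the check changes nothing, which matches the claim that the mechanism makes no difference in normal operation.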
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
> I would appreciate any feedback or comments on this approach.

Maybe I'm missing something but wouldn't you need a separate critical pool (or at least a reservation) for each socket to be safe against deadlocks?

Otherwise if a critical socket needs e.g. 2 pages to finish something and 2 critical sockets are active, they can each steal the last pages from each other and deadlock.

-Andi
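The deadlock described above can be reproduced in a toy model: two sockets each need 2 pages and take them one at a time from a shared pool. The sketch below is an illustration only. NSOCKS, PAGES_NEEDED and both function names are hypothetical, and round-robin is just one interleaving that triggers the problem.

```c
#include <assert.h>
#include <stdbool.h>

#define NSOCKS 2       /* critical sockets active at the same time */
#define PAGES_NEEDED 2 /* pages each socket needs before it can finish */

/*
 * Shared pool: sockets grab one page per turn in round-robin order.
 * A socket that reaches PAGES_NEEDED finishes and returns its pages.
 * Returns how many sockets managed to finish.
 */
int shared_pool_finishers(int pool_pages)
{
    int held[NSOCKS] = {0};
    bool done[NSOCKS] = {false};
    int finished = 0;
    bool progress = true;

    assert(pool_pages >= 0);
    while (progress) {
        progress = false;
        for (int i = 0; i < NSOCKS; i++) {
            if (done[i] || pool_pages == 0)
                continue;
            held[i]++;
            pool_pages--;
            progress = true;
            if (held[i] == PAGES_NEEDED) {
                done[i] = true;
                finished++;
                pool_pages += held[i]; /* work done, pages go back */
            }
        }
    }
    return finished;
}

/* Per-socket reservation: a socket finishes iff its own reserve is big enough. */
int reserved_pool_finishers(int pages_per_socket)
{
    return pages_per_socket >= PAGES_NEEDED ? NSOCKS : 0;
}
```

With a shared pool of 2 pages each socket ends up holding one page and neither finishes, while a reservation of 2 pages per socket lets both finish. In this toy model a shared pool of 3 pages is also enough, because one socket can then always complete and hand its pages back.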
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, 2005-12-14 at 10:22 +0100, Andi Kleen wrote:
>> I would appreciate any feedback or comments on this approach.
>
> Maybe I'm missing something but wouldn't you need a separate critical pool (or at least a reservation) for each socket to be safe against deadlocks?
>
> Otherwise if a critical socket needs e.g. 2 pages to finish something and 2 critical sockets are active, they can each steal the last pages from each other and deadlock.

Here we are assuming that the pre-allocated critical page pool is big enough to satisfy the requirements of all the critical sockets.

In the current critical page pool implementation, there is also a limitation that only order-0 (single page) allocations are supported. I think in the networking send/receive path, the only place where multi-page allocs are requested is in the drivers if the MTU > PAGESIZE. But I guess the drivers are getting updated to avoid > order-0 allocations.

Also during the emergency, we free the memory allocated for non-critical packets as quickly as possible so that it can be re-used for critical allocations.

Thanks
Sridhar
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
>> It has a lot more users that compete
>
> true, but likely the set of GFP_CRITICAL users would grow over time too and it would develop the same problem.

No, because the critical set is determined by the user (by setting the socket flag). The receive side has some things marked as critical until we have processed enough to check the socket flag, but then they should be released. Those short-lived allocations and frees are more or less 0 net towards the pool.

Certainly, it wouldn't work very well if every socket is marked as critical, but with an adequate pool for the workload, I expect it'll work as advertised (esp. since it'll usually be only one socket associated with swap management that'll be critical).

+-DLS
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote:
> This set of patches provides a TCP/IP emergency communication mechanism that could be used to guarantee that high priority communications over a critical socket succeed even under very low memory conditions that last for a couple of minutes. It uses the critical page pool facility provided by Matt's patches that he posted recently on lkml.
> http://lkml.org/lkml/2005/12/14/34/index.html
>
> This mechanism provides a new socket option SO_CRITICAL that can be used to mark a socket as critical. A critical connection used for emergency

So now everyone writing commercial apps for Linux is going to set SO_CRITICAL on sockets in their apps so their apps can "survive better under pressure than the competitors' apps", and clueless programmers all over are going to think "cool, with this I can make my app more important than everyone else's, I'm going to use this". When everyone and his dog starts to set this, what's the point?

> communications has to be established and marked as critical before we enter the emergency condition.
>
> It uses the __GFP_CRITICAL flag introduced in the critical page pool patches to indicate that an allocation request is critical and should be satisfied from the critical page pool if required. In the send path, this flag is passed with all allocation requests that are made for a critical socket. But in the receive path we do not know if a packet is critical or not until we receive it and find the socket that it is destined to. So we treat all the allocation requests in the receive path as critical.
>
> The critical page pool patches also introduce a global flag 'system_in_emergency' that is used to indicate an emergency situation (could be a low memory condition). When this flag is set, any incoming packets that belong to non-critical sockets are dropped as soon as possible in the receive path.

Hmm, so if I fire up an app that has SO_CRITICAL set on a socket and can then somehow put a lot of memory pressure on the machine, I can cause traffic on other sockets to be dropped.. hmmm.. sounds like something to play with to create new and interesting DoS attacks...

> This is necessary to prevent incoming non-critical packets from consuming memory from the critical page pool.
>
> I would appreciate any feedback or comments on this approach.

To be a little serious, it sounds like something that could be used to cause trouble and something that will lose its usefulness once enough people start using it (for valid or invalid reasons), so what's the point...

--
Jesper Juhl [EMAIL PROTECTED]
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
Jesper Juhl wrote:
> To be a little serious, it sounds like something that could be used to cause trouble and something that will lose its usefulness once enough people start using it (for valid or invalid reasons), so what's the point...

It could easily be a user-configurable option in an application. If DOS is a real concern, only let this work for root users...

Ben

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc http://www.candelatech.com
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
Jesper Juhl wrote:
> On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote:
>> This set of patches provides a TCP/IP emergency communication mechanism that could be used to guarantee that high priority communications over a critical socket succeed even under very low memory conditions that last for a couple of minutes. It uses the critical page pool facility provided by Matt's patches that he posted recently on lkml.
>> http://lkml.org/lkml/2005/12/14/34/index.html
>>
>> This mechanism provides a new socket option SO_CRITICAL that can be used to mark a socket as critical. A critical connection used for emergency
>
> So now everyone writing commercial apps for Linux is going to set SO_CRITICAL on sockets in their apps so their apps can "survive better under pressure than the competitors' apps", and clueless programmers all over are going to think "cool, with this I can make my app more important than everyone else's, I'm going to use this". When everyone and his dog starts to set this, what's the point?

I don't think the initial patches that Matt did were intended for what you are describing.

When I had the conversation with Matt at KS, the problem we were trying to solve was "Memory pressure with network attached swap space". I came up with the idea that I think Matt has implemented. Letting the OS choose which are "critical" TCP/IP sessions is fine. But letting an application choose is a recipe for disaster.

James
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, 2005-12-14 at 20:49 +, James Courtier-Dutton wrote:
> Jesper Juhl wrote:
>> On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote:
>>> This set of patches provides a TCP/IP emergency communication mechanism that could be used to guarantee that high priority communications over a critical socket succeed even under very low memory conditions that last for a couple of minutes. It uses the critical page pool facility provided by Matt's patches that he posted recently on lkml.
>>> http://lkml.org/lkml/2005/12/14/34/index.html
>>>
>>> This mechanism provides a new socket option SO_CRITICAL that can be used to mark a socket as critical. A critical connection used for emergency
>>
>> So now everyone writing commercial apps for Linux is going to set SO_CRITICAL on sockets in their apps so their apps can "survive better under pressure than the competitors' apps", and clueless programmers all over are going to think "cool, with this I can make my app more important than everyone else's, I'm going to use this". When everyone and his dog starts to set this, what's the point?
>
> I don't think the initial patches that Matt did were intended for what you are describing.
>
> When I had the conversation with Matt at KS, the problem we were trying to solve was "Memory pressure with network attached swap space". I came up with the idea that I think Matt has implemented. Letting the OS choose which are "critical" TCP/IP sessions is fine. But letting an application choose is a recipe for disaster.

We could easily add a capable(CAP_NET_ADMIN) check to allow this option to be set only by privileged users.

Thanks
Sridhar
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
Sridhar Samudrala wrote:
> On Wed, 2005-12-14 at 20:49 +, James Courtier-Dutton wrote:
>> Jesper Juhl wrote:
>>> On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote:
>>>> This set of patches provides a TCP/IP emergency communication mechanism that could be used to guarantee that high priority communications over a critical socket succeed even under very low memory conditions that last for a couple of minutes. It uses the critical page pool facility provided by Matt's patches that he posted recently on lkml.
>>>> http://lkml.org/lkml/2005/12/14/34/index.html
>>>>
>>>> This mechanism provides a new socket option SO_CRITICAL that can be used to mark a socket as critical. A critical connection used for emergency
>>>
>>> So now everyone writing commercial apps for Linux is going to set SO_CRITICAL on sockets in their apps so their apps can "survive better under pressure than the competitors' apps", and clueless programmers all over are going to think "cool, with this I can make my app more important than everyone else's, I'm going to use this". When everyone and his dog starts to set this, what's the point?
>>
>> I don't think the initial patches that Matt did were intended for what you are describing.
>>
>> When I had the conversation with Matt at KS, the problem we were trying to solve was "Memory pressure with network attached swap space". I came up with the idea that I think Matt has implemented. Letting the OS choose which are "critical" TCP/IP sessions is fine. But letting an application choose is a recipe for disaster.
>
> We could easily add a capable(CAP_NET_ADMIN) check to allow this option to be set only by privileged users.
>
> Thanks
> Sridhar

Sridhar,

Have you actually thought about what would happen in a real world scenario? There is no real world requirement for this sort of user land feature. In memory pressure mode, you don't care about user applications. In fact, under memory pressure no user applications are getting scheduled. All you care about is swapping out memory to achieve a net gain in free memory, so that the applications can then run ok again.

James
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
James Courtier-Dutton wrote:
> Have you actually thought about what would happen in a real world scenario? There is no real world requirement for this sort of user land feature. In memory pressure mode, you don't care about user applications. In fact, under memory pressure no user applications are getting scheduled. All you care about is swapping out memory to achieve a net gain in free memory, so that the applications can then run ok again.

Low 'ATOMIC' memory is different from the memory that user space typically uses, so just because you can't allocate an SKB does not mean you are swapping out user-space apps.

I have an app that can have 2000+ sockets open. I would definitely like to make the management and other important sockets have priority over others in my app...

Ben

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc http://www.candelatech.com
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, 2005-12-14 at 14:39 -0800, Ben Greear wrote:

James Courtier-Dutton wrote:

Have you actually thought about what would happen in a real-world scenario? There is no real-world requirement for this sort of user-land feature. In memory pressure mode, you don't care about user applications. In fact, under memory pressure no user applications are getting scheduled. All you care about is swapping out memory to achieve a net gain in free memory, so that the applications can then run OK again.

Low 'ATOMIC' memory is different from the memory that user space typically uses, so just because you can't allocate an SKB does not mean you are swapping out user-space apps.

I have an app that can have 2000+ sockets open. I would definitely like to make the management and other important sockets have priority over others in my app...

The scenario we are trying to address is also a management connection, between the nodes of a cluster and a server that manages the swap devices accessible by all the nodes of the cluster. The critical connection is supposed to be used to exchange status notifications of the swap devices so that failover can happen and be propagated to all the nodes as quickly as possible. The management apps will be pinned into memory so that they are not swapped out.

As such, the traffic that flows over the critical sockets is not high, but it should not stall even if we run into a memory-constrained situation. That is the reason why we would like to have a pre-allocated critical page pool which could be used when we run out of ATOMIC memory.

Thanks
Sridhar
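The message above says the management apps will be pinned into memory but does not say how. One common way for a user-space process to do that on Linux is mlockall(2); the sketch below assumes that approach, and the wrapper name pin_process_memory is hypothetical. The call usually needs CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK, so failure has to be handled.

```c
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

/*
 * Try to lock all current and future pages of the calling process in RAM
 * so that they cannot be swapped out.
 * Returns 0 on success, or a negative errno value on failure (commonly
 * -EPERM or -ENOMEM when the process lacks the privilege or the limit).
 */
int pin_process_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        assert(errno > 0);  /* mlockall sets errno whenever it fails */
        return -errno;
    }
    return 0;
}
```

A management daemon would call this once at startup, before opening its critical socket, and refuse to run (or at least log loudly) if it returns an error. Pinning only protects the process's own pages; it does nothing for the kernel allocations made on the packet path, which is what the rest of this thread is about.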
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, Dec 14, 2005 at 09:55:45AM -0800, Sridhar Samudrala wrote:

On Wed, 2005-12-14 at 10:22 +0100, Andi Kleen wrote:

I would appreciate any feedback or comments on this approach.

Maybe I'm missing something, but wouldn't you need a separate critical pool (or at least a reservation) for each socket to be safe against deadlocks? Otherwise, if a critical socket needs e.g. 2 pages to finish something and 2 critical sockets are active, they can each steal the last pages from each other and deadlock.

Here we are assuming that the pre-allocated critical page pool is big enough to satisfy the requirements of all the critical sockets.

Not a good assumption. A system can have anywhere from 1 to 1000 iSCSI connections open, and we certainly don't want to preallocate enough room for 1000 connections to make progress when we might only have one in use.

I think we need a global receive pool and per-socket send pools.

--
Mathematics is the supreme nostalgia of our time.
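Andi's deadlock example can be reproduced with a toy model. The sketch below is purely illustrative: page_pool, crit_sock and the helper functions are hypothetical names, allocation is interleaved round-robin (the worst case for a shared pool), and each socket needs 2 pages as in the example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_NEEDED 2  /* pages one socket needs to finish its operation */

struct page_pool {
    int free_pages;
};

struct crit_sock {
    struct page_pool *pool;  /* pool this socket allocates from */
    int held;                /* pages currently held by the socket */
};

/* Take one page from the socket's pool; returns false if the pool is empty. */
bool take_page(struct crit_sock *s)
{
    assert(s != NULL && s->pool != NULL);
    if (s->pool->free_pages == 0)
        return false;
    s->pool->free_pages--;
    s->held++;
    return true;
}

/* A socket can finish (and later release its pages) only with PAGES_NEEDED pages. */
bool can_finish(const struct crit_sock *s)
{
    return s->held >= PAGES_NEEDED;
}

/*
 * Let two sockets allocate one page at a time, alternating, until neither
 * can take another page. Returns true if both sockets are able to finish.
 */
bool run_interleaved(struct crit_sock *a, struct crit_sock *b)
{
    bool progress = true;

    while (progress) {
        progress = false;
        if (!can_finish(a) && take_page(a))
            progress = true;
        if (!can_finish(b) && take_page(b))
            progress = true;
    }
    return can_finish(a) && can_finish(b);
}
```

With one shared pool of 2 pages, each socket ends up holding 1 page and neither can finish; with a 2-page reserve per socket, both finish. That is the argument for per-socket send pools, at the cost of reserving memory for every critical socket.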
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:

From: Matt Mackall [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 19:39:37 -0800

I think we need a global receive pool and per-socket send pools.

Mind telling everyone how you plan to make use of the global receive pool when the allocation happens in the device driver and we have no idea which socket the packet is destined for? What should be done for non-local packets being routed? The device drivers allocate packets for the entire system, long before we know who the eventually received packets are for. It is fully anonymous memory, and it's easy to design cases where the whole pool can be eaten up by non-local forwarded packets.

There need to be two rules:

iff global memory critical flag is set
- allocate from the global critical receive pool on receive
- return packet to global pool if not destined for a socket with an attached send mempool

I think this will provide the desired behavior, though only probabilistically. That is, we can fill the global receive pool with uninteresting packets such that we're forced to drop critical ACKs, but the boring packets will eventually be discarded as we walk up the stack and we'll eventually have room to receive retried ACKs.

I truly dislike these patches being discussed because they are a complete hack, and admittedly don't even solve the problem fully. I don't have any concrete better ideas but that doesn't mean this stuff should go into the tree.

Agreed. I'm fairly convinced a full fix is doable, if you make a couple of assumptions (limited fragmentation), but it will unavoidably be less than pretty as it needs to cross some layers.

I think GFP_ATOMIC memory pools are more powerful than they are given credit for. There is nothing preventing the implementation of dynamic GFP_ATOMIC watermarks, and having critical socket behavior kick in in response to hitting those watermarks.

There are two problems with GFP_ATOMIC. The first is that its users don't pre-state their worst-case usage, which means sizing the pool to reliably avoid deadlocks is impossible. The second is that there aren't any guarantees that GFP_ATOMIC allocations are actually critical in the needed-to-make-forward-VM-progress sense or will be returned to the pool in a timely fashion.

So I do think we need a distinct pool if we want to tackle this problem. Though it's probably worth mentioning that Linus was rather adamantly against even trying at KS.

--
Mathematics is the supreme nostalgia of our time.
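The two rules proposed above can be sketched as a small user-space model. Nothing here is kernel code: rx_pool, rx_packet, rx_sock, rx_alloc and rx_deliver are hypothetical names, the pool counts whole packets rather than pages, and allocation outside the critical state is modeled as always succeeding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rx_pool {
    int free_pkts;             /* buffers left in the global critical receive pool */
};

struct rx_sock {
    bool has_send_mempool;     /* true for a socket marked critical */
};

struct rx_packet {
    bool from_critical_pool;   /* remembers where the buffer came from */
};

/*
 * Rule 1: while the global memory-critical flag is set, receive buffers
 * come from the global critical pool. Returns false if the pool is empty,
 * in which case the incoming packet is dropped.
 */
bool rx_alloc(struct rx_pool *pool, bool memory_critical, struct rx_packet *pkt)
{
    assert(pool != NULL && pkt != NULL);

    if (!memory_critical) {
        pkt->from_critical_pool = false;  /* normal path, not modeled further */
        return true;
    }
    if (pool->free_pkts == 0)
        return false;
    pool->free_pkts--;
    pkt->from_critical_pool = true;
    return true;
}

/*
 * Rule 2: once the destination is known, a pool-backed packet that is not
 * for a socket with an attached send mempool is discarded and its buffer
 * returned. dst may be NULL for non-local (forwarded) traffic.
 * Returns true if the packet is delivered.
 */
bool rx_deliver(struct rx_pool *pool, struct rx_packet *pkt, const struct rx_sock *dst)
{
    assert(pool != NULL && pkt != NULL);

    if (pkt->from_critical_pool && (dst == NULL || !dst->has_send_mempool)) {
        pool->free_pkts++;
        return false;
    }
    return true;
}
```

The model also shows why the behavior is only probabilistic: with a one-buffer pool, an uninteresting packet that arrives first makes the next rx_alloc fail, so a critical ACK arriving in that window is lost and has to be retransmitted after rx_deliver has returned the buffer.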
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: Matt Mackall [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 21:02:50 -0800

There need to be two rules:

iff global memory critical flag is set
- allocate from the global critical receive pool on receive
- return packet to global pool if not destined for a socket with an attached send mempool

This shuts off a router and/or firewall just because iSCSI or NFS peed in its pants. Not really acceptable.

I think this will provide the desired behavior

It's not desirable. What if iSCSI is protected by IPSEC, and the key management daemon has to process a security association expiration and negotiate a new one in order for iSCSI to further communicate with its peer when this memory shortage occurs? It needs to send packets back and forth with the remote key management daemon in order to do this, but since you cut it off with this critical receive pool, the negotiation will never succeed.

This stuff won't work. It's not a generic solution and that's why it has more holes than Swiss cheese. :-)
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:

From: Matt Mackall [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 19:39:37 -0800

I think we need a global receive pool and per-socket send pools.

Mind telling everyone how you plan to make use of the global receive pool when the allocation happens in the device driver and we have no idea which socket the packet is destined for? What should be done for

In theory one could use multiple receive queues on an intelligent enough NIC, with the NIC distinguishing the sockets. But that would still be a nasty "you need advanced hardware FOO to avoid subtle problem Y" case. Also, it would require lots of driver hacking. And most NICs seem to have limits on the size of the socket tables for this, which means you would end up in an "only N sockets supported safely" situation, with N likely being quite small on common hardware.

I think the idea of the original poster was that just freeing non-critical packets again after a short time would be good enough, but I'm a bit sceptical about that.

I truly dislike these patches being discussed because they are a complete hack, and admittedly don't even solve the problem fully. I

I agree.

I think GFP_ATOMIC memory pools are more powerful than they are given credit for. There is nothing preventing the implementation of dynamic

Their main problem is that they are used too widely and in a lot of situations that aren't really critical.

-Andi
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
David S. Miller wrote:

From: Matt Mackall [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 21:02:50 -0800

There need to be two rules:

iff global memory critical flag is set
- allocate from the global critical receive pool on receive
- return packet to global pool if not destined for a socket with an attached send mempool

This shuts off a router and/or firewall just because iSCSI or NFS peed in its pants. Not really acceptable.

But that should only happen (shutting off a router and/or firewall) in cases where we now completely deadlock and never recover, including shutting off the router and firewall, because they don't have enough memory to recv packets either.

I think this will provide the desired behavior

It's not desirable. What if iSCSI is protected by IPSEC, and the key management daemon has to process a security association expiration and negotiate a new one in order for iSCSI to further communicate with its peer when this memory shortage occurs? It needs to send packets back and forth with the remote key management daemon in order to do this, but since you cut it off with this critical receive pool, the negotiation will never succeed.

I guess IPSEC would be a critical socket too, in that case. Sure, there is nothing we can do if the daemon insists on allocating lots of memory...

This stuff won't work. It's not a generic solution and that's why it has more holes than Swiss cheese. :-)

True, it will have holes. I think something that is complementary and would be desirable is to simply limit the amount of in-flight writeout that things like NFS allow (or used to allow; I haven't checked for a while and there were noises about it getting better).

--
SUSE Labs, Novell Inc.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, 14 Dec 2005 21:23:09 -0800 (PST) David S. Miller [EMAIL PROTECTED] wrote:

From: Matt Mackall [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 21:02:50 -0800

There need to be two rules:

iff global memory critical flag is set
- allocate from the global critical receive pool on receive
- return packet to global pool if not destined for a socket with an attached send mempool

This shuts off a router and/or firewall just because iSCSI or NFS peed in its pants. Not really acceptable.

I think this will provide the desired behavior

It's not desirable. What if iSCSI is protected by IPSEC, and the key management daemon has to process a security association expiration and negotiate a new one in order for iSCSI to further communicate with its peer when this memory shortage occurs? It needs to send packets back and forth with the remote key management daemon in order to do this, but since you cut it off with this critical receive pool, the negotiation will never succeed.

This stuff won't work. It's not a generic solution and that's why it has more holes than Swiss cheese. :-)

Also, all this stuff is just a band-aid because Linux OOM behavior is so fucked up. The VM system just lets users dig themselves into a huge overcommit, then we get into trying to change every other system to compensate. How about cutting things off earlier, and not falling off the cliff? How about pushing out pages to swap earlier, when memory pressure starts to get noticed? Then you can free those non-dirty pages to make progress. Too many of the VM decisions seem to be made in favor of keep-it-in-memory benchmark situations.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Thu, 15 Dec 2005 06:42:45 +0100 Andi Kleen [EMAIL PROTECTED] wrote:

On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:

From: Matt Mackall [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 19:39:37 -0800

I think we need a global receive pool and per-socket send pools.

Mind telling everyone how you plan to make use of the global receive pool when the allocation happens in the device driver and we have no idea which socket the packet is destined for? What should be done for

In theory one could use multiple receive queues on an intelligent enough NIC, with the NIC distinguishing the sockets. But that would still be a nasty "you need advanced hardware FOO to avoid subtle problem Y" case. Also, it would require lots of driver hacking. And most NICs seem to have limits on the size of the socket tables for this, which means you would end up in an "only N sockets supported safely" situation, with N likely being quite small on common hardware.

I think the idea of the original poster was that just freeing non-critical packets again after a short time would be good enough, but I'm a bit sceptical about that.

I truly dislike these patches being discussed because they are a complete hack, and admittedly don't even solve the problem fully. I

I agree.

I think GFP_ATOMIC memory pools are more powerful than they are given credit for. There is nothing preventing the implementation of dynamic

Their main problem is that they are used too widely and in a lot of situations that aren't really critical.

Most of the use of GFP_ATOMIC is by stuff that could fail but can't sleep waiting for memory. How about adding a GFP_NORMAL for allocations while holding a lock?

    #define GFP_NORMAL (__GFP_NOMEMALLOC)

Then get people to change the unneeded GFP_ATOMICs to GFP_NORMAL in places where the error paths are reasonable.
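The intent of the GFP_NORMAL suggestion (a non-sleeping allocation that fails early instead of eating into reserves) can be shown with a toy allocator. This is a deliberately simplified model and not the kernel's page allocator: the M_-prefixed flags, model_zone and model_alloc_page are hypothetical, and the model collapses the kernel's separate watermark and emergency-reserve mechanisms into a single reserve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model flags; the values are not the kernel's. */
#define M_GFP_HIGH        0x1u  /* caller may dip into the reserve */
#define M_GFP_NOMEMALLOC  0x2u  /* caller must never touch the reserve */

#define M_GFP_ATOMIC  (M_GFP_HIGH)
#define M_GFP_NORMAL  (M_GFP_NOMEMALLOC)

struct model_zone {
    int free_pages;     /* pages currently free, reserve included */
    int reserve_pages;  /* pages held back for allocations that really need them */
};

/*
 * Non-sleeping allocation of one page. Returns false instead of waiting.
 * Only callers with M_GFP_HIGH and without M_GFP_NOMEMALLOC may take
 * pages out of the reserve.
 */
bool model_alloc_page(struct model_zone *z, unsigned int flags)
{
    int floor;

    assert(z != NULL);

    if ((flags & M_GFP_HIGH) && !(flags & M_GFP_NOMEMALLOC))
        floor = 0;
    else
        floor = z->reserve_pages;

    if (z->free_pages <= floor)
        return false;

    z->free_pages--;
    return true;
}
```

In this model a caller converted from M_GFP_ATOMIC to M_GFP_NORMAL starts failing as soon as only the reserve is left, which leaves those pages for the callers that cannot tolerate failure. That only helps where, as the message says, the error paths are reasonable.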
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, 14 Dec 2005, David S. Miller wrote:

From: Matt Mackall [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 19:39:37 -0800

I think we need a global receive pool and per-socket send pools.

Mind telling everyone how you plan to make use of the global receive pool when the allocation happens in the device driver and we have no idea which socket the packet is destined for? What should be done for non-local packets being routed? The device drivers allocate packets for the entire system, long before we know who the eventually received packets are for. It is fully anonymous memory, and it's easy to design cases where the whole pool can be eaten up by non-local forwarded packets.

I truly dislike these patches being discussed because they are a complete hack, and admittedly don't even solve the problem fully. I don't have any concrete better ideas but that doesn't mean this stuff should go into the tree.

I think GFP_ATOMIC memory pools are more powerful than they are given credit for. There is nothing preventing the implementation of dynamic GFP_ATOMIC watermarks, and having critical socket behavior kick in in response to hitting those watermarks.

Does this mean that you are OK with having a mechanism to mark sockets as critical and dropping the non-critical packets under emergency, but you do not like having a separate critical page pool? Instead, you seem to be suggesting that in_emergency be set dynamically when we are about to run out of ATOMIC memory. Is this right?

Thanks
Sridhar
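The reading of David's suggestion in the question above, with in_emergency driven by a watermark rather than by a management app, could look roughly like the sketch below. It is a user-space model under stated assumptions: atomic_state, update_emergency and should_drop are hypothetical names, and the use of two watermarks (enter below the low one, leave above the high one) is an assumption added here to avoid flapping; the thread does not specify it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct atomic_state {
    int free_pages;      /* pages currently available to atomic allocations */
    int low_watermark;   /* enter emergency when free_pages drops below this */
    int high_watermark;  /* leave emergency when free_pages rises above this */
    bool in_emergency;
};

/* Re-evaluate the emergency state after free_pages has changed. */
void update_emergency(struct atomic_state *s)
{
    assert(s != NULL && s->low_watermark <= s->high_watermark);

    if (!s->in_emergency && s->free_pages < s->low_watermark)
        s->in_emergency = true;
    else if (s->in_emergency && s->free_pages > s->high_watermark)
        s->in_emergency = false;
}

/* Under emergency, packets for sockets not marked critical are dropped. */
bool should_drop(const struct atomic_state *s, bool sock_is_critical)
{
    return s->in_emergency && !sock_is_critical;
}
```

Compared with the patch as posted, this keeps the critical-socket marking but replaces the externally triggered emergency state and the separate page pool with a threshold on existing atomic memory.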