Re: VIMAGE UDP memory leak fix
Hi, On 22/11/14 18:09, Robert N. M. Watson wrote: On 21 Nov 2014, at 17:40, Adrian Chadd adr...@freebsd.org wrote: Skimming through a bunch of hosts with moderately loaded hosts with reasonably high uptime I couldn't find one where net.inet.tcp.timer_race was not zero. A ny suggestions how to best reproduce the race(s) in tcp_timer.c? They would likely occur only on very highly loaded hosts, as they require race conditions to arise between TCP timers and TCP close. I think I did manage to reproduce it at one stage, and left the counter in to see if we could spot it in production, and I have had (multiple) reports of it in deployed systems. I'm not sure it's worth trying to reproduce them, given that knowledge -- we should simply fix them. Wasn't this just fixed by Julien @ Verisign? I don't believe so, although it's the kind of thing Julien is very good at fixing! The issue here is that we can't call callout_drain() from contexts where we finalise TCP connection close and attempt to free the inpcb. The 'easy' fix is to create a taskqueue thread to do the callout_drain() in the event that we discover that callout_stop() isn't able to guarantee that pending callouts are neither in execution nor scheduled. We'd then defer the very tail of TCP teardown to that asynchronous context rather than trying to do it to completion in the current (and rather more sensitive) one. This would happen only very in frequently so have little overhead in practice, although one would want to carefully look at the sync behaviour to make sure it wasn't frequently enough that a backlog might build up. Ironically, I was not aware of this discussion before we: - (re)discovered this TCP timers callout race condition - followed Robert's XXXRW comments and advices in source code - proposed a patch for fixing it: Fix TCP timers use-after-free old race conditions https://reviews.freebsd.org/D2079 If the proposed patch follows most of Robert's advices, instead of using a taskqueue thread calling callout_drain(), we re-use directly the callout API to schedule tcpcb's uma_zfree() at appropriate time. Register to the review if your are interested in this fix proposition. But still, nice irony to find this thread _after_ proposing a fix... :) -- Julien signature.asc Description: OpenPGP digital signature
Re: VIMAGE UDP memory leak fix
On 21 Nov 2014, at 17:40, Adrian Chadd adr...@freebsd.org wrote: Skimming through a bunch of hosts with moderately loaded hosts with reasonably high uptime I couldn't find one where net.inet.tcp.timer_race was not zero. A ny suggestions how to best reproduce the race(s) in tcp_timer.c? They would likely occur only on very highly loaded hosts, as they require race conditions to arise between TCP timers and TCP close. I think I did manage to reproduce it at one stage, and left the counter in to see if we could spot it in production, and I have had (multiple) reports of it in deployed systems. I'm not sure it's worth trying to reproduce them, given that knowledge -- we should simply fix them. Wasn't this just fixed by Julien @ Verisign? I don't believe so, although it's the kind of thing Julien is very good at fixing! The issue here is that we can't call callout_drain() from contexts where we finalise TCP connection close and attempt to free the inpcb. The 'easy' fix is to create a taskqueue thread to do the callout_drain() in the event that we discover that callout_stop() isn't able to guarantee that pending callouts are neither in execution nor scheduled. We'd then defer the very tail of TCP teardown to that asynchronous context rather than trying to do it to completion in the current (and rather more sensitive) one. This would happen only very in frequently so have little overhead in practice, although one would want to carefully look at the sync behaviour to make sure it wasn't frequently enough that a backlog might build up. As for the vimage stability side of things - I'd really like to see some VIMAGE torture tests written. Stuff like do a high rate TCP connection test whilst creating and destroying VIMAGEs. ... and even for non-VIMAGE. :-) Robert ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: VIMAGE UDP memory leak fix
On 20 Nov 2014, at 23:29, Marko Zec z...@fer.hr wrote: Can folks take a look at this? https://reviews.freebsd.org/D1201 All UMA zones used in the network stack have been marked as UMA_ZONE_NOFREE for ages, probably for a reason, so perhaps it might not hurt to provide more insight why and how it suddenly became safe to remove that flag? Historically, this was (if I recall) a property of the way data was exported for netstat, which depended on memory stability of various data types. We have worked quite hard to remove the causes of those sorts of dependencies by introducing stronger reference and memory ownership models throughout the stack -- in the case of inpcbs, for example, we introduced a true reference model during the multiprocessing scalability work (which, it should be pointed out, was enormously painful and took years to shake the bugs out of due to complexity/subtlety). It may be that it is now safe to remove UMA_ZONE_NOFREE for some of the types where it was previously required -- but it's definitely something you want to do intentionally and in the context of a careful analysis to convince yourself that all the causes have been addressed. A fair amount of stress testing in low-memory conditions wouldn't hurt either... Robert ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: VIMAGE UDP memory leak fix
On Fri, 21 Nov 2014 08:25:48 + Robert N. M. Watson rwat...@freebsd.org wrote: On 20 Nov 2014, at 23:29, Marko Zec z...@fer.hr wrote: Can folks take a look at this? https://reviews.freebsd.org/D1201 All UMA zones used in the network stack have been marked as UMA_ZONE_NOFREE for ages, probably for a reason, so perhaps it might not hurt to provide more insight why and how it suddenly became safe to remove that flag? Historically, this was (if I recall) a property of the way data was exported for netstat, which depended on memory stability of various data types. We have worked quite hard to remove the causes of those sorts of dependencies by introducing stronger reference and memory ownership models throughout the stack -- in the case of inpcbs, for example, we introduced a true reference model during the multiprocessing scalability work (which, it should be pointed out, was enormously painful and took years to shake the bugs out of due to complexity/subtlety). It may be that it is now safe to remove UMA_ZONE_NOFREE for some of the types where it was previously required -- but it's definitely something you want to do intentionally and in the context of a careful analysis to convince yourself that all the causes have been addressed. A fair amount of stress testing in low-memory conditions wouldn't hurt either... If data stability for userland export was the only factor mandating UMA_ZONE_NOFREE flagging then indeed it may be safe to remove that flag, since the VNET / prison referencing model guarantees that no VNET teardown can commence as long as there are processes, sockets or ifnets attached to a particular VNET. But as you said that change would affect both VIMAGE and non-VIMAGE builds, so extensive testing would be warranted here. Nevertheless, I'd prefer most of network stack UMA zones to be de-virtualized, at least those which cannot cause interference between VNETs, and that excludes syncache, reassembly, hostcache and the likes. De-virtualization doesn't require touching the UMA_ZONE_NOFREE flag, so doesn't affect non-VIMAGE builds. Any objections to that approach? Marko ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: VIMAGE UDP memory leak fix
On 21 Nov 2014, at 08:58, Marko Zec z...@fer.hr wrote: Nevertheless, I'd prefer most of network stack UMA zones to be de-virtualized, at least those which cannot cause interference between VNETs, and that excludes syncache, reassembly, hostcache and the likes. De-virtualization doesn't require touching the UMA_ZONE_NOFREE flag, so doesn't affect non-VIMAGE builds. Any objections to that approach? To my mind, the only real concern is whether or not you lose access to resource allocation limits that would previously have been present. On the whole, we've tried to centralise resource limitations on kernel memory allocation in UMA, and it would be great if we could find a nice approach to implementing both per-vimage and per-system allocation limits. One thing I'd pondered in the past was whether we could move to a single zone, with its own limits/etc, but also the ability to pass an optional uma_resourcepool_t that allowed additional limits to be imposed based on some other criteria -- e.g., vimage membership. That strikes me as a somewhat complex proposal that would bring new performance/synchronisation concerns, so isn't necessarily something to act on. However, the upshot is that, although I do not oppose combining the zones, we should be aware that we're eliminating a form of resource partitioning between vimages that we may want to find some other solution for (ideally, an elegant one). Robert ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: VIMAGE UDP memory leak fix
On Fri, 21 Nov 2014 09:07:28 + Robert N. M. Watson rwat...@freebsd.org wrote: On 21 Nov 2014, at 09:05, Robert N. M. Watson rwat...@freebsd.org wrote: To my mind, the only real concern is whether or not you lose access to resource allocation limits that would previously have been present. On the whole, we've tried to centralise resource limitations on kernel memory allocation in UMA, and it would be great if we could find a nice approach to implementing both per-vimage and per-system allocation limits. One thing I'd pondered in the past was whether we could move to a single zone, with its own limits/etc, but also the ability to pass an optional uma_resourcepool_t that allowed additional limits to be imposed based on some other criteria -- e.g., vimage membership. That strikes me as a somewhat complex proposal that would bring new performance/synchronisation concerns, so isn't necessarily something to act on. However, the upshot is that, although I do not oppose combining the zones, we should be aware that we're eliminating a form of resource partitioning between vimages that we may want to find some other solution for (ideally, an elegant one). And, to respond to your more general comment: I agree that a decision about removing the NOFREE flag should be made independently of choices about devirtualisation. The former probably should be sorted out at this point, as eliminating NOFREE zones has more general benefits to the kernel, but it would be nice not to depend on that to resolve other problems. It could be that a few of us scratching our heads for a few hours each can resolve whether NOFREE can now be safely removed, in which case we should do so (and then allow quite a lot of baking time before it ships in a release). The important thing here is not to loose the momentum and energy which Craig is putting in cleaning up VIMAGE, so if we take the route of eliminating the UMA_ZONE_NOFREE flag (or not), that should be decided with rough consensus and without too much delays, and without too much bikeshed re. what could be nice to have in some distant future and whatnot. The current goal is cleaning up VIMAGE and making it palatable for 11.0, without taking too much risk at breaking other things. I'm strongly opposed to even thinking aloud about any new VNET-related features or concepts until VIMAGE finally becomes enabled in /head. Marko ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: VIMAGE UDP memory leak fix
On 21 Nov 2014, at 10:19, Robert N. M. Watson rwat...@freebsd.org wrote: The important thing here is not to loose the momentum and energy which Craig is putting in cleaning up VIMAGE, so if we take the route of eliminating the UMA_ZONE_NOFREE flag (or not), that should be decided with rough consensus and without too much delays, and without too much bikeshed re. what could be nice to have in some distant future and whatnot. The current goal is cleaning up VIMAGE and making it palatable for 11.0, without taking too much risk at breaking other things. I'm strongly opposed to even thinking aloud about any new VNET-related features or concepts until VIMAGE finally becomes enabled in /head. In the long list of things that are hard in operating systems, reasoning about data-structure reference counting and memory models in the presence of concurrency is pretty high on the list. I think you would not want to rush that sort of thing, and hence my feeling that you want to disentangle these two concerns. Throwing the switch on the UMA_ZONE_NOFREE flag setting is easy, but debugging race conditions involving use-after-free in the field is incredibly hard. If you can avoid depending on this, especially if we throw the switch and then discover in a few weeks or months that it wasn't as simple as we thought, then you put other more concrete goals substantially less at risk. We can go away and think hard about UMA_ZONE_NOFREE for a few weeks, and follow up, but I recommend you find another solution in the mean time if you are worried about stalling VIMAGE cleanups. (Alternatively, as I suggested in an earlier e-mail, you can convince us that you've done that analysis already by citing the revision history where all the pertinent problems were resolved, along with an informal argument that no other problems remain, which would mean much less waiting for us to try to make that argument for you. However, making the NOFREE change without going through that process is a recipe for disaster that we'd all rather avoid!) Robert ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: VIMAGE UDP memory leak fix
On 21 Nov 2014, at 08:25 , Robert N. M. Watson rwat...@freebsd.org wrote: On 20 Nov 2014, at 23:29, Marko Zec z...@fer.hr wrote: Can folks take a look at this? https://reviews.freebsd.org/D1201 All UMA zones used in the network stack have been marked as UMA_ZONE_NOFREE for ages, probably for a reason, so perhaps it might not hurt to provide more insight why and how it suddenly became safe to remove that flag? Historically, this was (if I recall) a property of the way data was exported for netstat, which depended on memory stability of various data types. We have worked quite hard to remove the causes of those sorts of dependencies by introducing stronger reference and memory ownership models throughout the stack -- in the case of inpcbs, for example, we introduced a true reference model during the multiprocessing scalability work (which, it should be pointed out, was enormously painful and took years to shake the bugs out of due to complexity/subtlety). It may be that it is now safe to remove UMA_ZONE_NOFREE for some of the types where it was previously required -- but it's definitely something you want to do intentionally and in the context of a careful analysis to convince yourself that all the causes have been addressed. A fair amount of stress testing in low-memory conditions wouldn't hurt either... I had convinced myself for UDP many years ago that it was ok to remove it. People have touched the code however so it’s definitively worth re-checking again. For TCP we clearly cannot do it (yet, and couldn’t back then). But TCP was the only of the few cases I had left unfixed back then. Can’t remember what the others were (was like 3 or 4 of them; could also be some of the fixes were indeed committed back then. — Bjoern A. Zeeb Come on. Learn, goddamn it., WarGames, 1983 ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: VIMAGE UDP memory leak fix
On 21 Nov 2014, at 11:02, Marko Zec z...@fer.hr wrote: Now that we've found ourselves in this discussion, I'm really becoming curious why exactly do we need UMA_ZONE_NOFREE for network stack zones at all? Admittedly, I always thought that the primary purpose of UMA_ZONE_NOFREE was to prevent uma_reclaim() from paging out _used_ zone pages, but reviewing the uma code reveals that this might not be the case, i.e. that NOFREE only prevents _unused_ pages to be freed by uma_reclaim(). Moreover, all uma_zalloc() calls as far as I can see are flagged as M_NOWAIT and are followed by checks for allocation failures, so that part seems to be covered. So, what's really the problem which UMA_ZONE_NOFREE flagging is supposed to solve these days? (you claim that we clearly need it for TCP - why)? UMA_ZONE_NOFREE tells UMA that it can't reclaim unused slabs for the zone to be returned to the VM system for reuse elsewhere under memory pressure. UMA memory isn't pageable, so there's no link to paging policy: although soft-TLB systems might experience TLB miss exceptions on UMA-allocated kernel memory, you should never experience a page fault against it (in absence of a bug). Reclaim of unused slabs can happen, for example, if VM discovers it is low on free pages, in which case it notifies various kernel subsystems that it is feeling a bit cramped -- that same mechanism that, for example, triggers TCP to throw away reassembly buffers that haven't yet been ACK'd (although might have been SACK'd). You might expect this to happen in situations where first a large load spike happens for a particular UMA type (e.g., a DDoS opens lots of TCP connections), and then they are freed, leading to lots of socket/incpb slabs lying around unused, which eventually VM will ask be returned. It is highly desirable for UMA_ZONE_NOFREE to be removed from zones wherever possible so that memory can be returned under such circumstances, and it is not a good feature that the flag is present anywhere. Subsystems pick up a dependence on UMA_ZONE_NOFREE if freed objects might be referenced after free. My understanding is that this is pretty old inherited behaviour from prior kernel memory allocators that didn't know how to return memory to VM. Given that property, it was safe to write code that might, for the purposes of efficiency, assume that it could walk data structures of the type with fewer synchronisation overheads -- or where synchronisation isn't possible (e.g., for direct access to kernel memory via /dev/kmem). We have been attempting to expunge those assumptions wherever possible -- these days, netstat uses sysctl()s that acquire references to all live inpcbs keeping them valid while they are copied out (you can't hold low-level locks during copyout() as sysctl might encounter a paging event writing to user memory). Convincing yourself that all such assumptions have been removed is a moderate amount of work, and if you get it wrong, you get use-after-free races that occur only in low-memory conditions, which are quite hard to figure out (read: almost impossible). Bjoern can say more about what motivated his specific comment -- I had hoped that we'd quietly lost dependence on NOFREE over the last decade and could finally garbage collect it, but perhaps not! Robert ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: VIMAGE UDP memory leak fix
On 21 Nov 2014, at 11:02, Marko Zec z...@fer.hr wrote: I had convinced myself for UDP many years ago that it was ok to remove it. People have touched the code however so it’s definitively worth re-checking again. For TCP we clearly cannot do it (yet, and couldn’t back then). But TCP was the only of the few cases I had left unfixed back then. Can’t remember what the others were (was like 3 or 4 of them; could also be some of the fixes were indeed committed back then. Now that we've found ourselves in this discussion, I'm really becoming curious why exactly do we need UMA_ZONE_NOFREE for network stack zones at all? Admittedly, I always thought that the primary purpose of UMA_ZONE_NOFREE was to prevent uma_reclaim() from paging out _used_ zone pages, but reviewing the uma code reveals that this might not be the case, i.e. that NOFREE only prevents _unused_ pages to be freed by uma_reclaim(). Moreover, all uma_zalloc() calls as far as I can see are flagged as M_NOWAIT and are followed by checks for allocation failures, so that part seems to be covered. So, what's really the problem which UMA_ZONE_NOFREE flagging is supposed to solve these days? (you claim that we clearly need it for TCP - why)? Bjoern and I chatted for the last twenty or so minutes about the code, and believe that as things stand, it is *not* safe to turn off UMA_ZONE_NOFREE for TCP due to a teardown race in TCP that has been known about and discussed for several years, but is some work to resolve and that we've not yet found time to do so. The XXXRW's in tcp_timer.c are related to this. We're pondering ways to fix it but think this is not something that can be rushed. Robert ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
VIMAGE UDP memory leak fix
Hi, Can folks take a look at this? https://reviews.freebsd.org/D1201 -- Craig ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: VIMAGE UDP memory leak fix
On Thu, 20 Nov 2014 10:02:46 -0800 Craig Rodrigues rodr...@freebsd.org wrote: Hi, Can folks take a look at this? https://reviews.freebsd.org/D1201 All UMA zones used in the network stack have been marked as UMA_ZONE_NOFREE for ages, probably for a reason, so perhaps it might not hurt to provide more insight why and how it suddenly became safe to remove that flag? One possible alternative is to de-virtualize V_udbinfo, V_udpcb_zone etc. which I have suggested a few times in the past but those proposals have been rejected based on expectations that one day our network stack may benefit from more parallelism of decoupled UMA zones for each VNET, though I'm not aware of further developments in that direction. The primary (and in many cases only) reason I have virtualized network stack zones was to simplify tracking of inter-VNET leaks. As bugs that caused such leaks seem to have been cleaned up some 7 years ago or so, I see no technical reason to maintain separate UMA zones for each VNET, especially not those which are size-limited to maxsockets global limit anyhow. Marko ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org