Re: VIMAGE UDP memory leak fix

2015-03-25 Thread Julien Charbon

 Hi,

On 22/11/14 18:09, Robert N. M. Watson wrote:
 On 21 Nov 2014, at 17:40, Adrian Chadd adr...@freebsd.org wrote:
 Skimming through a bunch of moderately loaded hosts
 with reasonably high uptime I couldn't find one where
 net.inet.tcp.timer_race was not zero. Any suggestions on how to
 best reproduce the race(s) in tcp_timer.c?
 
 They would likely occur only on very highly loaded hosts, as they
 require race conditions to arise between TCP timers and TCP
 close. I think I did manage to reproduce it at one stage, and
 left the counter in to see if we could spot it in production, and
 I have had (multiple) reports of it in deployed systems. I'm not
 sure it's worth trying to reproduce them, given that knowledge --
 we should simply fix them.
 
 Wasn't this just fixed by Julien @ Verisign?
 
 I don't believe so, although it's the kind of thing Julien is very
 good at fixing!
 
 The issue here is that we can't call callout_drain() from contexts
 where we finalise TCP connection close and attempt to free the inpcb.
 The 'easy' fix is to create a taskqueue thread to do the
 callout_drain() in the event that we discover that callout_stop()
 isn't able to guarantee that pending callouts are neither in
 execution nor scheduled. We'd then defer the very tail of TCP
 teardown to that asynchronous context rather than trying to do it to
 completion in the current (and rather more sensitive) one. This would
 happen only very infrequently and so have little overhead in practice,
 although one would want to look carefully at the sync behaviour to
 make sure it wasn't happening frequently enough that a backlog might build up.

 Ironically, I was not aware of this discussion before we:

- (re)discovered this TCP timers callout race condition
- followed Robert's XXXRW comments and advice in the source code
- proposed a patch to fix it:

Fix TCP timers use-after-free old race conditions
https://reviews.freebsd.org/D2079

 While the proposed patch follows most of Robert's advice, instead of
using a taskqueue thread to call callout_drain(), we directly reuse the
callout API to schedule the tcpcb's uma_zfree() at the appropriate time.
Register for the review if you are interested in this proposed fix.

 But still, nice irony to find this thread _after_ proposing a fix... :)

--
Julien



Re: VIMAGE UDP memory leak fix

2014-11-22 Thread Robert N. M. Watson

On 21 Nov 2014, at 17:40, Adrian Chadd adr...@freebsd.org wrote:

 Skimming through a bunch of moderately loaded hosts with
 reasonably high uptime I couldn't find one where net.inet.tcp.timer_race
 was not zero. Any suggestions on how to best reproduce the race(s) in
 tcp_timer.c?
 
 They would likely occur only on very highly loaded hosts, as they require 
 race conditions to arise between TCP timers and TCP close. I think I did 
 manage to reproduce it at one stage, and left the counter in to see if we 
 could spot it in production, and I have had (multiple) reports of it in 
 deployed systems. I'm not sure it's worth trying to reproduce them, given 
 that knowledge -- we should simply fix them.
 
 Wasn't this just fixed by Julien @ Verisign?

I don't believe so, although it's the kind of thing Julien is very good at 
fixing!

The issue here is that we can't call callout_drain() from contexts where we 
finalise TCP connection close and attempt to free the inpcb. The 'easy' fix is 
to create a taskqueue thread to do the callout_drain() in the event that we 
discover that callout_stop() isn't able to guarantee that pending callouts are 
neither in execution nor scheduled. We'd then defer the very tail of TCP 
teardown to that asynchronous context rather than trying to do it to completion 
in the current (and rather more sensitive) one. This would happen only very
infrequently and so have little overhead in practice, although one would want to
look carefully at the sync behaviour to make sure it wasn't happening frequently
enough that a backlog might build up.

 As for the vimage stability side of things - I'd really like to see
 some VIMAGE torture tests written. Stuff like do a high rate TCP
 connection test whilst creating and destroying VIMAGEs.

... and even for non-VIMAGE. :-)

Robert

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: VIMAGE UDP memory leak fix

2014-11-21 Thread Robert N. M. Watson

On 20 Nov 2014, at 23:29, Marko Zec z...@fer.hr wrote:

 Can folks take a look at this?
 
 https://reviews.freebsd.org/D1201
 
 All UMA zones used in the network stack have been marked as
 UMA_ZONE_NOFREE for ages, probably for a reason, so perhaps it might
 not hurt to provide more insight why and how it suddenly became safe to
 remove that flag?

Historically, this was (if I recall) a property of the way data was exported 
for netstat, which depended on memory stability of various data types. We have 
worked quite hard to remove the causes of those sorts of dependencies by 
introducing stronger reference and memory ownership models throughout the stack 
-- in the case of inpcbs, for example, we introduced a true reference model 
during the multiprocessing scalability work (which, it should be pointed out, 
was enormously painful and took years to shake the bugs out of due to 
complexity/subtlety). It may be that it is now safe to remove UMA_ZONE_NOFREE 
for some of the types where it was previously required -- but it's definitely 
something you want to do intentionally and in the context of a careful analysis 
to convince yourself that all the causes have been addressed. A fair amount of 
stress testing in low-memory conditions wouldn't hurt either...

Robert


Re: VIMAGE UDP memory leak fix

2014-11-21 Thread Marko Zec
On Fri, 21 Nov 2014 08:25:48 +
Robert N. M. Watson rwat...@freebsd.org wrote:

 
 On 20 Nov 2014, at 23:29, Marko Zec z...@fer.hr wrote:
 
  Can folks take a look at this?
  
  https://reviews.freebsd.org/D1201
  
  All UMA zones used in the network stack have been marked as
  UMA_ZONE_NOFREE for ages, probably for a reason, so perhaps it might
  not hurt to provide more insight why and how it suddenly became
  safe to remove that flag?
 
 Historically, this was (if I recall) a property of the way data was
 exported for netstat, which depended on memory stability of various
 data types. We have worked quite hard to remove the causes of those
 sorts of dependencies by introducing stronger reference and memory
 ownership models throughout the stack -- in the case of inpcbs, for
 example, we introduced a true reference model during the
 multiprocessing scalability work (which, it should be pointed out,
 was enormously painful and took years to shake the bugs out of due to
 complexity/subtlety). It may be that it is now safe to remove
 UMA_ZONE_NOFREE for some of the types where it was previously
 required -- but it's definitely something you want to do
 intentionally and in the context of a careful analysis to convince
 yourself that all the causes have been addressed. A fair amount of
 stress testing in low-memory conditions wouldn't hurt either...

If data stability for userland export was the only factor mandating
UMA_ZONE_NOFREE flagging, then indeed it may be safe to remove that
flag, since the VNET / prison referencing model guarantees that no VNET
teardown can commence as long as there are processes, sockets or ifnets
attached to a particular VNET.  But, as you said, that change would
affect both VIMAGE and non-VIMAGE builds, so extensive testing would be
warranted here.

Nevertheless, I'd prefer most of the network stack UMA zones to be
de-virtualized, at least those which cannot cause interference between
VNETs, and that excludes the syncache, reassembly, hostcache and the like.
De-virtualization doesn't require touching the UMA_ZONE_NOFREE flag, so
it doesn't affect non-VIMAGE builds.  Any objections to that approach?

Marko


Re: VIMAGE UDP memory leak fix

2014-11-21 Thread Robert N. M. Watson

On 21 Nov 2014, at 08:58, Marko Zec z...@fer.hr wrote:

 Nevertheless, I'd prefer most of the network stack UMA zones to be
 de-virtualized, at least those which cannot cause interference between
 VNETs, and that excludes the syncache, reassembly, hostcache and the like.
 De-virtualization doesn't require touching the UMA_ZONE_NOFREE flag, so
 it doesn't affect non-VIMAGE builds.  Any objections to that approach?

To my mind, the only real concern is whether or not you lose access to resource 
allocation limits that would previously have been present. On the whole, we've 
tried to centralise resource limitations on kernel memory allocation in UMA, 
and it would be great if we could find a nice approach to implementing both 
per-vimage and per-system allocation limits. One thing I'd pondered in the past 
was whether we could move to a single zone, with its own limits/etc, but also 
the ability to pass an optional uma_resourcepool_t that allowed additional 
limits to be imposed based on some other criteria -- e.g., vimage membership. 
That strikes me as a somewhat complex proposal that would bring new 
performance/synchronisation concerns, so isn't necessarily something to act on. 
However, the upshot is that, although I do not oppose combining the zones, we 
should be aware that we're eliminating a form of resource partitioning between 
vimages that we may want to find some other solution for (ideally, an
elegant one).

Robert


Re: VIMAGE UDP memory leak fix

2014-11-21 Thread Marko Zec
On Fri, 21 Nov 2014 09:07:28 +
Robert N. M. Watson rwat...@freebsd.org wrote:

 
 On 21 Nov 2014, at 09:05, Robert N. M. Watson rwat...@freebsd.org
 wrote:
 
  To my mind, the only real concern is whether or not you lose access
  to resource allocation limits that would previously have been
  present. On the whole, we've tried to centralise resource
  limitations on kernel memory allocation in UMA, and it would be
  great if we could find a nice approach to implementing both
  per-vimage and per-system allocation limits. One thing I'd pondered
  in the past was whether we could move to a single zone, with its
  own limits/etc, but also the ability to pass an optional
  uma_resourcepool_t that allowed additional limits to be imposed
  based on some other criteria -- e.g., vimage membership. That
  strikes me as a somewhat complex proposal that would bring new
  performance/synchronisation concerns, so isn't necessarily
  something to act on. However, the upshot is that, although I do not
  oppose combining the zones, we should be aware that we're
  eliminating a form of resource partitioning between vimages that we
  may want to find some other solution for (ideally, an elegant one).
 
 And, to respond to your more general comment: I agree that a decision
 about removing the NOFREE flag should be made independently of
 choices about devirtualisation. The former probably should be sorted
 out at this point, as eliminating NOFREE zones has more general
 benefits to the kernel, but it would be nice not to depend on that to
 resolve other problems. It could be that a few of us scratching our
 heads for a few hours each can resolve whether NOFREE can now be
 safely removed, in which case we should do so (and then allow quite a
 lot of baking time before it ships in a release).

The important thing here is not to lose the momentum and energy which
Craig is putting into cleaning up VIMAGE, so whether we take the route of
eliminating the UMA_ZONE_NOFREE flag (or not), that should be decided
with rough consensus and without too many delays, and without too much
bikeshedding about what could be nice to have in some distant future and
whatnot.  The current goal is cleaning up VIMAGE and making it
palatable for 11.0, without taking too much risk of breaking other
things.  I'm strongly opposed to even thinking aloud about any new
VNET-related features or concepts until VIMAGE finally becomes enabled
in /head.

Marko


Re: VIMAGE UDP memory leak fix

2014-11-21 Thread Robert N. M. Watson

On 21 Nov 2014, at 10:19, Marko Zec z...@fer.hr wrote:

 The important thing here is not to lose the momentum and energy which
 Craig is putting into cleaning up VIMAGE, so whether we take the route of
 eliminating the UMA_ZONE_NOFREE flag (or not), that should be decided
 with rough consensus and without too many delays, and without too much
 bikeshedding about what could be nice to have in some distant future and
 whatnot.  The current goal is cleaning up VIMAGE and making it
 palatable for 11.0, without taking too much risk of breaking other
 things.  I'm strongly opposed to even thinking aloud about any new
 VNET-related features or concepts until VIMAGE finally becomes enabled
 in /head.
 
In the long list of things that are hard in operating systems, reasoning
about data-structure reference counting and memory models in the presence of
concurrency is pretty high on the list. I think you would not want to rush
that sort of thing, and hence my feeling that you want to disentangle these
two concerns. Throwing the switch on the UMA_ZONE_NOFREE flag setting is
easy, but debugging race conditions involving use-after-free in the field is
incredibly hard. If you can avoid depending on this, especially if we throw
the switch and then discover in a few weeks or months that it wasn't as
simple as we thought, then you put other more concrete goals substantially
less at risk. We can go away and think hard about UMA_ZONE_NOFREE for a few
weeks, and follow up, but I recommend you find another solution in the mean
time if you are worried about stalling VIMAGE cleanups.

(Alternatively, as I suggested in an earlier e-mail, you can convince us that 
you've done that analysis already by citing the revision history where all the 
pertinent problems were resolved, along with an informal argument that no other 
problems remain, which would mean much less waiting for us to try to make that 
argument for you. However, making the NOFREE change without going through that 
process is a recipe for disaster that we'd all rather avoid!)

Robert


Re: VIMAGE UDP memory leak fix

2014-11-21 Thread Bjoern A. Zeeb

On 21 Nov 2014, at 08:25, Robert N. M. Watson rwat...@freebsd.org wrote:

 
 On 20 Nov 2014, at 23:29, Marko Zec z...@fer.hr wrote:
 
 Can folks take a look at this?
 
 https://reviews.freebsd.org/D1201
 
 All UMA zones used in the network stack have been marked as
 UMA_ZONE_NOFREE for ages, probably for a reason, so perhaps it might
 not hurt to provide more insight why and how it suddenly became safe to
 remove that flag?
 
 Historically, this was (if I recall) a property of the way data was exported 
 for netstat, which depended on memory stability of various data types. We 
 have worked quite hard to remove the causes of those sorts of dependencies by 
 introducing stronger reference and memory ownership models throughout the 
 stack -- in the case of inpcbs, for example, we introduced a true reference 
 model during the multiprocessing scalability work (which, it should be 
 pointed out, was enormously painful and took years to shake the bugs out of 
 due to complexity/subtlety). It may be that it is now safe to remove 
 UMA_ZONE_NOFREE for some of the types where it was previously required -- but 
 it's definitely something you want to do intentionally and in the context of 
 a careful analysis to convince yourself that all the causes have been 
 addressed. A fair amount of stress testing in low-memory conditions wouldn't 
 hurt either...

I had convinced myself for UDP many years ago that it was OK to remove it.
People have touched the code since, however, so it's definitely worth
re-checking.

For TCP we clearly cannot do it (yet, and couldn't back then).  But TCP was
one of the few cases I had left unfixed back then.  I can't remember what
the others were (there were maybe 3 or 4 of them); it could also be that some
of the fixes were indeed committed back then.


— 
Bjoern A. Zeeb Come on. Learn, goddamn it., WarGames, 1983



Re: VIMAGE UDP memory leak fix

2014-11-21 Thread Robert N. M. Watson

On 21 Nov 2014, at 11:02, Marko Zec z...@fer.hr wrote:

 Now that we've found ourselves in this discussion, I'm really
 becoming curious why exactly we need UMA_ZONE_NOFREE for network
 stack zones at all.   Admittedly, I always thought that the primary
 purpose of UMA_ZONE_NOFREE was to prevent uma_reclaim() from paging out
 _used_ zone pages, but reviewing the uma code reveals that this might
 not be the case, i.e. that NOFREE only prevents _unused_ pages from
 being freed by uma_reclaim().

 Moreover, all uma_zalloc() calls as far as I can see are flagged
 M_NOWAIT and are followed by checks for allocation failures, so that
 part seems to be covered.

 So, what's really the problem that UMA_ZONE_NOFREE flagging is supposed
 to solve these days? (You claim that we clearly need it for TCP -- why?)

UMA_ZONE_NOFREE tells UMA that it can't reclaim unused slabs for the zone to be
returned to the VM system for reuse elsewhere under memory pressure. UMA memory
isn't pageable, so there's no link to paging policy: although soft-TLB systems
might experience TLB miss exceptions on UMA-allocated kernel memory, you should
never experience a page fault against it (in the absence of a bug). Reclaim of
unused slabs can happen, for example, if the VM discovers it is low on free
pages, in which case it notifies various kernel subsystems that it is feeling a
bit cramped -- the same mechanism that, for example, triggers TCP to throw away
reassembly buffers that haven't yet been ACK'd (although they might have been
SACK'd). You might expect this to happen in situations where first a large load
spike occurs for a particular UMA type (e.g., a DDoS opens lots of TCP
connections), and then the objects are freed, leaving lots of socket/inpcb
slabs lying around unused, which the VM will eventually ask to have returned.
It is highly desirable for UMA_ZONE_NOFREE to be removed from zones wherever
possible so that memory can be returned under such circumstances; it is not a
good thing that the flag is present anywhere.

Subsystems pick up a dependence on UMA_ZONE_NOFREE if freed objects might be
referenced after free. My understanding is that this is pretty old behaviour
inherited from prior kernel memory allocators that didn't know how to return
memory to the VM. Given that property, it was safe to write code that might,
for the purposes of efficiency, assume that it could walk data structures of
the type with fewer synchronisation overheads -- or where synchronisation
isn't possible (e.g., for direct access to kernel memory via /dev/kmem). We
have been attempting to expunge those assumptions wherever possible -- these
days, netstat uses sysctls that acquire references to all live inpcbs,
keeping them valid while they are copied out (you can't hold low-level locks
during copyout(), as sysctl might encounter a paging event writing to user
memory). Convincing yourself that all such assumptions have been removed is a
moderate amount of work, and if you get it wrong, you get use-after-free
races that occur only in low-memory conditions, which are quite hard to
figure out (read: almost impossible).

Bjoern can say more about what motivated his specific comment -- I had hoped 
that we'd quietly lost dependence on NOFREE over the last decade and could 
finally garbage collect it, but perhaps not!

Robert


Re: VIMAGE UDP memory leak fix

2014-11-21 Thread Robert N. M. Watson

On 21 Nov 2014, at 11:02, Marko Zec z...@fer.hr wrote:

 I had convinced myself for UDP many years ago that it was OK to
 remove it.  People have touched the code since, however, so it's
 definitely worth re-checking.

 For TCP we clearly cannot do it (yet, and couldn't back then).   But
 TCP was one of the few cases I had left unfixed back then.
 I can't remember what the others were (there were maybe 3 or 4 of
 them); it could also be that some of the fixes were indeed committed
 back then.

 Now that we've found ourselves in this discussion, I'm really
 becoming curious why exactly we need UMA_ZONE_NOFREE for network
 stack zones at all.   Admittedly, I always thought that the primary
 purpose of UMA_ZONE_NOFREE was to prevent uma_reclaim() from paging out
 _used_ zone pages, but reviewing the uma code reveals that this might
 not be the case, i.e. that NOFREE only prevents _unused_ pages from
 being freed by uma_reclaim().

 Moreover, all uma_zalloc() calls as far as I can see are flagged
 M_NOWAIT and are followed by checks for allocation failures, so that
 part seems to be covered.

 So, what's really the problem that UMA_ZONE_NOFREE flagging is supposed
 to solve these days? (You claim that we clearly need it for TCP -- why?)

Bjoern and I chatted for the last twenty or so minutes about the code, and
believe that, as things stand, it is *not* safe to turn off UMA_ZONE_NOFREE
for TCP, due to a teardown race in TCP that has been known about and discussed
for several years but is some work to resolve, and we've not yet found the
time to do so. The XXXRW comments in tcp_timer.c are related to this. We're
pondering ways to fix it, but think this is not something that can be rushed.

Robert


VIMAGE UDP memory leak fix

2014-11-20 Thread Craig Rodrigues
Hi,

Can folks take a look at this?

https://reviews.freebsd.org/D1201

--
Craig


Re: VIMAGE UDP memory leak fix

2014-11-20 Thread Marko Zec
On Thu, 20 Nov 2014 10:02:46 -0800
Craig Rodrigues rodr...@freebsd.org wrote:

 Hi,
 
 Can folks take a look at this?
 
 https://reviews.freebsd.org/D1201

All UMA zones used in the network stack have been marked as
UMA_ZONE_NOFREE for ages, probably for a reason, so perhaps it might
not hurt to provide more insight why and how it suddenly became safe to
remove that flag?

One possible alternative is to de-virtualize V_udbinfo, V_udpcb_zone,
etc., which I have suggested a few times in the past, but those proposals
were rejected based on the expectation that one day our network stack
might benefit from the extra parallelism of decoupled UMA zones for each
VNET, though I'm not aware of further developments in that direction.

The primary (and in many cases only) reason I virtualized the network
stack zones was to simplify tracking of inter-VNET leaks.  As the bugs that
caused such leaks seem to have been cleaned up some 7 years ago or so,
I see no technical reason to maintain separate UMA zones for each VNET,
especially not those which are size-limited to the global maxsockets limit
anyhow.

Marko