Re: how to read dynamic data structures from the kernel (was Re: reading routing table)
Luigi Rizzo wrote:

> do you know if any of the *BSD kernels implements some good mechanism to access a dynamic kernel data structure (e.g. the routing tree/trie, or even a list or hash table) without the flaws of the two approaches i indicate above ?

Hahaha. I ran into an isomorphic problem with Net-SNMP at work last week. There's a need to export the BGP routing table via SNMP. Of course, doing this in our framework at work requires some IPC calls, which always require a select() (or WaitForMultipleObjects()) based continuation. Net-SNMP doesn't support continuations at the table-iterator level, so somehow we need to implement an iterator which can accommodate our blocking IPC mechanism.

[No, we don't use threads, and that would actually create more problems than it solves -- running single-threaded with continuations lets us run lock-free, and we rely on the OS's IPC primitives to serialize our code. Works just fine for us so far...]

So we would end up caching the whole primary key range in the SNMP sub-agent on a table OID access, a technique which would allow us to defer the IPC calls provided we walk the entire range of the iterator and cache the keys -- but even THAT is far too much data for the BGP table, which is a trie with ~250,000 entries. I hate SNMP GETNEXT.

Back to the FreeBSD kernel, though. If you look at in_mcast.c, particularly in p4 bms_netdev, this is what happens for the per-socket multicast source filters -- there is the linearization of an RB-tree for setsourcefilter(). This is fine for something with a limit of ~256 entries per socket (why RB for something so small? this is for space vs. time -- and also it has to merge into a larger filter list in the IGMPv3 paths), and the lock granularity is per-socket. However, it doesn't do for something as big as a BGP routing table.

C++ lends itself well to expressing these kinds of smart-pointer idioms, though.
I'm thinking perhaps we need the notion of a sysctl iterator, which allocates a token for walking a shared data structure, and is able to guarantee that the token maps to a valid pointer for the same entry until its 'advance pointer' operation is called.

Question is, who's going to pull the trigger?

cheers
BMS

P.S. I'm REALLY getting fed up with the lack of openness and transparency largely incumbent in doing work in p4. Come one, come all -- we shouldn't need accounts for folk to see and contribute to what's going on, and the stagnation is getting silly. FreeBSD development should not be a committer or chum-of-committer in-crowd.

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
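[Editor's sketch: one way the "sysctl iterator" token above could behave, modeled in plain userspace C. The names (iter_open, iter_advance) are invented for illustration, and a per-entry pin count stands in for whatever the kernel would really use to keep the entry from being reclaimed between calls.]

```c
/*
 * Hypothetical token-based iterator: the token pins exactly one entry
 * of a shared structure, and the entry stays valid until the caller's
 * 'advance pointer' operation moves the token forward.
 */
#include <stddef.h>

struct entry {
    int           key;
    int           refcnt;        /* pin count: >0 means must not free */
    struct entry *next;
};

struct iter_token {
    struct entry *cur;           /* pinned entry, valid until advance */
};

/* Pin the first entry and hand back a token. */
static struct iter_token iter_open(struct entry *head)
{
    struct iter_token t = { head };
    if (head != NULL)
        head->refcnt++;
    return t;
}

/* Release the current entry and pin its successor ("advance pointer"). */
static struct entry *iter_advance(struct iter_token *t)
{
    struct entry *next;

    if (t->cur == NULL)
        return NULL;
    next = t->cur->next;
    t->cur->refcnt--;            /* old entry may now be reclaimed */
    if (next != NULL)
        next->refcnt++;
    t->cur = next;
    return next;
}
```

The key property is that only the pinned entry is protected; everything else in the structure remains free to change between advances.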
Re: how to read dynamic data structures from the kernel (was Re: reading routing table)
On Tue, 2 Sep 2008, Luigi Rizzo wrote:

> The real problem is that these data structures are dynamic and potentially large, so the following approach (used e.g. in ipfw)
>
>     enter kernel;
>     get shared lock on the structure;
>     navigate through the structure and make a linearized copy;
>     unlock;
>     copyout the linearized copy;
>
> is extremely expensive and has the potential to block other activities for a long time.

Sockets, sysctl, kmem, etc., are all really just I/O mechanisms, with varying levels of abstraction, for pushing data, and all fundamentally suffer from the problem of a lack of a general export abstraction.

> What we'd need is some internal representation of the data structure that could give us individual entries of the data structure on each call, together with extra info (a pointer if we can guarantee that it doesn't get stale, something more if we cannot make the guarantee) to allow the navigation to occur.

I think there are necessarily implementation-specific details to all of these steps for any given kernel subsystem -- we have data structures, synchronization models, etc., that are all tuned to their common use requirements, and monitoring is very much an edge case. I don't think this is bad: this is an OS kernel, after all, but it does make things a bit more tricky. Even if we can't share code, sharing approaches across subsystems is a good idea.

For an example of what you have in mind, take a look at the sysctl monitoring for UNIX domain sockets. First, we allocate an array of pointers sized to the number of unpcb's we have. Then we walk the list, bumping the references and adding pointers to the array. Then we release the global locks, and proceed to lock, externalize, unlock, and copyout each individual entry, using a generation number to manage staleness. Finally, we walk the list dropping the refcounts and free the array. This avoids holding global locks for a long time, as well as the stale data issue.
It's unideal in other ways -- long lists, reference count complexity, etc., but as I mentioned, it is very much an edge case, and much of that mechanism (especially refcounts) is something we need anyway for any moderately complex kernel data structure.

Robert N M Watson
Computer Laboratory
University of Cambridge

> Accessing /dev/kmem and following pointers there probably has the risk that you cannot lock the kernel data structure while you navigate it, so you are likely to follow stale pointers. I believe this is a very old and common problem, so my question is: do you know if any of the *BSD kernels implements some good mechanism to access a dynamic kernel data structure (e.g. the routing tree/trie, or even a list or hash table) without the flaws of the two approaches i indicate above ?
>
> cheers
> luigi
>
> [original thread below just for reference, but i believe i made a fair summary above]
>
> On Tue, Sep 02, 2008 at 10:19:55AM +0100, Robert Watson wrote:
>> On Tue, 2 Sep 2008, Debarshi Ray wrote:
>>>> unfortunately netstat -rn uses /dev/kmem
>>>
>>> Yes. I also found that FreeBSD's route(8) implementation does not have an equivalent of 'netstat -r'. NetBSD and GNU/Linux implementations have such an option. Any reason for this? Is it because you did not want to muck with /dev/kmem in route(8) and wanted it to work with PF_ROUTE only? I have not yet gone through NetBSD's route(8) code though.
>>
>> Usually the "reason" for things like this is that no one has written the code to do otherwise :-). PF_ROUTE is probably not a good mechanism for any bulk data transfer due to the constraints of being a datagram socket, although doing it via an iterated dump rather than a simple dump operation would probably work. Sysctl is generally a better interface for monitoring for various reasons, although it also has limitations. Maintaining historic kmem support is important, since it is also the code used for interpreting core dumps, and we don't want to lose support for that.
>> Robert N M Watson
>> Computer Laboratory
>> University of Cambridge
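[Editor's sketch: the unpcb sysctl pattern Robert describes -- snapshot pointers and bump refcounts under the global lock, drop the lock, then export entries one at a time with a generation check -- rendered as self-contained userspace C. All names, and the flat array standing in for the pcb list, are placeholders rather than the real kernel code.]

```c
/*
 * Phase 1 (under the notional global lock): snapshot pointers and
 * take references so entries survive the unlock.
 * Phase 2 (lock dropped): walk the snapshot, exporting only entries
 * whose generation number shows they are unchanged, and drop each
 * reference as we go.
 */
#include <stdlib.h>

struct pcb {
    int id;
    int gen;        /* bumped whenever the entry is modified */
    int refcnt;
};

/* Phase 1: snapshot and reference every entry. */
static struct pcb **snapshot(struct pcb *tbl, size_t n)
{
    struct pcb **v = malloc(n * sizeof(*v));

    for (size_t i = 0; i < n; i++) {
        v[i] = &tbl[i];
        v[i]->refcnt++;          /* keep entry alive after unlock */
    }
    return v;
}

/* Phase 2: copy out fresh entries (out[] stands in for copyout()),
 * dropping each reference; returns the number exported. */
static size_t export_fresh(struct pcb **v, size_t n, int snap_gen,
    int *out)
{
    size_t k = 0;

    for (size_t i = 0; i < n; i++) {
        if (v[i]->gen <= snap_gen)   /* unchanged since snapshot */
            out[k++] = v[i]->id;
        v[i]->refcnt--;
    }
    return k;
}
```

An entry modified after the snapshot simply fails the generation check and is skipped, which is how stale data is filtered without holding the global lock for the whole export.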
Re: how to read dynamic data structures from the kernel (was Re: reading routing table)
On Tue, Sep 02, 2008 at 10:02:10PM +0100, Robert Watson wrote:
> On Tue, 2 Sep 2008, Luigi Rizzo wrote:
>
>> The real problem is that these data structures are dynamic and potentially large, so the following approach (used e.g. in ipfw)
>>
>>     enter kernel;
>>     get shared lock on the structure;
>>     navigate through the structure and make a linearized copy;
>>     unlock;
>>     copyout the linearized copy;
>>
>> is extremely expensive and has the potential to block other activities for a long time.
>
> Sockets, sysctl, kmem, etc., are all really just I/O mechanisms, with varying levels of abstraction, for pushing data, and all fundamentally suffer from the problem of a lack of a general export abstraction.

yes, this is why i said we should not bother about which one is used.

>> What we'd need is some internal representation of the data structure that could give us individual entries of the data structure on each call, together with extra info (a pointer if we can guarantee that it doesn't get stale, something more if we cannot make the guarantee) to allow the navigation to occur.
>
> I think there are necessarily implementation-specific details to all of these steps for any given kernel subsystem -- we have data structures, synchronization models, etc., that are all tuned to their common use requirements, and monitoring is very much an edge case. I don't think this is bad: this is an OS kernel, after all, but it does make things a bit more tricky. Even if we can't share code, sharing approaches across subsystems is a good idea.
>
> For an example of what you have in mind, take a look at the sysctl monitoring for UNIX domain sockets. First, we allocate an array of pointers sized to the number of unpcb's we have. Then we walk the list, bumping the references and adding pointers to the array.
> Then we release the global locks, and proceed to lock, externalize, unlock, and copyout each individual entry, using a generation number to manage staleness. Finally, we walk the list dropping the refcounts and free the array. This avoids holding global locks for a long time, as well as the stale data issue. It's unideal in other ways -- long lists, reference count complexity, etc., but as I mentioned, it is very much an edge case, and much of that mechanism (especially refcounts) is something we need anyway for any moderately complex kernel data structure.

what you describe above is more efficient but not that different from what i described. The thing is, i always forget that in many cases an iterator doesn't care about the order in which elements are generated, so you could use a solution like the one below, by just doing a tiny little bit of work while modifying the main data structure (this might well be a known solution, since it is so trivial...).

[i already emailed the following to BMS, so apologies for the duplicate]

if all you care about is iterating the whole data structure, without a particular order, you could manage an additional array of pointers to all the objects in the data structure (the array should be implemented as a sparse, resizable array, but that's a separate issue, and probably a relatively trivial one -- i am googling for it...). Navigation and iterators are simple:

+ when inserting a new element, append an entry to the array and make it point to the newly added record. Each entry gets a fresh sequence number, and one should make sure that sequence numbers are not recycled (64 bits should suffice?);

+ when deleting an element, logically remove the entry from the array;

+ the iterator returns a copy of the object, and its sequence number;

+ getnext returns the existing element following the 'current' one in the sparse array.
Complexity for most ops (data insertion, removal, lookup) would be O(1) plus whatever is needed to do housekeeping on the sparse array, and this depends on the usage of the main data structure and how much we worry about expensive 'getnext' ops. Sure, you need a read lock on the main struct while you look up the next element in the sparse array, but the nice thing is that you can release the lock at each step even if you have a poorly implemented sparse array.

Makes sense?

cheers
luigi
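[Editor's sketch: the shadow-array scheme Luigi outlines above, as minimal C. The "sparse, resizable array" is a plain fixed-size array here for brevity (no bounds checking, no compaction); names like shadow_insert and shadow_getnext are invented.]

```c
/*
 * A growable side array of (seqno, object) slots shadowing the main
 * structure.  Insertion appends with a fresh, never-recycled 64-bit
 * sequence number; deletion just NULLs the slot (logical removal);
 * getnext scans forward from the caller's last sequence number, so
 * the lock can be dropped between steps.
 */
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 64                 /* fixed capacity for the sketch */

struct slot {
    uint64_t seq;
    void    *obj;                 /* NULL == logically deleted */
};

struct shadow {
    struct slot s[NSLOTS];
    size_t      used;
    uint64_t    nextseq;          /* monotonic, never recycled */
};

static uint64_t shadow_insert(struct shadow *sh, void *obj)
{
    struct slot *sl = &sh->s[sh->used++];

    sl->seq = ++sh->nextseq;
    sl->obj = obj;
    return sl->seq;
}

static void shadow_delete(struct shadow *sh, uint64_t seq)
{
    for (size_t i = 0; i < sh->used; i++)
        if (sh->s[i].seq == seq)
            sh->s[i].obj = NULL;  /* logical removal only */
}

/* Return the first live object with seqno > cur; *seqp is set to its
 * sequence number, or 0 when the walk is done. */
static void *shadow_getnext(struct shadow *sh, uint64_t cur,
    uint64_t *seqp)
{
    for (size_t i = 0; i < sh->used; i++)
        if (sh->s[i].seq > cur && sh->s[i].obj != NULL) {
            *seqp = sh->s[i].seq;
            return sh->s[i].obj;
        }
    *seqp = 0;
    return NULL;
}
```

Because sequence numbers are never recycled, a caller holding seqno N can always resume the walk after dropping the lock: entries deleted in the meantime are simply skipped.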
Re: how to read dynamic data structures from the kernel (was Re: reading routing table)
Robert Watson wrote:
> On Tue, 2 Sep 2008, Luigi Rizzo wrote:
>> The real problem is that these data structures are dynamic and potentially large, so the following approach (used e.g. in ipfw)
>>
>>     enter kernel;
>>     get shared lock on the structure;
>>     navigate through the structure and make a linearized copy;
>>     unlock;
>>     copyout the linearized copy;
>>
>> is extremely expensive and has the potential to block other activities for a long time.
>
> Sockets, sysctl, kmem, etc., are all really just I/O mechanisms, with varying levels of abstraction, for pushing data, and all fundamentally suffer from the problem of a lack of a general export abstraction.
>
>> What we'd need is some internal representation of the data structure that could give us individual entries of the data structure on each call, together with extra info (a pointer if we can guarantee that it doesn't get stale, something more if we cannot make the guarantee) to allow the navigation to occur.

In some code I have seen (and some I have written) there are two levels of storage in some modules: one that contains the majority of the information, and one that contains "changes that occurred since the main container was locked". So, for example, the routing tables might be locked, and if a routing change is requested thereafter, it gets stored in a transactional form on the side structure. A routing lookup during the period that the structure is locked (if a read lock) simply goes ahead, and at completion checks if there is a better answer in the waiting list. A write request is stored as a transaction request on the waiting list.

Not saying it works for everything, but if we had a kernel written in a high enough level language, such methods could be broadly used.. oh well. Using reader-writer locking mitigates a lot of this..
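[Editor's sketch: the two-level storage idea above, in self-contained C. The "table" is an array indexed by key, the freeze flag stands in for the real lock, and all names (tbl_set, tbl_thaw, etc.) are invented for illustration.]

```c
/*
 * While the main table is frozen (e.g. a long-running export holds the
 * lock), writes are queued as transactions on a side list; lookups
 * consult the main table and then check the side list for a newer
 * answer; thawing replays the queued transactions.
 */
#include <stdbool.h>

#define NKEYS 8

struct pending {
    int  val;
    bool set;                     /* queued write for this key? */
};

struct table {
    int            primary[NKEYS];
    bool           frozen;        /* long-running reader "holds the lock" */
    struct pending side[NKEYS];
};

/* Writer: apply directly, or queue on the side list while frozen. */
static void tbl_set(struct table *t, int key, int val)
{
    if (t->frozen)
        t->side[key] = (struct pending){ val, true };
    else
        t->primary[key] = val;
}

/* Reader: main answer, overridden by any newer pending one. */
static int tbl_get(const struct table *t, int key)
{
    if (t->side[key].set)
        return t->side[key].val;
    return t->primary[key];
}

/* Unfreeze: replay queued transactions into the main table. */
static void tbl_thaw(struct table *t)
{
    for (int k = 0; k < NKEYS; k++)
        if (t->side[k].set) {
            t->primary[k] = t->side[k].val;
            t->side[k].set = false;
        }
    t->frozen = false;
}
```

The main table stays internally consistent for the duration of the freeze, while readers still see the freshest answer -- the trade-off is the extra lookup in the side structure on every read.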
> I think there are necessarily implementation-specific details to all of these steps for any given kernel subsystem -- we have data structures, synchronization models, etc., that are all tuned to their common use requirements, and monitoring is very much an edge case. I don't think this is bad: this is an OS kernel, after all, but it does make things a bit more tricky. Even if we can't share code, sharing approaches across subsystems is a good idea.
>
> For an example of what you have in mind, take a look at the sysctl monitoring for UNIX domain sockets. First, we allocate an array of pointers sized to the number of unpcb's we have. Then we walk the list, bumping the references and adding pointers to the array. Then we release the global locks, and proceed to lock, externalize, unlock, and copyout each individual entry, using a generation number to manage staleness. Finally, we walk the list dropping the refcounts and free the array. This avoids holding global locks for a long time, as well as the stale data issue. It's unideal in other ways -- long lists, reference count complexity, etc., but as I mentioned, it is very much an edge case, and much of that mechanism (especially refcounts) is something we need anyway for any moderately complex kernel data structure.

Refcounts are, unfortunately, a really bad thing for multiprocessors. Refcounts, if they are actually incremented now and then, usually end up out of cache for any given CPU, forcing lots of cache flushes and real reads from memory. There are some elaborate MP-refcount schemes we really should look at, but most require a lot of memory.
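[Editor's sketch: the thread doesn't say which "elaborate MP-refcount schemes" are meant, but one common memory-for-locality trade-off is a per-CPU counter array, each counter on its own cache line so increments never bounce a shared line. This is a pure userspace illustration with a fake curcpu argument; the padding size is an assumption, not a measured cache-line size.]

```c
/*
 * Per-CPU reference count: acquire/release touch only the caller
 * CPU's cache line; the true count is the sum across CPUs, which is
 * only computed on the slow path (e.g. deciding whether to free).
 * Individual per-CPU counters may go negative; only the sum matters.
 */
#define NCPU      4
#define CACHELINE 64

struct percpu_ref {
    struct {
        long count;
        char pad[CACHELINE - sizeof(long)];  /* avoid false sharing */
    } cpu[NCPU];
};

static void pref_acquire(struct percpu_ref *r, int curcpu)
{
    r->cpu[curcpu].count++;       /* CPU-local, no shared-line bounce */
}

static void pref_release(struct percpu_ref *r, int curcpu)
{
    r->cpu[curcpu].count--;       /* may differ from the acquiring CPU */
}

/* Slow path: sum across all CPUs to get the real reference count. */
static long pref_total(const struct percpu_ref *r)
{
    long sum = 0;

    for (int i = 0; i < NCPU; i++)
        sum += r->cpu[i].count;
    return sum;
}
```

This is exactly the "lots of memory" cost mentioned above: one padded counter per CPU per object, instead of a single shared integer.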