Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Chris Sturtivant wrote:
> Shailabh Nagar wrote:
>> So here's the sequence of pids being used/hashed etc. Please let
>> me know if my assumptions are correct ?
>>
>> 1. Same listener thread opens 2 sockets
>>
>> On sockfd1, does a bind() using
>>     sockaddr_nl.nl_pid = my_pid1
>> On sockfd2, does a bind() using
>>     sockaddr_nl.nl_pid = my_pid2
>>
>> (one of the my_pids could be its process pid but doesn't have to be)
>
> For CSA, we are proposing to use a single (multi-threaded) daemon that
> combines both the userland components for job and CSA that used to be
> in the kernel. In this case, the pid will be the same for two
> connections along with the cpu range. Does what you're saying here
> mean that we should choose distinct values for my_pid1 and my_pid2 to
> avoid the two sockets looking the same?

Yes, that is my understanding and also what's mentioned in the bind()
section in http://www.linuxjournal.com/article/7356 though I've yet to
try it out myself (will do so shortly after making the other suggested
changes to the basic patch).

--Shailabh

> I'm not too familiar with netlink, yet.
>
> Best regards,
>
> --Chris

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh Nagar wrote:
> So here's the sequence of pids being used/hashed etc. Please let me
> know if my assumptions are correct ?
>
> 1. Same listener thread opens 2 sockets
>
> On sockfd1, does a bind() using
>     sockaddr_nl.nl_pid = my_pid1
> On sockfd2, does a bind() using
>     sockaddr_nl.nl_pid = my_pid2
>
> (one of the my_pids could be its process pid but doesn't have to be)

For CSA, we are proposing to use a single (multi-threaded) daemon that
combines both the userland components for job and CSA that used to be
in the kernel. In this case, the pid will be the same for two
connections along with the cpu range. Does what you're saying here mean
that we should choose distinct values for my_pid1 and my_pid2 to avoid
the two sockets looking the same?

I'm not too familiar with netlink, yet.

Best regards,

--Chris

--
Chris Sturtivant, PhD, Linux System Software, SGI (650) 933-1703
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Jay Lan wrote:
> Shailabh Nagar wrote:
>> Yes. If no one registers to listen on a particular CPU, data from
>> tasks exiting on that cpu is not sent out at all.
>
> Shailabh also wrote:
>> During task exit, kernel goes through each registered listener (small
>> list) and decides which one needs to get this exit data and calls a
>> genetlink_unicast to each one that does need it.
>
> Are we eliminating multicast taskstats data at exit time?

Yes. Only unicasts to each listener now.

> A unicast exit data with cpumask will do for me, but I'd just like to
> be sure where we are.
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh Nagar wrote:
> Yes. If no one registers to listen on a particular CPU, data from
> tasks exiting on that cpu is not sent out at all.

Shailabh also wrote:
> During task exit, kernel goes through each registered listener (small
> list) and decides which one needs to get this exit data and calls a
> genetlink_unicast to each one that does need it.

Are we eliminating multicast taskstats data at exit time? A unicast
exit data with cpumask will do for me, but I'd just like to be sure
where we are.

Thanks,
- jay
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
jamal wrote:
> Shailabh,
>
> On Tue, 2006-04-07 at 12:37 -0400, Shailabh Nagar wrote:
> [..]
>
>> Here's a strawman for the problem we're trying to solve: get
>> notification of the close of a NETLINK_GENERIC socket that had
>> been used to register interest for some cpus within taskstats.
>>
>> From looking at the netlink code, the way to go seems to be
>>
>> - it maintains a pidhash of nl_pids that are currently
>> registered to listen to at least one cpu. It also stores the
>> cpumask used.
>>
>> - taskstats registers a notifier block within netlink_chain
>> and receives a callback on the NETLINK_URELEASE event, similar
>> to drivers/scsi/scsi_transport_iscsi.c: iscsi_rcv_nl_event()
>>
>> - the callback checks to see that the protocol is NETLINK_GENERIC
>> and that the nl_pid for the socket is in taskstats' pidhash. If so,
>> it does a cleanup using the stored cpumask and releases the nl_pid
>> from the pidhash.
>
> Sounds quite reasonable. I am beginning to wonder whether we should
> do the NETLINK_URELEASE in general for NETLINK_GENERIC

I'd initially thought that might be useful but since NETLINK_GENERIC is
only "virtually" multiplexing the sockfd amongst each of its users, I
don't know what benefits a generic notifier at the NETLINK_GENERIC
layer would bring (as opposed to each NETLINK_GENERIC user directly
registering its callback with netlink). Perhaps simplicity ?

>> We can even do away with the deregister command altogether and
>> simply rely on this autocleanup.
>
> I think you may still need the register if you are going to allow
> multiple sockets per listener process, no?

The register command, yes. But an explicit deregister, as opposed to
auto cleanup on fd close, may not be used all that much :-)

> The other question is how do you correlate pid -> fd?

For the notifier callback, I thought netlink_release will provide the
nl_pid corresponding to the fd being closed ? I can just do a search
for that nl_pid in the taskstats-private pidhash.
The nl_pid gets into the pidhash using the genl_info->pid field when
the listener issues the register command. Will that be correct ?

So here's the sequence of pids being used/hashed etc. Please let me
know if my assumptions are correct ?

1. Same listener thread opens 2 sockets

   On sockfd1, does a bind() using
       sockaddr_nl.nl_pid = my_pid1
   On sockfd2, does a bind() using
       sockaddr_nl.nl_pid = my_pid2

   (one of the my_pids could be its process pid but doesn't have to be)

2. Listener supplies cpumasks on each of the sockets through a register
   command sent on sockfd1. In the kernel, when the command is
   received, the genl_info->pid field contains my_pid1. my_pid1 is
   stored in a pidhash along with the corresponding cpumask. The
   cpumask is used to store my_pid1 in the per-cpu lists for each cpu
   in the mask.

3. When an exit event happens on one of those cpus in the mask, it is
   sent to this listener using genlmsg_unicast(, my_pid1)

4. When the listener closes sockfd1, netlink_release() gets called and
   that calls a taskstats notifier callback (say taskstats_cb) with

       struct netlink_notify n = {
           .protocol = NETLINK_GENERIC,
           .pid = my_pid1
       }

   and using the .pid within, taskstats_cb can do a lookup within its
   pidhash. If it's present, use the cpumask stored alongside to go
   clean up my_pid1 stored in the listener list of each cpu in the
   mask.

--Shailabh
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
pj wrote:
> writes the code gets to

Never mind that last incomplete post - I hit Send when I meant to hit
Cancel.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew wrote:
> OK, so we're passing in an ASCII string. Fair enough, I think. Paul
> would know better.

Not sure if I know better - just got stronger opinions. I like the
ASCII here - but this is one of those "he who writes the code gets to

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh wrote:
> Perhaps I should use the other ascii format for specifying cpumasks
> since it's more amenable to specifying an upper bound for the length
> of the ascii string and is more compact ?

Eh - basically - I don't have a strong opinion either way.

I have a slight esthetic preference toward using the list of ranges
format from shell scripts and shell prompts, and using the 32-bit hex
words from C code:

    17-26,44-47    # shell - list of ranges
    f000,07fe      # C - 32-bit hex words

Since the primary interface you are working with is C code, that would
mean I'd slightly prefer the 32-bit hex word variant.

From what I've seen, neither of the reasons you gave for preferring the
32-bit hex word format is persuasive (even though they both lead to the
same conclusion as I preferred ;):

Which is more compact depends on the particular bit pattern you need to
represent. See for example the examples above.

The lack of a perfect upper bound on the list of ranges format is a
theoretical problem that I have never seen in practice. Only
pathological constructs exceed six ascii characters per set bit.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh,

On Tue, 2006-04-07 at 12:37 -0400, Shailabh Nagar wrote:
[..]
> Here's a strawman for the problem we're trying to solve: get
> notification of the close of a NETLINK_GENERIC socket that had
> been used to register interest for some cpus within taskstats.
>
> From looking at the netlink code, the way to go seems to be
>
> - it maintains a pidhash of nl_pids that are currently
> registered to listen to at least one cpu. It also stores the
> cpumask used.
>
> - taskstats registers a notifier block within netlink_chain
> and receives a callback on the NETLINK_URELEASE event, similar
> to drivers/scsi/scsi_transport_iscsi.c: iscsi_rcv_nl_event()
>
> - the callback checks to see that the protocol is NETLINK_GENERIC
> and that the nl_pid for the socket is in taskstats' pidhash. If so,
> it does a cleanup using the stored cpumask and releases the nl_pid
> from the pidhash.

Sounds quite reasonable. I am beginning to wonder whether we should do
the NETLINK_URELEASE in general for NETLINK_GENERIC

> We can even do away with the deregister command altogether and
> simply rely on this autocleanup.

I think you may still need the register if you are going to allow
multiple sockets per listener process, no?

The other question is how do you correlate pid -> fd?

cheers,
jamal
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh Nagar wrote:

jamal wrote:
On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote:
On Mon, 03 Jul 2006 20:54:37 -0400 Shailabh Nagar <[EMAIL PROTECTED]>
wrote:

What happens when a listener exits without doing deregistration (or if
the listener attempts to register another cpumask while a current
registration is still active). ( Jamal, your thoughts on this problem
would be appreciated)

Problem is that we have a listener task which has "registered" with
taskstats and caused its pid to be stored in various per-cpu lists of
listeners. Later, when some other task exits on a given cpu, its exit
data is sent using genlmsg_unicast on each pid present on that cpu's
list. If the listener exits without doing a "deregister", its pid
continues to be kept around, obviously not a good thing. So we need
some way of detecting the situation (task is no longer listening on
these cpus' events) that is efficient.

Also need to address the case where the listener has closed off his
file descriptor but continues to run. So hooking into listener's
exit() isn't appropriate - the teardown is associated with the
lifetime of the fd, not of the process. If we do that, exit() gets
handled for free.

If you are always going to send unicast messages, then -ECONNREFUSED
will tell you the listener has closed their fd - this doesn't mean it
has exited.

That's good. So we have at least one way of detecting the "closed fd
without deregistering" within taskstats itself.

Besides that, one process could open several sockets. I know that
would not be the app you would write - but it doesn't stop other
people from doing it.

As far as the API is concerned, even a taskstats listener is not being
prevented from opening multiple sockets. As Andrew also pointed out,
everything needs to be done per-socket.

I think I may not follow what you are doing - for some reason I
thought you may have many listeners in user space and these messages
get multicast to them?

That was the design earlier.
In the past week, the design has changed to one where there are still
many listeners in user space but messages get unicast to each of them.
Earlier, listeners would get messages generated on task exit from
every cpu; now they get it only from cpus for which they have
explicitly registered interest (via a cpumask passed in through
another genetlink command).

Does the user space program somehow communicate its pid to the kernel?

Yes. When the listener registers interest in a set of cpus, as
described above, its (genl_info->pid) is being stored in the per-cpu
list of listeners for those cpus. When a task exits on one of those
cpus, the exit data is only sent via genetlink_unicast to those pids
(really, nl_pids) who are on that cpu's listener list.

Now that I think more about it, netlink is really maintaining a
pidhash of nl_pids, not process pids, right ? So if one userapp were
to open multiple sockets using the NETLINK_GENERIC protocol
(regardless of how many of those are for taskstats), each of them
would have to use a different nl_pid. Hence, it would be valid for the
taskstats layer to use netlink_lookup() at any time to see if the
corresponding socket were closed ?

Here's a strawman for the problem we're trying to solve: get
notification of the close of a NETLINK_GENERIC socket that had been
used to register interest for some cpus within taskstats.

From looking at the netlink code, the way to go seems to be

- it maintains a pidhash of nl_pids that are currently registered to
  listen to at least one cpu. It also stores the cpumask used.

- taskstats registers a notifier block within netlink_chain and
  receives a callback on the NETLINK_URELEASE event, similar to
  drivers/scsi/scsi_transport_iscsi.c: iscsi_rcv_nl_event()

- the callback checks to see that the protocol is NETLINK_GENERIC and
  that the nl_pid for the socket is in taskstats' pidhash. If so, it
  does a cleanup using the stored cpumask and releases the nl_pid from
  the pidhash.
We can even do away with the deregister command altogether and simply
rely on this autocleanup.

--Shailabh
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
jamal wrote:
On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote:
On Mon, 03 Jul 2006 20:54:37 -0400 Shailabh Nagar <[EMAIL PROTECTED]>
wrote:

What happens when a listener exits without doing deregistration (or if
the listener attempts to register another cpumask while a current
registration is still active). ( Jamal, your thoughts on this problem
would be appreciated)

Problem is that we have a listener task which has "registered" with
taskstats and caused its pid to be stored in various per-cpu lists of
listeners. Later, when some other task exits on a given cpu, its exit
data is sent using genlmsg_unicast on each pid present on that cpu's
list. If the listener exits without doing a "deregister", its pid
continues to be kept around, obviously not a good thing. So we need
some way of detecting the situation (task is no longer listening on
these cpus' events) that is efficient.

Also need to address the case where the listener has closed off his
file descriptor but continues to run. So hooking into listener's
exit() isn't appropriate - the teardown is associated with the
lifetime of the fd, not of the process. If we do that, exit() gets
handled for free.

If you are always going to send unicast messages, then -ECONNREFUSED
will tell you the listener has closed their fd - this doesn't mean it
has exited.

That's good. So we have at least one way of detecting the "closed fd
without deregistering" within taskstats itself.

Besides that, one process could open several sockets. I know that
would not be the app you would write - but it doesn't stop other
people from doing it.

As far as the API is concerned, even a taskstats listener is not being
prevented from opening multiple sockets. As Andrew also pointed out,
everything needs to be done per-socket.

I think I may not follow what you are doing - for some reason I
thought you may have many listeners in user space and these messages
get multicast to them?

That was the design earlier.
In the past week, the design has changed to one where there are still
many listeners in user space but messages get unicast to each of them.
Earlier, listeners would get messages generated on task exit from
every cpu; now they get it only from cpus for which they have
explicitly registered interest (via a cpumask passed in through
another genetlink command).

Does the user space program somehow communicate its pid to the kernel?

Yes. When the listener registers interest in a set of cpus, as
described above, its (genl_info->pid) is being stored in the per-cpu
list of listeners for those cpus. When a task exits on one of those
cpus, the exit data is only sent via genetlink_unicast to those pids
(really, nl_pids) who are on that cpu's listener list.

Now that I think more about it, netlink is really maintaining a
pidhash of nl_pids, not process pids, right ? So if one userapp were
to open multiple sockets using the NETLINK_GENERIC protocol
(regardless of how many of those are for taskstats), each of them
would have to use a different nl_pid. Hence, it would be valid for the
taskstats layer to use netlink_lookup() at any time to see if the
corresponding socket were closed ?

--Shailabh
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote:
> On Mon, 03 Jul 2006 20:54:37 -0400
> Shailabh Nagar <[EMAIL PROTECTED]> wrote:
>
>>> What happens when a listener exits without doing deregistration
>>> (or if the listener attempts to register another cpumask while a
>>> current registration is still active).
>>
>> ( Jamal, your thoughts on this problem would be appreciated)
>>
>> Problem is that we have a listener task which has "registered" with
>> taskstats and caused its pid to be stored in various per-cpu lists
>> of listeners. Later, when some other task exits on a given cpu, its
>> exit data is sent using genlmsg_unicast on each pid present on that
>> cpu's list.
>>
>> If the listener exits without doing a "deregister", its pid
>> continues to be kept around, obviously not a good thing. So we need
>> some way of detecting the situation (task is no longer listening on
>> these cpus' events) that is efficient.
>
> Also need to address the case where the listener has closed off his
> file descriptor but continues to run.
>
> So hooking into listener's exit() isn't appropriate - the teardown is
> associated with the lifetime of the fd, not of the process. If we do
> that, exit() gets handled for free.

If you are always going to send unicast messages, then -ECONNREFUSED
will tell you the listener has closed their fd - this doesn't mean it
has exited. Besides that, one process could open several sockets. I
know that would not be the app you would write - but it doesn't stop
other people from doing it.

I think I may not follow what you are doing - for some reason I
thought you may have many listeners in user space and these messages
get multicast to them? Does the user space program somehow communicate
its pid to the kernel?

cheers,
jamal
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Mon, 03 Jul 2006 20:54:37 -0400
Shailabh Nagar <[EMAIL PROTECTED]> wrote:

>> What happens when a listener exits without doing deregistration
>> (or if the listener attempts to register another cpumask while a
>> current registration is still active).
>
> ( Jamal, your thoughts on this problem would be appreciated)
>
> Problem is that we have a listener task which has "registered" with
> taskstats and caused its pid to be stored in various per-cpu lists of
> listeners. Later, when some other task exits on a given cpu, its exit
> data is sent using genlmsg_unicast on each pid present on that cpu's
> list.
>
> If the listener exits without doing a "deregister", its pid continues
> to be kept around, obviously not a good thing. So we need some way of
> detecting the situation (task is no longer listening on these cpus'
> events) that is efficient.

Also need to address the case where the listener has closed off his
file descriptor but continues to run.

So hooking into listener's exit() isn't appropriate - the teardown is
associated with the lifetime of the fd, not of the process. If we do
that, exit() gets handled for free.
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh Nagar wrote:

Andrew Morton wrote:
On Fri, 30 Jun 2006 23:37:10 -0400 Shailabh Nagar <[EMAIL PROTECTED]>
wrote:

Set aside the implementation details and ask "what is a good design"?
A kernel-wide constant, whether determined at build-time or by a /proc
poke isn't a nice design. Can we permit userspace to send in a netlink
message describing a cpumask? That's back-compatible.

Yes, that should be doable. And passing in a cpumask is much better
since we no longer have to maintain mappings. So the strawman is:

Listener bind()s to genetlink using its real pid. Sends a separate
"registration" message with the cpumask to listen to. Kernel stores
(real) pid and cpumask. During task exit, kernel goes through each
registered listener (small list) and decides which one needs to get
this exit data and calls a genetlink_unicast to each one that does
need it. If the number of listeners is small, the lookups should be
swift enough. If it grows large, we can consider a fancier lookup (but
there I go again, delving into implementation too early :-)

We'll need a map. 1024 CPUs, 1024 listeners, 1000 exits/sec/CPU and
we're up to a million operations per second per CPU. Meltdown. But
it's a pretty simple map. A per-cpu array of pointers to the head of a
linked list. One lock for each CPU's list.

Here's a patch that implements the above ideas. A listener registers
interest by specifying a cpumask in the cpulist format (comma
separated ranges of cpus). The listener's pid is entered into per-cpu
lists for those cpus and exit events from those cpus go to the
listeners using netlink unicasts. Please comment.

Andrew, this is not being proposed for inclusion yet since there is at
least one more issue that needs to be resolved:

What happens when a listener exits without doing deregistration (or if
the listener attempts to register another cpumask while a current
registration is still active).
( Jamal, your thoughts on this problem would be appreciated)

Problem is that we have a listener task which has "registered" with
taskstats and caused its pid to be stored in various per-cpu lists of
listeners. Later, when some other task exits on a given cpu, its exit
data is sent using genlmsg_unicast on each pid present on that cpu's
list. If the listener exits without doing a "deregister", its pid
continues to be kept around, obviously not a good thing. So we need
some way of detecting the situation (task is no longer listening on
these cpus' events) that is efficient.

Two solutions come to mind:

1. During the exit of every task check to see if it is already
   "registered" with taskstats. If so, do a cleanup of its pid on the
   various per-cpu lists.

2. Before doing a genlmsg_unicast to a pid on one of the per-cpu lists
   (or if genlmsg_unicast fails with a -ECONNREFUSED, a result of
   netlink_lookup failing for that pid), just delete it from that
   cpu's list and continue.

1 is more desirable because it's the right place to catch this and it
happens relatively rarely (few listener exits compared to all exits).
However, how can we check whether a task/pid has registered with
taskstats earlier ? Again, two possibilities:

- Maintain a list of registered listeners within taskstats and check
  that.

- Try to leverage netlink's nl_pid_hash, which maintains the same kind
  of info for each protocol. Thus a netlink_lookup of the pid would
  save a lot of work. However, the netlink layer's hashtable appears
  to be for the entire NETLINK_GENERIC protocol and not just for the
  taskstats client of NETLINK_GENERIC. So even if a task has
  deregistered with taskstats, as long as it has some other
  NETLINK_GENERIC socket open, it will still show up as "connected" as
  far as netlink is concerned.

Jamal - is my interpretation correct ? Do I need to essentially
replicate the pidhash at the taskstats layer ?
Thoughts on whether there's any way genetlink can provide support for
this, or whether it's desirable etc. (we appear to be the second user
of genetlink - this may not be a common need going forward).

1 has the disadvantage that if such a situation is detected, one has
to iterate over all cpus in the system, deleting that pid from any
per-cpu list it happens to be in. One could store the cpumask that the
listener originally used to optimize this search - the usual tradeoff
of storage vs. time.

2 avoids the problem just mentioned since it delegates the task of
cleanup to each cpu, at the cost of incurring an extra check for each
listener for each exit on that cpu. By storing the task_struct instead
of the pid in the per-cpu lists, the check can be made quite cheap.

But one problem with 2 is the issue of recycled task_structs and pids.
Since the stale task on the per-cpu listener list could have exited a
while back, it's possible it's alive at the time of the check and has
even registered with a different interest list ! So it'll receive
events it didn't register for. I guess this again
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Mon, 03 Jul 2006 20:13:36 -0400 Shailabh Nagar <[EMAIL PROTECTED]>
wrote:

>>> +	if (!s)
>>> +		return -ENOMEM;
>>> +	s->pid = pid;
>>> +	INIT_LIST_HEAD(&s->list);
>>> +
>>> +	down_write(sem);
>>> +	list_add(&s->list, head);
>>> +	up_write(sem);
>>> +
>>> +	if (cpu == mycpu)
>>> +		preempt_enable();
>>
>> Actually, I don't understand the tricks which are going on with the
>> local CPU here. What's it all for?
>
> I was wanting to do a get_cpu_var for listener_list & sem for the
> current cpu and per_cpu otherwise (since that's what I thought was
> the recommendation for accessing the local cpu's variable). Perhaps
> the preempt_disable is uncalled for ?

Well, we have a problem. You want to grab this CPU's list, and then
lock a semaphore. But taking a semaphore is a sleeping operation.

Fortunately, there's really no need to stay on-CPU at all. When
userspace is setting or clearing entries in the map, userspace _told_
us which CPU to manipulate, so this code can be running on any CPU at
all. So just go grab the Nth entry in the array and acquire the lock.

And when the time comes to send some statistics, just use
raw_smp_processor_id() and don't use preempt_disable() at all. If we
end up hopping over to another CPU, well, at least we tried. All we
can do here is to run raw_smp_processor_id() as early as possible to
reduce the possibility that we'll get a different CPU from the one
which this task really exited on.

IOW: in all cases we were provided with explicit CPU numbers from
other sources. So no preemption disabling is required.
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew Morton wrote:
On Mon, 03 Jul 2006 17:11:59 -0400 Shailabh Nagar <[EMAIL PROTECTED]>
wrote:

 static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
 {
 	*ptidstats = NULL;
-	if (taskstats_has_listeners())
+	if (!list_empty(&get_cpu_var(listener_list)))
 		*ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
+	put_cpu_var(listener_list);
 }

It's time to uninline this function..

 static inline void taskstats_exit_free(struct taskstats *tidstats)

Index: linux-2.6.17-mm3equiv/kernel/taskstats.c
===
--- linux-2.6.17-mm3equiv.orig/kernel/taskstats.c	2006-06-30 23:38:39.0 -0400
+++ linux-2.6.17-mm3equiv/kernel/taskstats.c	2006-07-02 00:16:18.0 -0400
@@ -19,6 +19,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
@@ -26,6 +28,9 @@ static DEFINE_PER_CPU(__u32, taskstats_s
 static int family_registered = 0;
 kmem_cache_t *taskstats_cache;
+DEFINE_PER_CPU(struct list_head, listener_list);
+static DEFINE_PER_CPU(struct rw_semaphore, listener_list_sem);

Which will permit listener_list to become static - it wasn't a good
name for a global anyway. I suggest you implement a new

struct whatever {
	struct rw_semaphore sem;
	struct list_head list;
};

Ok. The listener_list was a global to allow taskstats_exit_alloc to
access it, but this is better.

static DEFINE_PER_CPU(struct whatever, listener_array);

 static int prepare_reply(struct genl_info *info, u8 cmd,
 		struct sk_buff **skbp, void **replyp, size_t size)
 {
@@ -77,6 +92,8 @@ static int prepare_reply(struct genl_inf
 static int send_reply(struct sk_buff *skb, pid_t pid, int event)
 {
 	struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+	struct rw_semaphore *sem;
+	struct list_head *p, *head;
 	void *reply;
 	int rc;
@@ -88,9 +105,30 @@ static int send_reply(struct sk_buff *sk
 		return rc;
 	}
-	if (event == TASKSTATS_MSG_MULTICAST)
-		return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
-	return genlmsg_unicast(skb, pid);
+	if (event == TASKSTATS_MSG_UNICAST)
+		return genlmsg_unicast(skb, pid);
+
+	/*
+	 * Taskstats multicast is unicasts to listeners who have registered
+	 * interest in this cpu
+	 */
+	sem = &get_cpu_var(listener_list_sem);
+	head = &get_cpu_var(listener_list);

This has a double preempt_disable(), but the above will fix that.

+	down_read(sem);
+	list_for_each(p, head) {
+		int ret;
+		struct listener *s = list_entry(p, struct listener, list);
+		ret = genlmsg_unicast(skb, s->pid);
+		if (ret)
+			rc = ret;
+	}
+	up_read(sem);
+
+	put_cpu_var(listener_list);
+	put_cpu_var(listener_list_sem);
+
+	return rc;
 }

 static int fill_pid(pid_t pid, struct task_struct *pidtsk,
@@ -201,8 +239,73 @@ ret:
 	return;
 }

+static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)
+{
+	struct listener *s;
+	unsigned int cpu, mycpu;
+	cpumask_t mask;
+	struct rw_semaphore *sem;
+	struct list_head *head, *p;
-static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+
+	memcpy(&mask, maskp, sizeof(cpumask_t));
+	if (cpus_empty(mask))
+		return -EINVAL;
+
+	mycpu = get_cpu();
+	put_cpu();

This is effectively raw_smp_processor_id(). And after the put_cpu(),
`mycpu' is meaningless. Hmm.

+	if (isadd == REGISTER) {
+		for_each_cpu_mask(cpu, mask) {
+			if (!cpu_possible(cpu))
+				continue;
+			if (cpu == mycpu)
+				preempt_disable();
+
+			sem = &per_cpu(listener_list_sem, cpu);
+			head = &per_cpu(listener_list, cpu);
+
+			s = kmalloc(sizeof(struct listener), GFP_KERNEL);

Cannot do GFP_KERNEL inside preempt_disable(). There's no easy
solution to this problem. GFP_ATOMIC is not a good fix at all.

One approach would be to run lock_cpu_hotplug(), then allocate (with
GFP_KERNEL) all the memory which will be needed within the locked
region, then take the lock, then use that preallocated memory.

You should use kmalloc_node() here, to ensure that the memory on each
CPU's list resides with that CPU's local memory (not _this_ CPU's
local memory).

Ok.

+			if (!s)
+				return -ENOMEM;
+			s->pid = pid;
+			INIT_LIST_HEAD(&s->list);
+
+			down_write(sem);
+			list_add(&s->list, head);
+			up_write(sem);
+
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Paul Jackson wrote: Shailabh wrote: I don't know if there are buffer overflow issues in passing a string I don't know if this comment applies to "the standard netlink way of passing it up using NLA_STRING", but the way I deal with buffer length issues in the cpuset code is to insist that the user code express the list in no more than 100 + 6 * NR_CPUS bytes: From kernel/cpuset.c: /* Crude upper limit on largest legitimate cpulist user might write. */ if (nbytes > 100 + 6 * NR_CPUS) return -E2BIG; This lets the user specify the buffer size passed in, but prevents them from trying a denial of service attack on the kernel by trying to pass in a huge buffer. If the user can't figure out how to write the desired cpulist in that size, then tough toenails. Paul, Perhaps I should use the other ascii format for specifying cpumasks since it's more amenable to specifying an upper bound for the length of the ascii string and is more compact? That format (the one used in lib/bitmap.c:bitmap_parse) is comma separated chunks of hex digits with each chunk specifying 32 bits of the desired cpumask. So ((NR_CPUS + 31) / 32) * 9 bytes (8 hex characters for each 32 cpus, plus a comma separator or null terminator after each chunk) would be an upper bound that would accommodate all the cpus for sure. Thoughts? --Shailabh - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Mon, 03 Jul 2006 17:11:59 -0400 Shailabh Nagar <[EMAIL PROTECTED]> wrote: > >>So the strawman is: > >>Listener bind()s to genetlink using its real pid. > >>Sends a separate "registration" message with cpumask to listen to. > >>Kernel stores (real) pid and cpumask. > >>During task exit, kernel goes through each registered listener (small > >>list) and decides which > >>one needs to get this exit data and calls a genetlink_unicast to each > >>one that does need it. > >> > >>If number of listeners is small, the lookups should be swift enough. If > >>it grows large, we > >>can consider a fancier lookup (but there I go again, delving into > >>implementation too early :-) > >> > >> > > > >We'll need a map. > > > >1024 CPUs, 1024 listeners, 1000 exits/sec/CPU and we're up to a million > >operations per second per CPU. Meltdown. > > > >But it's a pretty simple map. A per-cpu array of pointers to the head of a > >linked list. One lock for each CPU's list. > > > > > Here's a patch that implements the above ideas. > > A listener register's interest by specifying a cpumask in the > cpulist format (comma separated ranges of cpus). The listener's pid > is entered into per-cpu lists for those cpus and exit events from those > cpus go to the listeners using netlink unicasts. > > ... > > On systems with a large number of cpus, with even a modest rate of > tasks exiting per cpu, the volume of taskstats data sent on thread exit > can overflow a userspace listener's buffers. > > One approach to avoiding overflow is to allow listeners to get data for > a limited and specific set of cpus. By scaling the number of listeners > and/or the cpus they monitor, userspace can handle the statistical data > overload more gracefully. > > In this patch, each listener registers to listen to a specific set of > cpus by specifying a cpumask. The interest is recorded per-cpu. When > a task exits on a cpu, its taskstats data is unicast to each listener > interested in that cpu. 
I think the approach is sane. The implementation needs work, as you say. > +++ linux-2.6.17-mm3equiv/include/linux/taskstats_kern.h 2006-07-01 > 23:53:01.0 -0400 > @@ -19,20 +19,14 @@ enum { > #ifdef CONFIG_TASKSTATS > extern kmem_cache_t *taskstats_cache; > extern struct mutex taskstats_exit_mutex; > - > -static inline int taskstats_has_listeners(void) > -{ > - if (!genl_sock) > - return 0; > - return netlink_has_listeners(genl_sock, TASKSTATS_LISTEN_GROUP); > -} > - > +DECLARE_PER_CPU(struct list_head, listener_list); > > static inline void taskstats_exit_alloc(struct taskstats **ptidstats) > { > *ptidstats = NULL; > - if (taskstats_has_listeners()) > + if (!list_empty(&get_cpu_var(listener_list))) > *ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL); > + put_cpu_var(listener_list); > } It's time to uninline this function.. > static inline void taskstats_exit_free(struct taskstats *tidstats) > Index: linux-2.6.17-mm3equiv/kernel/taskstats.c > === > --- linux-2.6.17-mm3equiv.orig/kernel/taskstats.c 2006-06-30 > 23:38:39.0 -0400 > +++ linux-2.6.17-mm3equiv/kernel/taskstats.c 2006-07-02 00:16:18.0 > -0400 > @@ -19,6 +19,8 @@ > #include > #include > #include > +#include > +#include > #include > #include > > @@ -26,6 +28,9 @@ static DEFINE_PER_CPU(__u32, taskstats_s > static int family_registered = 0; > kmem_cache_t *taskstats_cache; > > +DEFINE_PER_CPU(struct list_head, listener_list); > +static DEFINE_PER_CPU(struct rw_semaphore, listener_list_sem); Which will permit listener_list to become static - it wasn't a good name for a global anyway.
I suggest you implement a new struct whatever { struct rw_semaphore sem; struct list_head list; }; static DEFINE_PER_CPU(struct whatever, listener_array); > static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff > **skbp, > void **replyp, size_t size) > { > @@ -77,6 +92,8 @@ static int prepare_reply(struct genl_inf > static int send_reply(struct sk_buff *skb, pid_t pid, int event) > { > struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data); > + struct rw_semaphore *sem; > + struct list_head *p, *head; > void *reply; > int rc; > > @@ -88,9 +105,30 @@ static int send_reply(struct sk_buff *sk > return rc; > } > > - if (event == TASKSTATS_MSG_MULTICAST) > - return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP); > - return genlmsg_unicast(skb, pid); > + if (event == TASKSTATS_MSG_UNICAST) > + return genlmsg_unicast(skb, pid); > + > + /* > + * Taskstats multicast is unicasts to listeners who have registered > + * interest in this cpu > + */ > + sem = &get_cpu_var(listener_list_sem); > + head = &get_cpu_var(listener_list); This has a double preempt_disable(), but the above will fix that.
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew Morton wrote: On Fri, 30 Jun 2006 23:37:10 -0400 Shailabh Nagar <[EMAIL PROTECTED]> wrote: Set aside the implementation details and ask "what is a good design"? A kernel-wide constant, whether determined at build-time or by a /proc poke isn't a nice design. Can we permit userspace to send in a netlink message describing a cpumask? That's back-compatible. Yes, that should be doable. And passing in a cpumask is much better since we no longer have to maintain mappings. So the strawman is: Listener bind()s to genetlink using its real pid. Sends a separate "registration" message with cpumask to listen to. Kernel stores (real) pid and cpumask. During task exit, kernel goes through each registered listener (small list) and decides which one needs to get this exit data and calls a genetlink_unicast to each one that does need it. If number of listeners is small, the lookups should be swift enough. If it grows large, we can consider a fancier lookup (but there I go again, delving into implementation too early :-) We'll need a map. 1024 CPUs, 1024 listeners, 1000 exits/sec/CPU and we're up to a million operations per second per CPU. Meltdown. But it's a pretty simple map. A per-cpu array of pointers to the head of a linked list. One lock for each CPU's list. Here's a patch that implements the above ideas. A listener registers interest by specifying a cpumask in the cpulist format (comma separated ranges of cpus). The listener's pid is entered into per-cpu lists for those cpus and exit events from those cpus go to the listeners using netlink unicasts. Please comment. Andrew, this is not being proposed for inclusion yet since there is at least one more issue that needs to be resolved: What happens when a listener exits without doing deregistration (or if the listener attempts to register another cpumask while a current registration is still active). More on that in a separate thread.
--Shailabh On systems with a large number of cpus, with even a modest rate of tasks exiting per cpu, the volume of taskstats data sent on thread exit can overflow a userspace listener's buffers. One approach to avoiding overflow is to allow listeners to get data for a limited and specific set of cpus. By scaling the number of listeners and/or the cpus they monitor, userspace can handle the statistical data overload more gracefully. In this patch, each listener registers to listen to a specific set of cpus by specifying a cpumask. The interest is recorded per-cpu. When a task exits on a cpu, its taskstats data is unicast to each listener interested in that cpu. Thanks to Andrew Morton for pointing out the various scalability and general concerns of previous attempts and for suggesting this design. Signed-Off-By: Shailabh Nagar <[EMAIL PROTECTED]> include/linux/taskstats.h |4 - include/linux/taskstats_kern.h | 12 --- kernel/taskstats.c | 136 +++-- 3 files changed, 135 insertions(+), 17 deletions(-) Index: linux-2.6.17-mm3equiv/include/linux/taskstats.h === --- linux-2.6.17-mm3equiv.orig/include/linux/taskstats.h2006-06-30 19:03:40.0 -0400 +++ linux-2.6.17-mm3equiv/include/linux/taskstats.h 2006-07-01 23:53:01.0 -0400 @@ -87,8 +87,6 @@ struct taskstats { }; -#define TASKSTATS_LISTEN_GROUP 0x1 - /* * Commands sent from userspace * Not versioned. 
New commands should only be inserted at the enum's end @@ -120,6 +118,8 @@ enum { TASKSTATS_CMD_ATTR_UNSPEC = 0, TASKSTATS_CMD_ATTR_PID, TASKSTATS_CMD_ATTR_TGID, + TASKSTATS_CMD_ATTR_REGISTER_CPUMASK, + TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK, __TASKSTATS_CMD_ATTR_MAX, }; Index: linux-2.6.17-mm3equiv/include/linux/taskstats_kern.h === --- linux-2.6.17-mm3equiv.orig/include/linux/taskstats_kern.h 2006-06-30 11:57:14.0 -0400 +++ linux-2.6.17-mm3equiv/include/linux/taskstats_kern.h2006-07-01 23:53:01.0 -0400 @@ -19,20 +19,14 @@ enum { #ifdef CONFIG_TASKSTATS extern kmem_cache_t *taskstats_cache; extern struct mutex taskstats_exit_mutex; - -static inline int taskstats_has_listeners(void) -{ - if (!genl_sock) - return 0; - return netlink_has_listeners(genl_sock, TASKSTATS_LISTEN_GROUP); -} - +DECLARE_PER_CPU(struct list_head, listener_list); static inline void taskstats_exit_alloc(struct taskstats **ptidstats) { *ptidstats = NULL; - if (taskstats_has_listeners()) + if (!list_empty(&get_cpu_var(listener_list))) *ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL); + put_cpu_var(listener_list); } static inline void taskstats_exit_free(struct taskstats *tidstats) Index: linux-2.6.17-mm3equiv/kernel/taskstats.c ==
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh wrote: > I don't know if there are buffer overflow > issues in passing a string I don't know if this comment applies to "the standard netlink way of passing it up using NLA_STRING", but the way I deal with buffer length issues in the cpuset code is to insist that the user code express the list in no more than 100 + 6 * NR_CPUS bytes: From kernel/cpuset.c: /* Crude upper limit on largest legitimate cpulist user might write. */ if (nbytes > 100 + 6 * NR_CPUS) return -E2BIG; This lets the user specify the buffer size passed in, but prevents them from trying a denial of service attack on the kernel by trying to pass in a huge buffer. If the user can't figure out how to write the desired cpulist in that size, then tough toenails. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh wrote: > Yes. If no one registers to listen on a particular CPU, data from tasks > exiting on that cpu is not sent out at all. Excellent. > So I chose to use the "cpulist" ascii format that has been helpfully > provided in include/linux/cpumask.h (by whom I wonder :-) Excellent. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Paul Jackson wrote: Shailabh wrote: Sends a separate "registration" message with cpumask to listen to. Kernel stores (real) pid and cpumask. Question: = Ah - good. So this means that I could configure a system with a fork/exit intensive, performance critical job on some dedicated CPUs, and be able to collect taskstat data from tasks exiting on the -other- CPUS, while avoiding collecting data from this special job, thus avoiding any taskstat collection performance impact on said job. If I'm understanding this correctly, excellent. Yes. If no one registers to listen on a particular CPU, data from tasks exiting on that cpu is not sent out at all. Caveat: === Passing cpumasks across the kernel-user boundary can be tricky. Historically, Unix has a long tradition of boloxing up the passing of variable length data types across the kernel-user boundary. We've got perhaps a half dozen ways of getting these masks out of the kernel, and three ways of getting them (or the similar nodemasks) back into the kernel. The three ways being used in the sched_setaffinity system call, the mbind and set_mempolicy system calls, and the cpuset file system. All three of these ways have their controversial details: * The kernel cpumask mask size needed for sched_setaffinity calls is not trivially available to userland. * The nodemask bit size is off by one in the mbind and set_mempolicy calls. * The CPU and Node masks are ascii, not binary, in the cpuset calls. One option that might make sense for these task stat registrations would be to: 1) make the kernel/sched.c get_user_cpu_mask() routine generic, moving it to non-static lib/*.c code, and 2) provide a sensible way for user space to query the size of the kernel cpumask (and perhaps nodemask while you're at it.) Currently, the best way I know for user space to query the kernels cpumask and nodemask size is to examine the length of the ascii string values labeled "Cpus_allowed:" and "Mems_allowed:" in the file /proc/self/status. 
These ascii strings always require exactly nine ascii chars to express each 32 bits of kernel mask code, if you include in the count the trailing ',' comma or '\n' newline after each eight ascii character word. Probing /proc/self/status fields for these mask sizes is rather unobvious and indirect, and requires caching the result if you care at all about performance. Userland code in support of your taskstat facility might be better served by a more obvious way to size cpumasks. ... unless of course you're inclined to pass cpumasks formatted as ascii strings, in which case speak up, as I'd be delighted to throw in my 2 cents on how to do that ;). Thanks for the size info. I did hit it while coding this up. So I chose to use the "cpulist" ascii format that has been helpfully provided in include/linux/cpumask.h (by whom I wonder :-) The user specifies the cpumask as an ascii string containing comma-separated cpu ranges. The kernel parses it and stores it as a cpumask_t, after which we can iterate over the mask using standard helpers. Since registration/deregistration is not a common operation, the overhead of parsing ascii strings should be acceptable and avoids the hassles of trying to determine kernel cpumask size. I don't know if there are buffer overflow issues in passing a string (though I'm using the standard netlink way of passing it up using NLA_STRING). Will post the patch shortly. --Shailabh
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh wrote: > Sends a separate "registration" message with cpumask to listen to. > Kernel stores (real) pid and cpumask. Question: = Ah - good. So this means that I could configure a system with a fork/exit intensive, performance critical job on some dedicated CPUs, and be able to collect taskstat data from tasks exiting on the -other- CPUS, while avoiding collecting data from this special job, thus avoiding any taskstat collection performance impact on said job. If I'm understanding this correctly, excellent. Caveat: === Passing cpumasks across the kernel-user boundary can be tricky. Historically, Unix has a long tradition of boloxing up the passing of variable length data types across the kernel-user boundary. We've got perhaps a half dozen ways of getting these masks out of the kernel, and three ways of getting them (or the similar nodemasks) back into the kernel. The three ways being used in the sched_setaffinity system call, the mbind and set_mempolicy system calls, and the cpuset file system. All three of these ways have their controversial details: * The kernel cpumask mask size needed for sched_setaffinity calls is not trivially available to userland. * The nodemask bit size is off by one in the mbind and set_mempolicy calls. * The CPU and Node masks are ascii, not binary, in the cpuset calls. One option that might make sense for these task stat registrations would be to: 1) make the kernel/sched.c get_user_cpu_mask() routine generic, moving it to non-static lib/*.c code, and 2) provide a sensible way for user space to query the size of the kernel cpumask (and perhaps nodemask while you're at it.) Currently, the best way I know for user space to query the kernels cpumask and nodemask size is to examine the length of the ascii string values labeled "Cpus_allowed:" and "Mems_allowed:" in the file /proc/self/status. 
These ascii strings always require exactly nine ascii chars to express each 32 bits of kernel mask code, if you include in the count the trailing ',' comma or '\n' newline after each eight ascii character word. Probing /proc/self/status fields for these mask sizes is rather unobvious and indirect, and requires caching the result if you care at all about performance. Userland code in support of your taskstat facility might be better served by a more obvious way to size cpumasks. ... unless of course you're inclined to pass cpumasks formatted as ascii strings, in which case speak up, as I'd be delighted to throw in my 2 cents on how to do that ;). -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Fri, 30 Jun 2006 23:37:10 -0400 Shailabh Nagar <[EMAIL PROTECTED]> wrote: > >Set aside the implementation details and ask "what is a good design"? > > > >A kernel-wide constant, whether determined at build-time or by a /proc poke > >isn't a nice design. > > > >Can we permit userspace to send in a netlink message describing a cpumask? > >That's back-compatible. > > > > > Yes, that should be doable. And passing in a cpumask is much better > since we no longer > have to maintain mappings. > > So the strawman is: > Listener bind()s to genetlink using its real pid. > Sends a separate "registration" message with cpumask to listen to. > Kernel stores (real) pid and cpumask. > During task exit, kernel goes through each registered listener (small > list) and decides which > one needs to get this exit data and calls a genetlink_unicast to each > one that does need it. > > If number of listeners is small, the lookups should be swift enough. If > it grows large, we > can consider a fancier lookup (but there I go again, delving into > implementation too early :-) We'll need a map. 1024 CPUs, 1024 listeners, 1000 exits/sec/CPU and we're up to a million operations per second per CPU. Meltdown. But it's a pretty simple map. A per-cpu array of pointers to the head of a linked list. One lock for each CPU's list. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew Morton wrote: On Fri, 30 Jun 2006 22:20:23 -0400 Shailabh Nagar <[EMAIL PROTECTED]> wrote: If we're going to abuse nl_pid then how about we design things so that nl_pid is treated as two 16-bit words - one word is the start CPU and the other word is the end cpu? Or, if a 65536-CPU limit is too scary, make the bottom 8 bits of nl_pid be the number of CPUS (ie: TASKSTATS_CPUS_PER_SET) and the top 24 bits is the starting CPU. It'd be better to use a cpumask, of course.. All these options mean each listener gets to pick a "custom" range of cpus to listen on, rather than choose one of pre-defined ranges (even if the pre-defined ranges can change by a configurable TASKSTATS_CPUS_PER_SET). Which means the kernel side has to figure out which of the listeners cpu range includes the currently exiting task's cpu. To do this, we'll need a callback from the binding of the netlink socket (so taskstats can maintain the cpu -> nl_pid mappings at any exit). The current genetlink interface doesn't have that kind of flexibility (though it can be added I'm sure). Seems a bit involved if the primary aim is to restrict the number of cpus that one listener wants to listen, rather than be able to pick which ones. A configurable range won't suffice ? Set aside the implementation details and ask "what is a good design"? A kernel-wide constant, whether determined at build-time or by a /proc poke isn't a nice design. Can we permit userspace to send in a netlink message describing a cpumask? That's back-compatible. Yes, that should be doable. And passing in a cpumask is much better since we no longer have to maintain mappings. So the strawman is: Listener bind()s to genetlink using its real pid. Sends a separate "registration" message with cpumask to listen to. Kernel stores (real) pid and cpumask. During task exit, kernel goes through each registered listener (small list) and decides which one needs to get this exit data and calls a genetlink_unicast to each one that does need it. 
If number of listeners is small, the lookups should be swift enough. If it grows large, we can consider a fancier lookup (but there I go again, delving into implementation too early :-) Sounds good to me ! --Shailabh
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Fri, 30 Jun 2006 22:20:23 -0400 Shailabh Nagar <[EMAIL PROTECTED]> wrote: > >If we're going to abuse nl_pid then how about we design things so that > >nl_pid is treated as two 16-bit words - one word is the start CPU and the > >other word is the end cpu? > > > >Or, if a 65536-CPU limit is too scary, make the bottom 8 bits of nl_pid be > >the number of CPUS (ie: TASKSTATS_CPUS_PER_SET) and the top 24 bits is the > >starting CPU. > > > > > > > >It'd be better to use a cpumask, of course.. > > > > > All these options mean each listener gets to pick a "custom" range of > cpus to listen on, > rather than choose one of pre-defined ranges (even if the pre-defined > ranges can change > by a configurable TASKSTATS_CPUS_PER_SET). Which means the kernel side > has to > figure out which of the listeners cpu range includes the currently > exiting task's cpu. To do > this, we'll need a callback from the binding of the netlink socket (so > taskstats can maintain > the cpu -> nl_pid mappings at any exit). > The current genetlink interface doesn't have that kind of flexibility > (though it can be added > I'm sure). > > Seems a bit involved if the primary aim is to restrict the number of > cpus that one listener > wants to listen, rather than be able to pick which ones. > > A configurable range won't suffice ? > Set aside the implementation details and ask "what is a good design"? A kernel-wide constant, whether determined at build-time or by a /proc poke isn't a nice design. Can we permit userspace to send in a netlink message describing a cpumask? That's back-compatible. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew Morton wrote: Shailabh Nagar <[EMAIL PROTECTED]> wrote: +/* + * Per-task exit data sent from the kernel to user space + * is tagged by an id based on grouping of cpus. + * + * If userspace specifies a non-zero P as the nl_pid field of + * the sockaddr_nl structure while binding to a netlink socket, + * it will receive exit data from threads that exited on cpus in the range + * + *[(P-1)*Y, P*Y-1] + * + * where Y = TASKSTATS_CPUS_PER_SET + * i.e. if TASKSTATS_CPUS_PER_SET is 16, + * to listen to data from cpus 0..15, specify P=1 + * for cpus 16..32, specify P=2 etc. + * + * To listen to data from all cpus, userspace should use P=0 + */ + +#define TASKSTATS_CPUS_PER_SET 16 The constant is unpleasant. I was planning to make it configurable. But that would still not be as flexible as below... If we're going to abuse nl_pid then how about we design things so that nl_pid is treated as two 16-bit words - one word is the start CPU and the other word is the end cpu? Or, if a 65536-CPU limit is too scary, make the bottom 8 bits of nl_pid be the number of CPUS (ie: TASKSTATS_CPUS_PER_SET) and the top 24 bits is the starting CPU. It'd be better to use a cpumask, of course.. All these options mean each listener gets to pick a "custom" range of cpus to listen on, rather than choose one of pre-defined ranges (even if the pre-defined ranges can change by a configurable TASKSTATS_CPUS_PER_SET). Which means the kernel side has to figure out which of the listeners cpu range includes the currently exiting task's cpu. To do this, we'll need a callback from the binding of the netlink socket (so taskstats can maintain the cpu -> nl_pid mappings at any exit). The current genetlink interface doesn't have that kind of flexibility (though it can be added I'm sure). Seems a bit involved if the primary aim is to restrict the number of cpus that one listener wants to listen, rather than be able to pick which ones. A configurable range won't suffice ? 
--Shailabh
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh Nagar <[EMAIL PROTECTED]> wrote: > > Based on previous discussions, the above solutions can be expanded/modified > to: > > a) allow userspace to listen to a group of cpus instead of all. Multiple > collection daemons can distribute the load as you pointed out. Doing > collection > by cpu groups rather than individual cpus reduces the aggregation burden on > userspace (and scales better with NR_CPUS) > > b) do flow control on the kernel send side. This can involve buffering and > sending > later (to handle bursty case) or dropping (to handle sustained load) as > pointed out > by you, Jamal in other threads. > > c) increase receiver's socket buffer. This can and should always be done but > no > involvement needed. > > > With regards to taskstats changes to handle the problem and its impact on > userspace > visible changes, > > a) will change userspace > b) will be transparent. > c) is immaterial going forward (except perhaps as a change in Documentation) > > > I'm sending a patch that demonstrates how a) can be done quite simply > and a patch for b) is in progress. > > If the approach suggested in patch a) is acceptable (and I'll provide the > testing, stability > results once comments on it are largely over), could taskstats acceptance in > 2.6.18 go ahead > and patch b) be added later (solution outline has already been provided and a > prelim patch should > be out by eod) Throwing more CPUs at the problem makes heaps of sense. It's not necessarily a userspace-incompatible change. As long as userspace sets nl_pid to 0x, future kernel revisions can treat that as "all CPUs". Or userspace can be forward-compatible by setting nl_pid to 0x, or whatever. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh Nagar <[EMAIL PROTECTED]> wrote: > > +/* > + * Per-task exit data sent from the kernel to user space > + * is tagged by an id based on grouping of cpus. > + * > + * If userspace specifies a non-zero P as the nl_pid field of > + * the sockaddr_nl structure while binding to a netlink socket, > + * it will receive exit data from threads that exited on cpus in the range > + * > + *[(P-1)*Y, P*Y-1] > + * > + * where Y = TASKSTATS_CPUS_PER_SET > + * i.e. if TASKSTATS_CPUS_PER_SET is 16, > + * to listen to data from cpus 0..15, specify P=1 > + * for cpus 16..32, specify P=2 etc. > + * > + * To listen to data from all cpus, userspace should use P=0 > + */ > + > +#define TASKSTATS_CPUS_PER_SET 16 The constant is unpleasant. If we're going to abuse nl_pid then how about we design things so that nl_pid is treated as two 16-bit words - one word is the start CPU and the other word is the end cpu? Or, if a 65536-CPU limit is too scary, make the bottom 8 bits of nl_pid be the number of CPUS (ie: TASKSTATS_CPUS_PER_SET) and the top 24 bits is the starting CPU. It'd be better to use a cpumask, of course.. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Fri, 2006-30-06 at 15:10 -0400, Shailabh Nagar wrote: > > Also to get feedback on this kind of usage of the nl_pid field, the > approach etc. > It does not look unreasonable. I think you may have issues when you have multiple such sockets opened within a single process. But do some testing and see how it goes. cheers, jamal
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh Nagar wrote: > Shailabh Nagar wrote: > > > Index: linux-2.6.17-mm3equiv/kernel/taskstats.c > === > --- linux-2.6.17-mm3equiv.orig/kernel/taskstats.c 2006-06-30 > 11:57:14.0 -0400 > +++ linux-2.6.17-mm3equiv/kernel/taskstats.c 2006-06-30 13:58:36.0 > -0400 > @@ -266,7 +266,7 @@ void taskstats_exit_send(struct task_str > struct sk_buff *rep_skb; > void *reply; > size_t size; > - int is_thread_group; > + int is_thread_group, setid; > struct nlattr *na; > > if (!family_registered || !tidstats) > @@ -320,7 +320,8 @@ void taskstats_exit_send(struct task_str > nla_nest_end(rep_skb, na); > > send: > - send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); > + setid = (smp_processor_id()%TASKSTATS_CPUS_PER_SET)+1; > + send_reply(rep_skb, setid, TASKSTATS_MSG_MULTICAST); This should be send_reply(rep_skb, setid, TASKSTATS_MSG_UNICAST); > return; > > nla_put_failure: > Index: linux-2.6.17-mm3equiv/Documentation/accounting/getdelays.c > === > --- linux-2.6.17-mm3equiv.orig/Documentation/accounting/getdelays.c > 2006-06-28 16:08:56.0 -0400 > +++ linux-2.6.17-mm3equiv/Documentation/accounting/getdelays.c > 2006-06-30 14:09:28.0 -0400 > @@ -40,7 +40,7 @@ int done = 0; > /* > * Create a raw netlink socket and bind > */ > -static int create_nl_socket(int protocol, int groups) > +static int create_nl_socket(int protocol, int cpugroup) > { > socklen_t addr_len; > int fd; > @@ -52,7 +52,8 @@ static int create_nl_socket(int protocol > > memset(&local, 0, sizeof(local)); > local.nl_family = AF_NETLINK; > -local.nl_groups = groups; > +local.nl_groups = TASKSTATS_LISTEN_GROUP; > +local.nl_pid = cpugroup; > > if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0) > goto error; > @@ -203,7 +204,7 @@ int main(int argc, char *argv[]) > pid_t rtid = 0; > int cmd_type = TASKSTATS_TYPE_TGID; > int c, status; > -int forking = 0; > +int forking = 0, cpugroup = 0; > struct sigaction act = { > .sa_handler = SIG_IGN, > .sa_mask = SA_NOMASK, > @@ -222,7 +223,7 @@ int main(int argc, char 
*argv[]) > > while (1) { > > - c = getopt(argc, argv, "t:p:c:"); > + c = getopt(argc, argv, "t:p:c:g:l"); > if (c < 0) > break; > > @@ -252,8 +253,14 @@ int main(int argc, char *argv[]) > } > forking = 1; > break; > + case 'g': > + cpugroup = atoi(optarg); > + break; > + case 'l': > + loop = 1; > + break; > default: > - printf("usage %s [-t tgid][-p pid][-c cmd]\n", argv[0]); > + printf("usage %s [-t tgid][-p pid][-c cmd][-g cpugroup][-l]\n", > argv[0]); > exit(-1); > break; > } > @@ -266,7 +273,7 @@ int main(int argc, char *argv[]) > /* Send Netlink request message & get reply */ > > if ((nl_sd = > - create_nl_socket(NETLINK_GENERIC, TASKSTATS_LISTEN_GROUP)) < 0) > + create_nl_socket(NETLINK_GENERIC, cpugroup)) < 0) > err(1, "error creating Netlink socket\n"); > > > @@ -287,10 +294,10 @@ int main(int argc, char *argv[]) > > > if (!forking && sendto_fd(nl_sd, (char *) &req, req.n.nlmsg_len) < 0) > +if ((!forking && !loop) && > + sendto_fd(nl_sd, (char *) &req, req.n.nlmsg_len) < 0) > err(1, "error sending message via Netlink\n"); > > -act.sa_handler = SIG_IGN; > -sigemptyset(&act.sa_mask); > if (sigaction(SIGINT, &act, NULL) < 0) > err(1, "sigaction failed for SIGINT\n"); > > @@ -349,10 +356,11 @@ int main(int argc, char *argv[]) > rtid = *(int *) NLA_DATA(na); > break; > case TASKSTATS_TYPE_STATS: > - if (rtid == tid) { > + if (rtid == tid || loop) { > print_taskstats((struct taskstats *) > NLA_DATA(na)); > - done = 1; > + if (!loop) > + done = 1; > } > break; > } > @@ -369,7 +377,7 @@ int main(int argc, char *argv[]) > if (done) > break; > } > -while (1); > +while (loop); > > close(nl_sd); > return 0; > - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
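The send-side hunk above derives a set id from the exiting task's CPU. Per the header comment in the patch (set P covers cpus [(P-1)*Y, P*Y-1]), the mapping is integer division, which suggests the posted `smp_processor_id() % TASKSTATS_CPUS_PER_SET` was meant to be `/`. A minimal userspace sketch of the documented mapping (function name hypothetical):

```c
#define TASKSTATS_CPUS_PER_SET 16

/* Set id P covering a given cpu, per the patch's header comment:
 * P covers cpus [(P-1)*Y, P*Y-1] where Y = TASKSTATS_CPUS_PER_SET,
 * hence P = cpu / Y + 1.  P = 0 remains "all cpus". */
static int cpu_to_setid(int cpu)
{
        return cpu / TASKSTATS_CPUS_PER_SET + 1;
}
```

With this mapping, a listener that bound with nl_pid = 2 receives exit data for cpus 16..31, matching the comment in taskstats.h.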
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh Nagar wrote: > Andrew, > > Based on previous discussions, the above solutions can be expanded/modified > to: > > a) allow userspace to listen to a group of cpus instead of all. Multiple > collection daemons can distribute the load as you pointed out. Doing > collection > by cpu groups rather than individual cpus reduces the aggregation burden on > userspace (and scales better with NR_CPUS) > I'm sending a patch that demonstrates how a) can be done quite simply > and a patch for b) is in progress. > Here's the patch. Testing etc. need to be done (an earlier version that did per-cpu queues has worked) but the main point is to show how small a change is needed in the interface (on both the kernel and user side) and current codebase to achieve the a) solution. Also to get feedback on this kind of usage of the nl_pid field, the approach etc. Thanks, Shailabh === On systems with a large number of cpus, with even a modest rate of tasks exiting per cpu, the volume of taskstats data sent on thread exit can overflow a userspace listener's buffers. One approach to avoiding overflow is to allow listeners to get data for a limited number of cpus. By scaling the number of listening programs, each listening to a different set of cpus, userspace can avoid more overflow situations. This patch implements this idea by creating simple grouping of cpus and allowing userspace to listen to any cpu group it chooses. Alternative designs considered and rejected were: - creating a separate netlink group for each group of cpus. Since only 32 netlink groups can be specified by a user, this option will not scale with number of cpus. - aligning the grouping of cpus with cpusets. The unnecessary tying together of the two functionalities was not merited. Thanks to Balbir Singh for discovering the potential use of the pid field of sockaddr_nl as a communication subchannel in the same socket, Paul Jackson and Vivek Kashyap for suggesting cpus be grouped together for data send purposes. 
Signed-Off-By: Shailabh Nagar <[EMAIL PROTECTED]> Signed-Off-By: Balbir Singh <[EMAIL PROTECTED]> Documentation/accounting/getdelays.c | 30 +++--- include/linux/taskstats.h| 22 ++ kernel/taskstats.c |5 +++-- 3 files changed, 44 insertions(+), 13 deletions(-) Index: linux-2.6.17-mm3equiv/include/linux/taskstats.h === --- linux-2.6.17-mm3equiv.orig/include/linux/taskstats.h2006-06-30 11:57:14.0 -0400 +++ linux-2.6.17-mm3equiv/include/linux/taskstats.h 2006-06-30 14:24:49.0 -0400 @@ -89,6 +89,28 @@ struct taskstats { #define TASKSTATS_LISTEN_GROUP 0x1 + +/* + * Per-task exit data sent from the kernel to user space + * is tagged by an id based on grouping of cpus. + * + * If userspace specifies a non-zero P as the nl_pid field of + * the sockaddr_nl structure while binding to a netlink socket, + * it will receive exit data from threads that exited on cpus in the range + * + *[(P-1)*Y, P*Y-1] + * + * where Y = TASKSTATS_CPUS_PER_SET + * i.e. if TASKSTATS_CPUS_PER_SET is 16, + * to listen to data from cpus 0..15, specify P=1 + * for cpus 16..32, specify P=2 etc. + * + * To listen to data from all cpus, userspace should use P=0 + */ + +#define TASKSTATS_CPUS_PER_SET 16 + + /* * Commands sent from userspace * Not versioned. 
New commands should only be inserted at the enum's end Index: linux-2.6.17-mm3equiv/kernel/taskstats.c === --- linux-2.6.17-mm3equiv.orig/kernel/taskstats.c 2006-06-30 11:57:14.0 -0400 +++ linux-2.6.17-mm3equiv/kernel/taskstats.c2006-06-30 13:58:36.0 -0400 @@ -266,7 +266,7 @@ void taskstats_exit_send(struct task_str struct sk_buff *rep_skb; void *reply; size_t size; - int is_thread_group; + int is_thread_group, setid; struct nlattr *na; if (!family_registered || !tidstats) @@ -320,7 +320,8 @@ void taskstats_exit_send(struct task_str nla_nest_end(rep_skb, na); send: - send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + setid = (smp_processor_id()%TASKSTATS_CPUS_PER_SET)+1; + send_reply(rep_skb, setid, TASKSTATS_MSG_MULTICAST); return; nla_put_failure: Index: linux-2.6.17-mm3equiv/Documentation/accounting/getdelays.c === --- linux-2.6.17-mm3equiv.orig/Documentation/accounting/getdelays.c 2006-06-28 16:08:56.0 -0400 +++ linux-2.6.17-mm3equiv/Documentation/accounting/getdelays.c 2006-06-30 14:09:28.0 -0400 @@ -40,7 +40,7 @@ int done = 0; /* * Create a raw netlink socket and bind */ -static int create_nl_socket(int protocol, int groups) +static int create_nl_socket(int protocol, int cpugrou
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew Morton wrote: > On Thu, 29 Jun 2006 09:44:08 -0700 > Paul Jackson <[EMAIL PROTECTED]> wrote: > > >>>You're probably correct on that model. However, it all depends on the actual >>>workload. Are people who actually have large-CPU (>256) systems actually >>>running fork()-heavy things like webservers on them, or are they running >>>things >>>like database servers and computations, which tend to have persistent >>>processes? >> >>It may well be mostly as you say - the large-CPU systems not running >>the fork() heavy jobs. >> >>Sooner or later, someone will want to run a fork()-heavy job on a >>large-CPU system. On a 1024 CPU system, it would apparently take >>just 14 exits/sec/CPU to hit this bottleneck, if Jay's number of >>14000 applied. >> >>Chris Sturdivant's reply is reasonable -- we'll hit it sooner or later, >>and deal with it then. >> > > > I agree, and I'm viewing this as blocking the taskstats merge. Because if > this _is_ a problem then it's a big one because fixing it will be > intrusive, and might well involve userspace-visible changes. > > The only ways I can see of fixing the problem generally are to either > > a) throw more CPU(s) at stats collection: allow userspace to register for >"stats generated by CPU N", then run a stats collection daemon on each >CPU or > > b) make the kernel recognise when it's getting overloaded and switch to >some degraded mode where it stops trying to send all the data to >userspace - just send a summary, or a "we goofed" message or something. Andrew, Based on previous discussions, the above solutions can be expanded/modified to: a) allow userspace to listen to a group of cpus instead of all. Multiple collection daemons can distribute the load as you pointed out. Doing collection by cpu groups rather than individual cpus reduces the aggregation burden on userspace (and scales better with NR_CPUS) b) do flow control on the kernel send side. 
This can involve buffering and sending later (to handle bursty case) or dropping (to handle sustained load) as pointed out by you, Jamal in other threads. c) increase receiver's socket buffer. This can and should always be done but no involvement needed. With regards to taskstats changes to handle the problem and its impact on userspace visible changes, a) will change userspace b) will be transparent. c) is immaterial going forward (except perhaps as a change in Documentation) I'm sending a patch that demonstrates how a) can be done quite simply and a patch for b) is in progress. If the approach suggested in patch a) is acceptable (and I'll provide the testing, stability results once comments on it are largely over), could taskstats acceptance in 2.6.18 go ahead and patch b) be added later (solution outline has already been provided and a prelim patch should be out by eod) --Shailabh - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
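Option (c) above, enlarging the receiver's socket buffer, needs no kernel-side change at all; it is one setsockopt() in the listener. A sketch, assuming an already-created netlink fd (note the kernel clamps requests above net.core.rmem_max unless the privileged SO_RCVBUFFORCE option is used):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Option (c): enlarge the listener's receive buffer so bursts of exit
 * records queue up instead of being dropped.  Userspace-only change. */
static int set_rcvbuf(int fd, int bytes)
{
        return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}
```

In getdelays.c this would be called on the netlink socket right after socket() and before bind(); it only raises the point at which ENOBUFS starts appearing, so it complements rather than replaces (a) and (b).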
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Thu, 2006-29-06 at 23:01 -0400, Shailabh Nagar wrote: > jamal wrote: > > > > > >>As long as the user is willing to pay the price in terms of memory, > >> > >> > > > >You may wanna draw a line to the upper limit - maybe even allocate slab > >space. > > > > > Didn't quite understand...could you please elaborate ? > Today we have a slab cache from which the taskstats structure gets > allocated at the beginning > of the exit() path. > The upper limit to which you refer is the amount of slab memory the user > is willing to be used > to store the bursty traffic ? > I think you have it fine already if you have a slab - as long as you know you will run out of space and have some strategy to deal with such boundary conditions. I was only reacting to your statement "As long as the user is willing to pay the price in terms of memory" I think you meant that a user could adjust the slab size on bootup etc, but it is finite in size. cheers, jamal - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
jamal wrote: On Thu, 2006-29-06 at 21:11 -0400, Shailabh Nagar wrote: Andrew Morton wrote: Shailabh Nagar <[EMAIL PROTECTED]> wrote: [..] So if we can detect the silly sustained-high-exit-rate scenario then it seems to me quite legitimate to do some aggressive data reduction on that. Like, a single message which says "20,000 sub-millisecond-runtime tasks exited in the past second" or something. The "buffering within taskstats" might be a way out then. That's what it looks like. As long as the user is willing to pay the price in terms of memory, You may wanna draw a line to the upper limit - maybe even allocate slab space. Didn't quite understand...could you please elaborate ? Today we have a slab cache from which the taskstats structure gets allocated at the beginning of the exit() path. The upper limit to which you refer is the amount of slab memory the user is willing to have used to store the bursty traffic ? we can collect the exiting task's taskstats data but not send it immediately (taskstats_cache would grow) unless a high water mark had been crossed. Otherwise a timer event would do the sends of accumulated taskstats (not all at once but iteratively if necessary). Sounds reasonable. That's what xfrm events do. Try to have those parameters settable because different machines or users may have different views as to what is proper - maybe even as simple as a sysctl. Sounds good. At task exit, despite doing a few rounds of sending of pending data, if netlink were still reporting errors then it would be a sign of an unsustainable rate and the pending queue could be dropped and a message like you suggest could be sent. When you send inside the kernel - you will get an error if there's problems sending to the socket queue. So you may wanna use that info to release the kernel allocated entries or keep them for a little longer. Hopefully that helps. Yes it does. Thanks for the tips. Will code up something and send out so this can become more concrete. 
--Shailabh cheers, jamal - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew wrote: > Nah. Stick it in the same cacheline as tasklist_lock (I'm amazed that > we've continued to get away with a global lock for that). Yes - a bit amazing. But no sense compounding the problem now. We shouldn't be adding global locks/modifiable data in the fork/exit code path if we can help it, without at least providing some simple way to ameliorate the problem when folks do start hitting it. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Thu, 29 Jun 2006 19:25:26 -0700 Paul Jackson <[EMAIL PROTECTED]> wrote: > Andrew wrote: > > Like, a single message which says "20,000 sub-millisecond-runtime tasks > > exited in the past second" or something. > > System wide accumulation of such data in the exit() code path still > risks being a bottleneck, just a bit later on. Nah. Stick it in the same cacheline as tasklist_lock (I'm amazed that we've continued to get away with a global lock for that). - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew wrote: > Like, a single message which says "20,000 sub-millisecond-runtime tasks > exited in the past second" or something. System wide accumulation of such data in the exit() code path still risks being a bottleneck, just a bit later on. I'm more inclined now to look for ways to disable collection on some CPUs, and/or to allow for multiple streams in the future, as need be, along the lines of Shailabh's multiple TASKSTATS_LISTEN_GROUPs. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Thu, 2006-29-06 at 21:11 -0400, Shailabh Nagar wrote: > Andrew Morton wrote: > > >Shailabh Nagar <[EMAIL PROTECTED]> wrote: [..] > >So if we can detect the silly sustained-high-exit-rate scenario then it > >seems to me quite legitimate to do some aggressive data reduction on that. > >Like, a single message which says "20,000 sub-millisecond-runtime tasks > >exited in the past second" or something. > > > > > The "buffering within taskstats" might be a way out then. Thats what it looks like. > As long as the user is willing to pay the price in terms of memory, You may wanna draw a line to the upper limit - maybe even allocate slab space. > we can collect the exiting task's taskstats data but not send it > immediately (taskstats_cache would grow) > unless a high water mark had been crossed. Otherwise a timer event would do > the > sends of accumalated taskstats (not all at once but > iteratively if necessary). > Sounds reasonable. Thats what xfrm events do. Try to have those parameters settable because different machines or users may have different view as to what is proper - maybe even as simple as sysctl. > At task exit, despite doing a few rounds of sending of pending data, if > netlink were still reporting errors > then it would be a sign of unsustainable rate and the pending queue > could be dropped and a message like you suggest could be sent. > When you send inside the kernel - you will get an error if there's problems sending to the socket queue. So you may wanna use that info to release the kernel allocated entries or keep them for a little longer. Hopefully that helps. cheers, jamal - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew Morton wrote: Shailabh Nagar <[EMAIL PROTECTED]> wrote: The rates (or upper bounds) that are being discussed here, as of now, are 1000 exits/sec/CPU for 1024 CPU systems. That would be roughly 1M exits/system * 248 bytes/message = 248 MB/sec. I think it's worth differentiating between burst rates and sustained rates here. One could easily imagine 10,000 threads all exiting at once, and the user being interested in reliably collecting the results. But if the machine is _sustaining_ such a high rate then that means that these exiting tasks all have a teeny runtime and the user isn't going to be interested in the per-thread statistics. So if we can detect the silly sustained-high-exit-rate scenario then it seems to me quite legitimate to do some aggressive data reduction on that. Like, a single message which says "20,000 sub-millisecond-runtime tasks exited in the past second" or something. The "buffering within taskstats" might be a way out then. As long as the user is willing to pay the price in terms of memory, we can collect the exiting task's taskstats data but not send it immediately (taskstats_cache would grow) unless a high water mark had been crossed. Otherwise a timer event would do the sends of accumulated taskstats (not all at once but iteratively if necessary). At task exit, despite doing a few rounds of sending of pending data, if netlink were still reporting errors then it would be a sign of an unsustainable rate and the pending queue could be dropped and a message like you suggest could be sent. Thoughts ? --Shailabh - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
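The high-water-mark buffering proposed here can be modeled in a few lines. This is a toy model of the idea only, with hypothetical names; the real scheme would also need a timer to flush stragglers and an error path to drop the queue when netlink keeps failing:

```c
/* Toy model of "buffering within taskstats": exit records accumulate
 * and are only pushed to userspace once a high-water mark is crossed.
 * A periodic timer (not modeled) would flush partial batches. */
#define HIGH_WATER 64

struct pending_stats {
        int queued;     /* records accumulated but not yet sent */
        int flushes;    /* how many batches were pushed to userspace */
};

/* Returns the number of records sent by this call (0 while buffering). */
static int queue_exit_record(struct pending_stats *p)
{
        if (++p->queued < HIGH_WATER)
                return 0;               /* keep buffering */
        p->flushes++;
        int sent = p->queued;
        p->queued = 0;                  /* send path drained the queue */
        return sent;
}
```

Making HIGH_WATER (and the flush interval) tunable, e.g. via sysctl as jamal suggests elsewhere in the thread, lets different machines trade memory for burst tolerance.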
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh Nagar <[EMAIL PROTECTED]> wrote: > > The rates (or upper bounds) that are being discussed here, as of now, > are 1000 exits/sec/CPU for > 1024 CPU systems. That would be roughly 1M exits/system * > 248Bytes/message = 248 MB/sec. I think it's worth differentiating between burst rates and sustained rates here. One could easily imagine 10,000 threads all exiting at once, and the user being interested in reliably collecting the results. But if the machine is _sustaining_ such a high rate then that means that these exiting tasks all have a teeny runtime and the user isn't going to be interested in the per-thread statistics. So if we can detect the silly sustained-high-exit-rate scenario then it seems to me quite legitimate to do some aggressive data reduction on that. Like, a single message which says "20,000 sub-millisecond-runtime tasks exited in the past second" or something. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
jamal wrote: On Thu, 2006-29-06 at 16:01 -0400, Shailabh Nagar wrote: Jamal, any thoughts on the flow control capabilities of netlink that apply here ? Usage of the connection is to supply statistics data to userspace. if you want reliable delivery, then you cant just depend on async events from the kernel -> user - which i am assuming is the way stats get delivered as processes exit? Yes. Sorry, i dont remember the details. You need some synchronous scheme to ask the kernel to do a "get" or "dump". Oh, yes. Dump is synchronous. So it won't be useful unless we buffer task exit records within taskstats. Lets be clear about one thing: The problem really has nothing to do with gen/netlink or any other scheme you use;-> It has everything to do with reliability implications and the fact that you need to assume memory is a finite resource - at one point or another you will run out of memory ;-> And of course then messages will be lost. So for gen/netlink, just make sure you have large socket buffer and you would most likely be fine. I havent seen how the numbers were reached: But if you say you receive 14K exits/sec each of which is a 50B message, I would think a 1M socket buffer would be plenty. The rates (or upper bounds) that are being discussed here, as of now, are 1000 exits/sec/CPU for 1024 CPU systems. That would be roughly 1M exits/system * 248Bytes/message = 248 MB/sec. You can find out about lack of memory in netlink when you get a ENOBUFS. As an example, you should then do a kernel query. Clearly if you do a query of that sort, you may not want to find obsolete info. Therefore, as a suggestion, you may want to keep sequence numbers of sorts as markers. Perhaps keep a 32-bit field which monotically increases per process exit or use the pid as the sequence number etc.. As for throttling - Shailabh, I think we talked about this: - You could maintain info using some thresholds and timer. Then when a timer expires or threshold is exceeded send to user space. Hmm. 
So we could buffer the per-task exit data within taskstats (the mem consumption would grow but thats probably not a problem) and then send it out later. Jay - would not getting exit data soon after exit be a problem for CSA ? I'm guessing not, if the timeout is kept small enough. Internally, taskstats could always pace its sends so that "too much" isn't sent out at one shot. --Shailabh - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Thu, 2006-29-06 at 18:13 -0400, Shailabh Nagar wrote: > > And now I remember why I didn't go down that path earlier. Relayfs is one-way > kernel->user and lacks the ability to send query commands from user space > that we need. Either we would need to send commands up through a separate > interface > (even a syscall) or try and ensure that the exiting genetlink interface can > scale better with message volume (including throttling). Refer to my other email - whatever it takes to store "bulk" data in the kernel is subject to the constraint of the fact memory is finite. You can send messages from the kernel in sizes constrained by the memory socket size. You can tune the socket size. cheers, jamal - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Thu, 2006-29-06 at 16:01 -0400, Shailabh Nagar wrote: > > Jamal, > any thoughts on the flow control capabilities of netlink that apply here > ? Usage of the connection is to supply statistics data to userspace. > if you want reliable delivery, then you can't just depend on async events from the kernel -> user - which i am assuming is the way stats get delivered as processes exit? Sorry, i don't remember the details. You need some synchronous scheme to ask the kernel to do a "get" or "dump". Let's be clear about one thing: the problem really has nothing to do with gen/netlink or any other scheme you use;-> It has everything to do with reliability implications and the fact that you need to assume memory is a finite resource - at one point or another you will run out of memory ;-> And of course then messages will be lost. So for gen/netlink, just make sure you have a large socket buffer and you would most likely be fine. I haven't seen how the numbers were reached: but if you say you receive 14K exits/sec each of which is a 50B message, I would think a 1M socket buffer would be plenty. You can find out about lack of memory in netlink when you get an ENOBUFS. As an example, you should then do a kernel query. Clearly if you do a query of that sort, you may not want to find obsolete info. Therefore, as a suggestion, you may want to keep sequence numbers of sorts as markers. Perhaps keep a 32-bit field which monotonically increases per process exit or use the pid as the sequence number etc.. As for throttling - Shailabh, I think we talked about this: - You could maintain info using some thresholds and a timer. Then when a timer expires or a threshold is exceeded, send to user space. BTW, where are the doc fixes ? ;-> cheers, jamal - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
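The sequence-number suggestion amounts to gap detection on the listener side: after an ENOBUFS, compare the next record's sequence number against the last one seen to learn how many records were lost. A sketch (helper name hypothetical), using unsigned arithmetic so 32-bit wraparound falls out for free:

```c
#include <stdint.h>

/* Sketch of the listener-side check for the suggestion above: tag each
 * exit record with a 32-bit monotonically increasing sequence number.
 * Consecutive records differ by exactly 1; a larger difference is the
 * count of records dropped (e.g. after recv() returned ENOBUFS).
 * Unsigned subtraction makes wraparound at 2^32 harmless. */
static uint32_t records_missed(uint32_t last_seen, uint32_t just_received)
{
        return just_received - last_seen - 1;
}
```

On a gap the listener can then fall back to a synchronous "get"/"dump" query, knowing exactly how much async data it failed to collect.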
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew Morton wrote: >>Yup...the per-cpu, high speed requirements are up relayfs' alley, unless >>Jamal or netlink folks >>are planning something (or can shed light on) how large flows can be >>managed over netlink. I suspect >>this discussion has happened before :-) > > > yeah. And now I remember why I didn't go down that path earlier. Relayfs is one-way kernel->user and lacks the ability to send query commands from user space that we need. Either we would need to send commands up through a separate interface (even a syscall) or try and ensure that the existing genetlink interface can scale better with message volume (including throttling). --Shailabh - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh wrote: > How much memory do these 1024 CPU machines have From: http://www.hpcwire.com/hpc/653963.html (May 12, 2006) SGI has already shipped more than a dozen SGI systems with over a terabyte of memory and about a hundred systems of half a terabyte or larger. But the new Altix will have much larger memory capacities. The systems SGI has in mind will scale to tens of terabytes and beyond. In fact, a few SGI customers are already testing with systems in the 10-terabyte range. "The largest we have shipped is a 13-terabyte memory system for the Japan Atomic Energy Agency," said [SGI CTO Dr. Eng Lim] Goh. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew Morton wrote: On Thu, 29 Jun 2006 15:10:31 -0400 Shailabh Nagar <[EMAIL PROTECTED]> wrote: I agree, and I'm viewing this as blocking the taskstats merge. Because if this _is_ a problem then it's a big one because fixing it will be intrusive, and might well involve userspace-visible changes. First off, just a reminder that this is inherently a netlink flow control issue...which was being exacerbated earlier by taskstats decision to send per-tgid data (no longer the case). But I'd like to know whats our target here ? How many messages per second do we want to be able to be sent and received without risking any loss of data ? Netlink will lose messages at a high enough rate so the design point will need to be known a bit. For statistics type usage of the genetlink/netlink, I would have thought that userspace, provided it is reliably informed about the loss of data through ENOBUFS, could take measures to just account for the missing data and carry on ? Could be so. But we need to understand how significant the impact of this will be in practice. We could find, once this is deployed is real production environments on large machines that the data loss is sufficiently common and sufficiently serious that the feature needs a lot of rework. Now there's always a risk of that sort of thing happening with all features, but it's usually not this evident so early in the development process. We need to get a better understanding of the risk before proceeding too far. Ok. I suppose we should first determine what number of tasks can be forked/exited at a sustained rate on these m/c's and that would be one upper bound. Paul, Chris, Jay, What total exit rate would be a good upper bound ? How much memory do these 1024 CPU machines have (in high end configurations, not just based on 64-bit addressability) and how many tasks can actually be forked/exited in such a machine ? And there's always a 100% reliable fix for this: throttling. 
Make the sender of the messages block until the consumer can catch up. In some situations, that is what people will want to be able to do. Is this really an option for taskstats ? Allowing exits to get throttled ? I suppose its one way but seems like overkill for something like stats. I suspect a good implementation would be to run a collection daemon on each CPU and make the delivery be cpu-local. That's sounding more like relayfs than netlink. Yup...per-cpu, high speed delivery is looking like relayfs alright. One option that we've not explored in detail is the "dump" functionality of genetlink which allows kernel space to keep getting called with skb's to fill until its done. How much buffering that affords us in the face of a slow user is not known. But if we're discussing large exit rates happening in a burst, not a sustained way, that may be one way out. Jamal, any thoughts on the flow control capabilities of netlink that apply here ? Usage of the connection is to supply statistics data to userspace. --Shailabh - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Thu, 29 Jun 2006 15:43:41 -0400 Shailabh Nagar <[EMAIL PROTECTED]> wrote:

>> Could be so. But we need to understand how significant the impact of
>> this will be in practice.
>>
>> We could find, once this is deployed in real production environments on
>> large machines, that the data loss is sufficiently common and
>> sufficiently serious that the feature needs a lot of rework.
>>
>> Now there's always a risk of that sort of thing happening with all
>> features, but it's usually not this evident so early in the development
>> process. We need to get a better understanding of the risk before
>> proceeding too far.
>>
>> And there's always a 100% reliable fix for this: throttling. Make the
>> sender of the messages block until the consumer can catch up.
>
> Is blocking exits an option ?

I think it has to be an option. I'm sure that some people under some
circumstances will just want to collect all the data, thank you very much.
And I doubt if it'll be a performance problem for them - the amount of CPU
time per exit will be small - if you're exiting at great frequency then
the stats collection overhead rises proportionately. That is to be
expected. There will be buffering in the channel, so we'd expect to gather
thousands of records per context switch.

>> In some situations, that is what people will want to be able to do. I
>> suspect a good implementation would be to run a collection daemon on
>> each CPU and make the delivery be cpu-local. That's sounding more like
>> relayfs than netlink.
>
> Yup...the per-cpu, high speed requirements are up relayfs' alley, unless
> Jamal or netlink folks are planning something (or can shed light on) how
> large flows can be managed over netlink. I suspect this discussion has
> happened before :-)

yeah.
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew Morton wrote:
> On Thu, 29 Jun 2006 15:10:31 -0400 Shailabh Nagar <[EMAIL PROTECTED]> wrote:
>
>>> I agree, and I'm viewing this as blocking the taskstats merge. Because
>>> if this _is_ a problem then it's a big one because fixing it will be
>>> intrusive, and might well involve userspace-visible changes.
>>
>> First off, just a reminder that this is inherently a netlink flow
>> control issue...which was being exacerbated earlier by taskstats
>> decision to send per-tgid data (no longer the case).
>>
>> But I'd like to know what's our target here ? How many messages per
>> second do we want to be able to be sent and received without risking
>> any loss of data ? Netlink will lose messages at a high enough rate so
>> the design point will need to be known a bit.
>>
>> For statistics type usage of the genetlink/netlink, I would have
>> thought that userspace, provided it is reliably informed about the loss
>> of data through ENOBUFS, could take measures to just account for the
>> missing data and carry on ?
>
> Could be so. But we need to understand how significant the impact of
> this will be in practice.
>
> We could find, once this is deployed in real production environments on
> large machines, that the data loss is sufficiently common and
> sufficiently serious that the feature needs a lot of rework.
>
> Now there's always a risk of that sort of thing happening with all
> features, but it's usually not this evident so early in the development
> process. We need to get a better understanding of the risk before
> proceeding too far.
>
> And there's always a 100% reliable fix for this: throttling. Make the
> sender of the messages block until the consumer can catch up.

Is blocking exits an option ?

> In some situations, that is what people will want to be able to do. I
> suspect a good implementation would be to run a collection daemon on
> each CPU and make the delivery be cpu-local. That's sounding more like
> relayfs than netlink.

Yup...the per-cpu, high speed requirements are up relayfs' alley, unless
Jamal or netlink folks are planning something (or can shed light on) how
large flows can be managed over netlink. I suspect this discussion has
happened before :-)
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Thu, 29 Jun 2006 15:10:31 -0400 Shailabh Nagar <[EMAIL PROTECTED]> wrote:

>> I agree, and I'm viewing this as blocking the taskstats merge. Because
>> if this _is_ a problem then it's a big one because fixing it will be
>> intrusive, and might well involve userspace-visible changes.
>
> First off, just a reminder that this is inherently a netlink flow
> control issue...which was being exacerbated earlier by taskstats
> decision to send per-tgid data (no longer the case).
>
> But I'd like to know what's our target here ? How many messages per
> second do we want to be able to be sent and received without risking any
> loss of data ? Netlink will lose messages at a high enough rate so the
> design point will need to be known a bit.
>
> For statistics type usage of the genetlink/netlink, I would have thought
> that userspace, provided it is reliably informed about the loss of data
> through ENOBUFS, could take measures to just account for the missing
> data and carry on ?

Could be so. But we need to understand how significant the impact of this
will be in practice.

We could find, once this is deployed in real production environments on
large machines, that the data loss is sufficiently common and sufficiently
serious that the feature needs a lot of rework.

Now there's always a risk of that sort of thing happening with all
features, but it's usually not this evident so early in the development
process. We need to get a better understanding of the risk before
proceeding too far.

And there's always a 100% reliable fix for this: throttling. Make the
sender of the messages block until the consumer can catch up. In some
situations, that is what people will want to be able to do. I suspect a
good implementation would be to run a collection daemon on each CPU and
make the delivery be cpu-local. That's sounding more like relayfs than
netlink.
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh wrote:
> First off, just a reminder that this is inherently a netlink flow
> control issue...which was being exacerbated earlier by taskstats
> decision to send per-tgid data (no longer the case).
>
> But I'd like to know what's our target here ? How many messages per
> second do we want to be able to be sent and received without risking
> any loss of data ? Netlink will lose messages at a high enough rate so
> the design point will need to be known a bit.

Perhaps it's not so much an issue of the design rate, as an issue of how
we deal with hitting the limit. Sooner or later, perhaps due to operator
error, almost any implementable rate will be exceeded.

Ideally, we would want both of the remedies that Andrew mentioned,
rephrasing:
 1) a way for a customer who needs a higher rate to scale the useful
    resources he can apply to the collection, and
 2) a clear indicator when the supported rate was exceeded anyway.

> For statistics type usage of the genetlink/netlink, I would have
> thought that userspace, provided it is reliably informed about the loss
> of data through ENOBUFS, could take measures to just account for the
> missing data and carry on ?

If that's so, then the ENOBUFS error may well meet my remedy (2) above,
leaving just the question of how a customer could scale to higher rates,
if they found it was worth doing so.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew Morton wrote:
> On Thu, 29 Jun 2006 09:44:08 -0700 Paul Jackson <[EMAIL PROTECTED]> wrote:
>
>>> You're probably correct on that model. However, it all depends on the
>>> actual workload. Are people who actually have large-CPU (>256) systems
>>> actually running fork()-heavy things like webservers on them, or are
>>> they running things like database servers and computations, which tend
>>> to have persistent processes?
>>
>> It may well be mostly as you say - the large-CPU systems not running
>> the fork() heavy jobs. Sooner or later, someone will want to run a
>> fork()-heavy job on a large-CPU system. On a 1024 CPU system, it would
>> apparently take just 14 exits/sec/CPU to hit this bottleneck, if Jay's
>> number of 14000 applied.
>>
>> Chris Sturtivant's reply is reasonable -- we'll hit it sooner or later,
>> and deal with it then.
>
> I agree, and I'm viewing this as blocking the taskstats merge. Because
> if this _is_ a problem then it's a big one because fixing it will be
> intrusive, and might well involve userspace-visible changes.

First off, just a reminder that this is inherently a netlink flow control
issue...which was being exacerbated earlier by taskstats decision to send
per-tgid data (no longer the case).

But I'd like to know what's our target here ? How many messages per second
do we want to be able to be sent and received without risking any loss of
data ? Netlink will lose messages at a high enough rate so the design
point will need to be known a bit.

For statistics type usage of the genetlink/netlink, I would have thought
that userspace, provided it is reliably informed about the loss of data
through ENOBUFS, could take measures to just account for the missing data
and carry on ?

> The only ways I can see of fixing the problem generally are to either
>
> a) throw more CPU(s) at stats collection: allow userspace to register
>    for "stats generated by CPU N", then run a stats collection daemon
>    on each CPU or
>
> b) make the kernel recognise when it's getting overloaded and switch to
>    some degraded mode where it stops trying to send all the data to
>    userspace - just send a summary, or a "we goofed" message or
>    something.

One of the unused features of genetlink that's meant for high volume data
output from the kernel is the "dump" callback of a genetlink connection.
Essentially kernel space keeps getting provided sk_buffs to fill which the
netlink layer then supplies to user space (over time I guess ?)

But whatever we do, there's going to be some limit so it's useful to
decide what the design point should be ?

Adding Jamal for his thoughts on netlink's flow control in the context of
genetlink.

--Shailabh