Re: [DOC]: generic netlink

2006-06-19 Thread Shailabh Nagar
jamal wrote:
> On Mon, 2006-19-06 at 11:13 -0400, James Morris wrote:
> 
> 
>>It seems that TIPC is multiplexing all of its commands through
>>TIPC_GENL_CMD.
> 
> 
> 
> TIPC is a deviation; they had the 100 ioctls and therefore did a direct
> one-to-one mapping.
> 
> 
>>I wonder, if this is how other protocols are likely to utilize genl, then 
>>we could possibly drop the command registration code completely and one 
>>command op can be registered by the protocol during 
>>genl_register_family().
>>
> 
> 
> The intent is to have a handful of commands as in classical netlink
> (e.g. route or qdisc) where you are controlling data that sits in the
> kernel; i.e. when you have an attribute or a vector of attributes, the
> commands will have the semantics ADD/DEL/GET/DUMP only.
> Other than TIPC, the two other users I have seen use it in this manner.
> But you are right: if usage tends to lean some other way we could get
> rid of it (I think TIPC is a bad example).

The taskstats interface, currently in -mm, is one user of genetlink
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17-rc6/2.6.17-rc6-mm2/broken-out/per-task-delay-accounting-taskstats-interface.patch

Based on Jamal's suggestions, we found the "limited" set of commands model
useful and ended up having to register just one GET command. In subsequent
discussions, a SET command also looked handy.

But I'm not too clear on what the advantages are of limiting the number of
commands registered by a given exploiter of genetlink (say TIPC or
taskstats), other than staying with the conventional usage of netlink.

E.g., in the taskstats code, userspace needs to GET data on a per-pid and
per-tgid basis from the kernel and supplies the specific pid or tgid. We
could have registered two commands (say GET_PID and GET_TGID), making the
parsing of the supplied uint32 implicit in the command. Instead, we went
with the model where we have only one GET command and the type of the
parameter is specified via netlink attributes.

In our case it didn't matter: since the type of data returned is very
similar, and so is the parameter supplied (pid/tgid), one GET suffices. But
I'm wondering whether userspace should consciously try to limit the number
of commands, or whether it would be better, from a performance standpoint,
to permit a reasonably larger "fan-out" at the genetlink command level (for
each exploiter). I guess the latter introduces more overhead for in-kernel
structures (the linked list of command structures that needs to be kept
around) while saving time on a second level of parsing within the
exploiter-defined function that services the GET command.

The "small" set model looks like a good compromise. Reducing the number of
commands to one is not a good idea IMHO, for reasons similar to why
ioctl-type syscalls aren't encouraged... since the genetlink layer has
code for demultiplexing anyway, we might as well use it and avoid an extra
level of indirection.
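To make the trade-off concrete, here is a minimal user-space sketch of the second level of parsing that a single GET command implies: one handler, dispatching on an attribute type instead of the command number. The identifiers here are hypothetical, not the actual taskstats ones.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical attribute types: with a single registered GET command,
 * the pid-vs-tgid choice travels as a netlink attribute, and the one
 * handler does a second-level dispatch on the attribute type. */
enum { ATTR_PID = 1, ATTR_TGID = 2 };

const char *get_stats(int attr_type, uint32_t id, char *buf, size_t len)
{
	switch (attr_type) {
	case ATTR_PID:
		snprintf(buf, len, "stats for pid %u", (unsigned)id);
		return buf;
	case ATTR_TGID:
		snprintf(buf, len, "stats for tgid %u", (unsigned)id);
		return buf;
	default:
		return NULL;	/* unknown attribute: reject the request */
	}
}
```

The alternative "fan-out" design would register two commands and let the genetlink layer's own demultiplexer pick the handler, removing this switch entirely.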

--Shailabh


>>This would both simplify the genl code and API, and help ensure 
>>consistency of users.
>>
> 
> 
> You are talking from an SELinux perspective i take it?
> My view: If you want to have ACLs against such commands
> then it becomes easier to say "can only do ADD but not DEL" for example
> (We need to resolve genl_rcv_msg() check on commands to be in sync with
> SELinux as was pointed by Thomas)
> 
> cheers,
> jamal
> 

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [DOC]: generic netlink

2006-06-19 Thread Shailabh Nagar
jamal wrote:
> Folks,
> 
> Attached is a document that should help people wishing to use the generic
> netlink interface. It is a WIP, so a lot more to go if I see interest.
> The doc has been around for a while; I spent part of yesterday and this
> morning cleaning it up. If you have sent me comments before, please
> forgive me for having misplaced them - just send again.

Jamal,

Completing the documentation on generic netlink usage will definitely be
useful. I'd be happy to help out with this since I've recently gone through
trying to understand and use genetlink for the taskstats interface.
Hopefully this will help other users like me who aren't netlink experts to
begin with!

I've sent you a patch to the document that attempts to cover the following
TODOs (didn't see any point sending it to the whole list since it's harder
to read patches to documentation). Please use as you see fit.

> TODO:
> a) Add a more complete compiling kernel module with events.
> Have Thomas put his Mashimaro example and point to it.
(not the Mashimaro example, nor a completely compilable module, but
snippets of pseudo-code taken from the user-space program used in taskstats
development, modified to fit the foobar example you've used)
> b) Describe some details on how user space -> kernel works
> probably using libnl??
> c) Describe discovery using the controller..

I'll provide another patch that will cover d) and e) in the set below, again
in the context of the foobar example, which might need to be modified a bit.

> d) talk about policies etc
> e) talk about how something coming from user space eventually
> gets to you.
> f) Talk about the TLV manipulation stuff from Thomas.
> g) submit controller patch to iproute2

One point... do d), f) etc. belong in a separate doc describing usage of
netlink attributes? They're useful here too, but perhaps not directly
related to genetlink.

> PS:- I dont have a good place to put this doc and point to, hence the
> 17K attachment
>

http://www.kernel.org/pub/linux/kernel/people/hadi/ ?

(unless your permissions have been revoked for lack of use! :-)

Having the current document up there will be useful for seeing which edits
have been accepted, so we can work on that instead of the original.

--Shailabh


Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-29 Thread Shailabh Nagar

Andrew Morton wrote:


On Thu, 29 Jun 2006 09:44:08 -0700
Paul Jackson <[EMAIL PROTECTED]> wrote:

 


You're probably correct on that model. However, it all depends on the actual
workload. Are people who actually have large-CPU (>256) systems actually
running fork()-heavy things like webservers on them, or are they running things
like database servers and computations, which tend to have persistent
processes?
 


It may well be mostly as you say - the large-CPU systems not running
the fork() heavy jobs.

Sooner or later, someone will want to run a fork()-heavy job on a
large-CPU system.  On a 1024 CPU system, it would apparently take
just 14 exits/sec/CPU to hit this bottleneck, if Jay's number of
14000 applied.

Chris Sturdivant's reply is reasonable -- we'll hit it sooner or later,
and deal with it then.

   



I agree, and I'm viewing this as blocking the taskstats merge.  Because if
this _is_ a problem then it's a big one because fixing it will be
intrusive, and might well involve userspace-visible changes.
 

First off, just a reminder that this is inherently a netlink flow control
issue... which was being exacerbated earlier by taskstats' decision to send
per-tgid data (no longer the case).

But I'd like to know what our target is here. How many messages per second
do we want to be able to send and receive without risking any loss of
data? Netlink will lose messages at a high enough rate, so the design
point needs to be known a bit.

For statistics-type usage of genetlink/netlink, I would have thought that
userspace, provided it is reliably informed about the loss of data through
ENOBUFS, could take measures to just account for the missing data and
carry on?
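A rough sketch of what that userspace policy could look like, under the assumption that the kernel reliably signals drops with ENOBUFS on the netlink socket (the function and action names are invented for illustration):

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Classify the result of a recv() on the stats socket so the listener
 * can account for missing data and carry on, instead of treating a
 * kernel-side drop as fatal. */
enum rx_action { RX_PROCESS, RX_RETRY, RX_ACCOUNT_LOSS, RX_FATAL };

enum rx_action classify_recv(ssize_t n, int err)
{
	if (n >= 0)
		return RX_PROCESS;	/* a message arrived: parse it */
	switch (err) {
	case EINTR:
	case EAGAIN:
		return RX_RETRY;	/* transient: try again */
	case ENOBUFS:
		return RX_ACCOUNT_LOSS;	/* kernel dropped messages: note
					 * the gap in the stats, continue */
	default:
		return RX_FATAL;	/* genuine socket error */
	}
}
```

The listener's main loop would call this on every `recvmsg()` result and bump a "records lost" counter on RX_ACCOUNT_LOSS.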






The only ways I can see of fixing the problem generally are to either

a) throw more CPU(s) at stats collection: allow userspace to register for
  "stats generated by CPU N", then run a stats collection daemon on each
  CPU or
 


b) make the kernel recognise when it's getting overloaded and switch to
  some degraded mode where it stops trying to send all the data to
  userspace - just send a summary, or a "we goofed" message or something.
 

One of the unused features of genetlink that's meant for high-volume data
output from the kernel is the "dump" callback of a genetlink connection.
Essentially, kernel space keeps getting provided sk_buffs to fill, which
the netlink layer then supplies to user space (over time, I guess?).

But whatever we do, there's going to be some limit, so it's useful to
decide what the design point should be.
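The dump contract described above can be modelled with a toy simulation (assumed semantics reconstructed from the discussion, not the real genetlink API): the core repeatedly hands the producer a fixed-size buffer standing in for an sk_buff, and the producer keeps its position across calls, as the real callback would in its callback state.

```c
#include <assert.h>

#define NRECS   10	/* pending exit records to dump */
#define PER_BUF 3	/* records that fit in one "skb" */

struct dump_state { int next; };	/* plays the role of cb->args[] */

/* Fill one buffer's worth of records; return how many were written.
 * Returning 0 signals that the dump is complete. */
int dump_records(struct dump_state *st, int *buf)
{
	int n = 0;
	while (st->next < NRECS && n < PER_BUF)
		buf[n++] = st->next++;	/* copy out the next record ids */
	return n;
}
```

The point of the pattern is that the producer never allocates unbounded memory up front; the consumer's pace determines how fast the buffers are requested and drained.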


Adding Jamal for his thoughts on netlink's flow control in the context 
of genetlink.


--Shailabh


Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-29 Thread Shailabh Nagar

Andrew Morton wrote:


On Thu, 29 Jun 2006 15:10:31 -0400
Shailabh Nagar <[EMAIL PROTECTED]> wrote:

 


I agree, and I'm viewing this as blocking the taskstats merge.  Because if
this _is_ a problem then it's a big one because fixing it will be
intrusive, and might well involve userspace-visible changes.


 

First off, just a reminder that this is inherently a netlink flow control
issue... which was being exacerbated earlier by taskstats' decision to send
per-tgid data (no longer the case).

But I'd like to know what our target is here. How many messages per second
do we want to be able to send and receive without risking any loss of
data? Netlink will lose messages at a high enough rate, so the design
point needs to be known a bit.

For statistics-type usage of genetlink/netlink, I would have thought that
userspace, provided it is reliably informed about the loss of data through
ENOBUFS, could take measures to just account for the missing data and
carry on?
   



Could be so.  But we need to understand how significant the impact of this
will be in practice.

We could find, once this is deployed in real production environments on
large machines, that the data loss is sufficiently common and sufficiently
serious that the feature needs a lot of rework.

Now there's always a risk of that sort of thing happening with all
features, but it's usually not this evident so early in the development
process.  We need to get a better understanding of the risk before
proceeding too far.
 




And there's always a 100% reliable fix for this: throttling.  Make the
sender of the messages block until the consumer can catch up. 


Is blocking exits an option?


In some
situations, that is what people will want to be able to do.  I suspect a
good implementation would be to run a collection daemon on each CPU and
make the delivery be cpu-local.  That's sounding more like relayfs than
netlink.
 

Yup... the per-cpu, high-speed requirements are up relayfs' alley, unless
Jamal or the netlink folks are planning something (or can shed light on how
large flows can be managed over netlink). I suspect this discussion has
happened before :-)




Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-29 Thread Shailabh Nagar

Andrew Morton wrote:


On Thu, 29 Jun 2006 15:10:31 -0400
Shailabh Nagar <[EMAIL PROTECTED]> wrote:

 


I agree, and I'm viewing this as blocking the taskstats merge.  Because if
this _is_ a problem then it's a big one because fixing it will be
intrusive, and might well involve userspace-visible changes.


 

First off, just a reminder that this is inherently a netlink flow control
issue... which was being exacerbated earlier by taskstats' decision to send
per-tgid data (no longer the case).

But I'd like to know what our target is here. How many messages per second
do we want to be able to send and receive without risking any loss of
data? Netlink will lose messages at a high enough rate, so the design
point needs to be known a bit.

For statistics-type usage of genetlink/netlink, I would have thought that
userspace, provided it is reliably informed about the loss of data through
ENOBUFS, could take measures to just account for the missing data and
carry on?
   



Could be so.  But we need to understand how significant the impact of this
will be in practice.

We could find, once this is deployed in real production environments on
large machines, that the data loss is sufficiently common and sufficiently
serious that the feature needs a lot of rework.

Now there's always a risk of that sort of thing happening with all
features, but it's usually not this evident so early in the development
process.  We need to get a better understanding of the risk before
proceeding too far.
 


Ok.

I suppose we should first determine what number of tasks can be
forked/exited at a sustained rate on these machines; that would be one
upper bound.

Paul, Chris, Jay:
What total exit rate would be a good upper bound? How much memory do these
1024-CPU machines have (in high-end configurations, not just based on
64-bit addressability), and how many tasks can actually be forked/exited
on such a machine?


And there's always a 100% reliable fix for this: throttling.  Make the
sender of the messages block until the consumer can catch up.  In some
situations, that is what people will want to be able to do.  

Is this really an option for taskstats? Allowing exits to get throttled? I
suppose it's one way, but it seems like overkill for something like stats.


I suspect a
good implementation would be to run a collection daemon on each CPU and
make the delivery be cpu-local.  That's sounding more like relayfs than
netlink.
 


Yup... per-cpu, high-speed delivery is looking like relayfs alright.

One option that we've not explored in detail is the "dump" functionality
of genetlink, which allows kernel space to keep getting called with skbs
to fill until it's done. How much buffering that affords us in the face of
a slow user is not known. But if we're discussing large exit rates
happening in a burst, not in a sustained way, that may be one way out.

Jamal,
any thoughts on the flow-control capabilities of netlink that apply here?
Usage of the connection is to supply statistics data to userspace.

--Shailabh



Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-29 Thread Shailabh Nagar
Andrew Morton wrote:

>>Yup... the per-cpu, high-speed requirements are up relayfs' alley, unless
>>Jamal or the netlink folks are planning something (or can shed light on
>>how large flows can be managed over netlink). I suspect this discussion
>>has happened before :-)
> 
> 
> yeah.

And now I remember why I didn't go down that path earlier. Relayfs is
one-way kernel->user and lacks the ability to send query commands from
user space that we need. Either we would need to send commands up through
a separate interface (even a syscall) or try to ensure that the existing
genetlink interface can scale better with message volume (including
throttling).

--Shailabh





Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-29 Thread Shailabh Nagar

jamal wrote:


On Thu, 2006-29-06 at 16:01 -0400, Shailabh Nagar wrote:

 


Jamal,
any thoughts on the flow control capabilities of netlink that apply here 
? Usage of the connection is to supply statistics data to userspace.


   



if you want reliable delivery, then you can't just depend on async events
from the kernel -> user - which I am assuming is the way stats get
delivered as processes exit?


Yes.


Sorry, I don't remember the details. You need some synchronous scheme to
ask the kernel to do a "get" or "dump".


Oh, yes. Dump is synchronous. So it won't be useful unless we buffer task
exit records within taskstats.


Let's be clear about one thing:
The problem really has nothing to do with gen/netlink or any other scheme
you use ;->
It has everything to do with reliability implications and the fact that
you need to assume memory is a finite resource - at one point or another
you will run out of memory ;-> And of course then messages will be lost.
So for gen/netlink, just make sure you have a large socket buffer and you
would most likely be fine.
I haven't seen how the numbers were reached, but if you say you receive
14K exits/sec, each of which is a 50B message, I would think a 1M socket
buffer would be plenty.
 

The rates (or upper bounds) being discussed here, as of now, are 1000
exits/sec/CPU for 1024-CPU systems. That would be roughly 1M exits/sec per
system * 248 bytes/message = 248 MB/sec.
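For the record, the two back-of-envelope figures in this thread check out as follows (the 248 bytes/message is the assumed size of one taskstats record; the exact product for the large-system case is ~254 MB/sec, which the thread rounds to 248):

```c
#include <assert.h>

/* Sustained message bandwidth for a given exit rate and record size. */
long long bytes_per_sec(long long exits_per_sec, long long msg_bytes)
{
	return exits_per_sec * msg_bytes;
}
```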



You can find out about lack of memory in netlink when you get ENOBUFS. As
an example, you should then do a kernel query. Clearly if you do a query
of that sort, you may not want to find obsolete info. Therefore, as a
suggestion, you may want to keep sequence numbers of sorts as markers.
Perhaps keep a 32-bit field which monotonically increases per process
exit, or use the pid as the sequence number, etc.
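The sequence-number marker idea can be sketched as follows (a hypothetical helper, not anything in the patch): if each exit record carries a 32-bit counter that increases by one per process exit, the listener can compute how many records it missed between two messages it did receive, and unsigned arithmetic keeps this correct across wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* How many records were lost between two received sequence numbers.
 * Returns 0 when the records were consecutive. */
uint32_t lost_between(uint32_t last_seen, uint32_t now)
{
	return now - last_seen - 1;
}
```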

As for throttling - Shailabh, I think we talked about this:
- You could maintain info using some thresholds and timer. Then
when a timer expires or threshold is exceeded send to user space.
 

Hmm. So we could buffer the per-task exit data within taskstats (the
memory consumption would grow, but that's probably not a problem) and then
send it out later.

Jay, would not getting exit data soon after exit be a problem for CSA? I'm
guessing not, if the timeout is kept small enough. Internally, taskstats
could always pace its sends so that "too much" isn't sent out at one shot.

--Shailabh




Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-29 Thread Shailabh Nagar

Andrew Morton wrote:


Shailabh Nagar <[EMAIL PROTECTED]> wrote:
 

The rates (or upper bounds) being discussed here, as of now, are 1000
exits/sec/CPU for 1024-CPU systems. That would be roughly 1M exits/sec per
system * 248 bytes/message = 248 MB/sec.
   



I think it's worth differentiating between burst rates and sustained rates
here.

One could easily imagine 10,000 threads all exiting at once, and the user
being interested in reliably collecting the results.

But if the machine is _sustaining_ such a high rate then that means that
these exiting tasks all have a teeny runtime and the user isn't going to be
interested in the per-thread statistics.

So if we can detect the silly sustained-high-exit-rate scenario then it
seems to me quite legitimate to do some aggressive data reduction on that.
Like, a single message which says "20,000 sub-millisecond-runtime tasks
exited in the past second" or something.
 


The "buffering within taskstats" might be a way out then.
As long as the user is willing to pay the price in terms of memory, we can
collect the exiting task's taskstats data but not send it immediately
(taskstats_cache would grow) unless a high-water mark had been crossed.
Otherwise a timer event would do the sends of accumulated taskstats (not
all at once, but iteratively if necessary).

At task exit, despite doing a few rounds of sending of pending data, if
netlink were still reporting errors, it would be a sign of an
unsustainable rate; the pending queue could be dropped and a message like
you suggest could be sent.
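The buffering scheme above can be modelled with a toy simulation (names and the threshold are invented, not from the patch): exit records accumulate in a pending queue, and a flush happens either when the high-water mark is crossed (bursts) or when a timer fires (slow trickle).

```c
#include <assert.h>

#define HIGH_WATER 4	/* made-up threshold; would be tunable */

struct pending { int queued; int flushes; };

/* Called on each task exit: queue the record, flush on high water. */
void record_exit(struct pending *p)
{
	if (++p->queued >= HIGH_WATER) {
		p->flushes++;		/* burst: send the batch now */
		p->queued = 0;
	}
}

/* Called on timer expiry: drain whatever trickled in. */
void timer_fire(struct pending *p)
{
	if (p->queued > 0) {
		p->flushes++;
		p->queued = 0;
	}
}
```

This is the same threshold-plus-timer shape Jamal later compares to what xfrm events do.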

Thoughts ?


--Shailabh



Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-29 Thread Shailabh Nagar

jamal wrote:


On Thu, 2006-29-06 at 21:11 -0400, Shailabh Nagar wrote:
 


Andrew Morton wrote:

   


Shailabh Nagar <[EMAIL PROTECTED]> wrote:
 


[..]
 


So if we can detect the silly sustained-high-exit-rate scenario then it
seems to me quite legitimate to do some aggressive data reduction on that.
Like, a single message which says "20,000 sub-millisecond-runtime tasks
exited in the past second" or something.


 


The "buffering within taskstats" might be a way out then.
   



Thats what it looks like.

 


As long as the user is willing to pay the price in terms of memory,
   



You may wanna draw a line to the upper limit - maybe even allocate slab
space.
 


Didn't quite understand... could you please elaborate?
Today we have a slab cache from which the taskstats structure gets
allocated at the beginning of the exit() path.
Is the upper limit to which you refer the amount of slab memory the user
is willing to have used to store the bursty traffic?


we can collect the exiting task's taskstats data but not send it
immediately (taskstats_cache would grow) unless a high-water mark had been
crossed. Otherwise a timer event would do the sends of accumulated
taskstats (not all at once, but iteratively if necessary).

   



Sounds reasonable. Thats what xfrm events do. 




Try to have those
parameters settable because different machines or users may have
different view as to what is proper - maybe even as simple as sysctl.
 


Sounds good.

 

At task exit, despite doing a few rounds of sending of pending data, if
netlink were still reporting errors, it would be a sign of an
unsustainable rate; the pending queue could be dropped and a message like
you suggest could be sent.


   



When you send inside the kernel - you will get an error if there's
problems sending to the socket queue. So you may wanna use that info
to release the kernel allocated entries or keep them for a little
longer.

Hopefully that helps.
 


Yes it does. Thanks for the tips.

Will code up something and send out so this can become more concrete.

--Shailabh


cheers,
jamal


 





Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-30 Thread Shailabh Nagar
Andrew Morton wrote:
> On Thu, 29 Jun 2006 09:44:08 -0700
> Paul Jackson <[EMAIL PROTECTED]> wrote:
>
>
>>>You're probably correct on that model. However, it all depends on the actual
>>>workload. Are people who actually have large-CPU (>256) systems actually
>>>running fork()-heavy things like webservers on them, or are they running 
>>>things
>>>like database servers and computations, which tend to have persistent
>>>processes?
>>
>>It may well be mostly as you say - the large-CPU systems not running
>>the fork() heavy jobs.
>>
>>Sooner or later, someone will want to run a fork()-heavy job on a
>>large-CPU system.  On a 1024 CPU system, it would apparently take
>>just 14 exits/sec/CPU to hit this bottleneck, if Jay's number of
>>14000 applied.
>>
>>Chris Sturdivant's reply is reasonable -- we'll hit it sooner or later,
>>and deal with it then.
>>
>
>
> I agree, and I'm viewing this as blocking the taskstats merge.  Because if
> this _is_ a problem then it's a big one because fixing it will be
> intrusive, and might well involve userspace-visible changes.
>
> The only ways I can see of fixing the problem generally are to either
>
> a) throw more CPU(s) at stats collection: allow userspace to register for
>"stats generated by CPU N", then run a stats collection daemon on each
>CPU or
>
> b) make the kernel recognise when it's getting overloaded and switch to
>some degraded mode where it stops trying to send all the data to
>userspace - just send a summary, or a "we goofed" message or something.



Andrew,

Based on previous discussions, the above solutions can be expanded/modified to:

a) allow userspace to listen to a group of cpus instead of all. Multiple
collection daemons can distribute the load, as you pointed out. Doing
collection by cpu groups rather than individual cpus reduces the
aggregation burden on userspace (and scales better with NR_CPUS).

b) do flow control on the kernel send side. This can involve buffering and
sending later (to handle the bursty case) or dropping (to handle sustained
load), as pointed out by you and Jamal in other threads.

c) increase the receiver's socket buffer. This can and should always be
done, but needs no kernel involvement.


With regard to taskstats changes to handle the problem, and their impact
on userspace-visible changes:

a) will change userspace,
b) will be transparent,
c) is immaterial going forward (except perhaps as a change in Documentation).


I'm sending a patch that demonstrates how a) can be done quite simply, and
a patch for b) is in progress.

If the approach suggested in patch a) is acceptable (and I'll provide the
testing and stability results once comments on it are largely over), could
taskstats acceptance in 2.6.18 go ahead and patch b) be added later? (A
solution outline has already been provided and a preliminary patch should
be out by end of day.)

--Shailabh





Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-30 Thread Shailabh Nagar
Shailabh Nagar wrote:

> Andrew,
>
> Based on previous discussions, the above solutions can be expanded/modified 
> to:
>
> a) allow userspace to listen to a group of cpus instead of all. Multiple
> collection daemons can distribute the load as you pointed out. Doing 
> collection
> by cpu groups rather than individual cpus reduces the aggregation burden on
> userspace (and scales better with NR_CPUS)


> I'm sending a patch that demonstrates how a) can be done quite simply
> and a patch for b) is in progress.
>

Here's the patch.
Testing etc. needs to be done (an earlier version that did per-cpu queues
has worked), but the main point is to show how small a change is needed in
the interface (on both the kernel and user sides) and the current codebase
to achieve the a) solution.

Also to get feedback on this kind of usage of the nl_pid field, the
approach, etc.

Thanks,
Shailabh






===


On systems with a large number of cpus, with even a modest rate of tasks
exiting per cpu, the volume of taskstats data sent on thread exit can overflow
a userspace listener's buffers.

One approach to avoiding overflow is to allow listeners to get data for a
limited number of cpus. By scaling the number of listening programs, each
listening to a different set of cpus, userspace can avoid more overflow
situations.

This patch implements this idea by creating simple grouping of cpus and
allowing userspace to listen to any cpu group it chooses.

Alternative designs considered and rejected were:

- creating a separate netlink group for each group of cpus. Since only 32
netlink groups can be specified by a user, this option will not scale with
number of cpus.

- aligning the grouping of cpus with cpusets. The unnecessary tying together
of the two functionalities was not merited.

Thanks to Balbir Singh for discovering the potential use of the pid field of
sockaddr_nl as a communication subchannel in the same socket, Paul Jackson and
Vivek Kashyap for suggesting cpus be grouped together for data send purposes.

Signed-Off-By: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-Off-By: Balbir Singh <[EMAIL PROTECTED]>

 Documentation/accounting/getdelays.c |   30 +++---
 include/linux/taskstats.h|   22 ++
 kernel/taskstats.c   |5 +++--
 3 files changed, 44 insertions(+), 13 deletions(-)

Index: linux-2.6.17-mm3equiv/include/linux/taskstats.h
===
--- linux-2.6.17-mm3equiv.orig/include/linux/taskstats.h	2006-06-30 11:57:14.0 -0400
+++ linux-2.6.17-mm3equiv/include/linux/taskstats.h	2006-06-30 14:24:49.0 -0400
@@ -89,6 +89,28 @@ struct taskstats {

 #define TASKSTATS_LISTEN_GROUP 0x1

+
+/*
+ * Per-task exit data sent from the kernel to user space
+ * is tagged by an id based on grouping of cpus.
+ *
+ * If userspace specifies a non-zero P as the nl_pid field of
+ * the sockaddr_nl structure while binding to a netlink socket,
+ * it will receive exit data from threads that exited on cpus in the range
+ *
+ *[(P-1)*Y, P*Y-1]
+ *
+ *  where Y = TASKSTATS_CPUS_PER_SET
+ *  i.e. if TASKSTATS_CPUS_PER_SET is 16,
+ *  to listen to data from cpus 0..15, specify P=1
+ *  for cpus 16..32, specify P=2 etc.
+ *
+ *  To listen to data from all cpus, userspace should use P=0
+ */
+
+#define TASKSTATS_CPUS_PER_SET 16
+
+
 /*
  * Commands sent from userspace
  * Not versioned. New commands should only be inserted at the enum's end
Index: linux-2.6.17-mm3equiv/kernel/taskstats.c
===
--- linux-2.6.17-mm3equiv.orig/kernel/taskstats.c	2006-06-30 11:57:14.0 -0400
+++ linux-2.6.17-mm3equiv/kernel/taskstats.c	2006-06-30 13:58:36.0 -0400
@@ -266,7 +266,7 @@ void taskstats_exit_send(struct task_str
struct sk_buff *rep_skb;
void *reply;
size_t size;
-   int is_thread_group;
+   int is_thread_group, setid;
struct nlattr *na;

if (!family_registered || !tidstats)
@@ -320,7 +320,8 @@ void taskstats_exit_send(struct task_str
nla_nest_end(rep_skb, na);

 send:
-   send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+   setid = (smp_processor_id()%TASKSTATS_CPUS_PER_SET)+1;
+   send_reply(rep_skb, setid, TASKSTATS_MSG_MULTICAST);
return;

 nla_put_failure:
Index: linux-2.6.17-mm3equiv/Documentation/accounting/getdelays.c
===
--- linux-2.6.17-mm3equiv.orig/Documentation/accounting/getdelays.c	2006-06-28 16:08:56.0 -0400
+++ linux-2.6.17-mm3equiv/Documentation/accounting/getdelays.c	2006-06-30 14:09:28.0 -0400
@@ -40,7 +40,7 @@ int done = 0;
 /*
  * Create a raw netlink socket and bind
  */
-static int create_nl_socket(i

Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-30 Thread Shailabh Nagar
Shailabh Nagar wrote:
> Shailabh Nagar wrote:
> 
> 


> Index: linux-2.6.17-mm3equiv/kernel/taskstats.c
> ===
> --- linux-2.6.17-mm3equiv.orig/kernel/taskstats.c	2006-06-30 11:57:14.0 -0400
> +++ linux-2.6.17-mm3equiv/kernel/taskstats.c	2006-06-30 13:58:36.0 -0400
> @@ -266,7 +266,7 @@ void taskstats_exit_send(struct task_str
>   struct sk_buff *rep_skb;
>   void *reply;
>   size_t size;
> - int is_thread_group;
> + int is_thread_group, setid;
>   struct nlattr *na;
> 
>   if (!family_registered || !tidstats)
> @@ -320,7 +320,8 @@ void taskstats_exit_send(struct task_str
>   nla_nest_end(rep_skb, na);
> 
>  send:
> - send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> + setid = (smp_processor_id()%TASKSTATS_CPUS_PER_SET)+1;
> + send_reply(rep_skb, setid, TASKSTATS_MSG_MULTICAST);

This should be
send_reply(rep_skb, setid, TASKSTATS_MSG_UNICAST);

>   return;
> 
>  nla_put_failure:
> Index: linux-2.6.17-mm3equiv/Documentation/accounting/getdelays.c
> ===
> --- linux-2.6.17-mm3equiv.orig/Documentation/accounting/getdelays.c	2006-06-28 16:08:56.0 -0400
> +++ linux-2.6.17-mm3equiv/Documentation/accounting/getdelays.c	2006-06-30 14:09:28.0 -0400
> @@ -40,7 +40,7 @@ int done = 0;
>  /*
>   * Create a raw netlink socket and bind
>   */
> -static int create_nl_socket(int protocol, int groups)
> +static int create_nl_socket(int protocol, int cpugroup)
>  {
>  socklen_t addr_len;
>  int fd;
> @@ -52,7 +52,8 @@ static int create_nl_socket(int protocol
> 
>  memset(&local, 0, sizeof(local));
>  local.nl_family = AF_NETLINK;
> -local.nl_groups = groups;
> +local.nl_groups = TASKSTATS_LISTEN_GROUP;
> +local.nl_pid = cpugroup;
> 
>  if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0)
>   goto error;
> @@ -203,7 +204,7 @@ int main(int argc, char *argv[])
>  pid_t rtid = 0;
>  int cmd_type = TASKSTATS_TYPE_TGID;
>  int c, status;
> -int forking = 0;
> +int forking = 0, cpugroup = 0;
>  struct sigaction act = {
>   .sa_handler = SIG_IGN,
>   .sa_mask = SA_NOMASK,
> @@ -222,7 +223,7 @@ int main(int argc, char *argv[])
> 
>  while (1) {
> 
> - c = getopt(argc, argv, "t:p:c:");
> + c = getopt(argc, argv, "t:p:c:g:l");
>   if (c < 0)
>   break;
> 
> @@ -252,8 +253,14 @@ int main(int argc, char *argv[])
>   }
>   forking = 1;
>   break;
> + case 'g':
> + cpugroup = atoi(optarg);
> + break;
> + case 'l':
> + loop = 1;
> + break;
>   default:
> - printf("usage %s [-t tgid][-p pid][-c cmd]\n", argv[0]);
> + printf("usage %s [-t tgid][-p pid][-c cmd][-g cpugroup][-l]\n", 
> argv[0]);
>   exit(-1);
>   break;
>   }
> @@ -266,7 +273,7 @@ int main(int argc, char *argv[])
>  /* Send Netlink request message & get reply */
> 
>  if ((nl_sd =
> -  create_nl_socket(NETLINK_GENERIC, TASKSTATS_LISTEN_GROUP)) < 0)
> +  create_nl_socket(NETLINK_GENERIC, cpugroup)) < 0)
>   err(1, "error creating Netlink socket\n");
> 
> 
> @@ -287,10 +294,10 @@ int main(int argc, char *argv[])
> 
> 
>  if (!forking && sendto_fd(nl_sd, (char *) &req, req.n.nlmsg_len) < 0)
> +if ((!forking && !loop) &&
> + sendto_fd(nl_sd, (char *) &req, req.n.nlmsg_len) < 0)
>   err(1, "error sending message via Netlink\n");
> 
> -act.sa_handler = SIG_IGN;
> -sigemptyset(&act.sa_mask);
>  if (sigaction(SIGINT, &act, NULL) < 0)
>   err(1, "sigaction failed for SIGINT\n");
> 
> @@ -349,10 +356,11 @@ int main(int argc, char *argv[])
>   rtid = *(int *) NLA_DATA(na);
>   break;
>   case TASKSTATS_TYPE_STATS:
> - if (rtid == tid) {
> + if (rtid == tid || loop) {
>   print_taskstats((struct taskstats *)
>   NLA_DATA(na));
> - done = 1;
> + if (!loop)
> + done = 1;
>   }
>   break;
>   }
> @@ -369,7 +377,7 @@ int main(int argc, char *argv[])
>   if (done)
>   break;
>  }
> -while (1);
> +while (loop);
> 
>  close(nl_sd);
>  return 0;
> 

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-30 Thread Shailabh Nagar

Andrew Morton wrote:


Shailabh Nagar <[EMAIL PROTECTED]> wrote:
 


+/*
+ * Per-task exit data sent from the kernel to user space
+ * is tagged by an id based on grouping of cpus.
+ *
+ * If userspace specifies a non-zero P as the nl_pid field of
+ * the sockaddr_nl structure while binding to a netlink socket,
+ * it will receive exit data from threads that exited on cpus in the range
+ *
+ *[(P-1)*Y, P*Y-1]
+ *
+ *  where Y = TASKSTATS_CPUS_PER_SET
+ *  i.e. if TASKSTATS_CPUS_PER_SET is 16,
+ *  to listen to data from cpus 0..15, specify P=1
+ *  for cpus 16..31, specify P=2 etc.
+ *
+ *  To listen to data from all cpus, userspace should use P=0
+ */
+
+#define TASKSTATS_CPUS_PER_SET 16
   



The constant is unpleasant.
 

I was planning to make it configurable. But that would still not be as 
flexible as below...



If we're going to abuse nl_pid then how about we design things so that
nl_pid is treated as two 16-bit words - one word is the start CPU and the
other word is the end cpu?

Or, if a 65536-CPU limit is too scary, make the bottom 8 bits of nl_pid be
the number of CPUS (ie: TASKSTATS_CPUS_PER_SET) and the top 24 bits is the
starting CPU.  




It'd be better to use a cpumask, of course..
 

All these options mean each listener gets to pick a "custom" range of 
cpus to listen on, 
rather than choose one of pre-defined ranges (even if the pre-defined 
ranges can change
by a configurable TASKSTATS_CPUS_PER_SET). Which means the kernel side 
has to
figure out which of the listeners' cpu ranges includes the currently 
exiting task's cpu. To do
this, we'll need a callback from the binding of the netlink socket (so 
taskstats can maintain

the cpu -> nl_pid mappings at any exit).
The current genetlink interface doesn't have that kind of flexibility 
(though it can be added

I'm sure).

Seems a bit involved if the primary aim is to restrict the number of 
cpus that one listener

wants to listen to, rather than being able to pick which ones.

A configurable range won't suffice ?

--Shailabh


Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-06-30 Thread Shailabh Nagar

Andrew Morton wrote:


On Fri, 30 Jun 2006 22:20:23 -0400
Shailabh Nagar <[EMAIL PROTECTED]> wrote:

 


If we're going to abuse nl_pid then how about we design things so that
nl_pid is treated as two 16-bit words - one word is the start CPU and the
other word is the end cpu?

Or, if a 65536-CPU limit is too scary, make the bottom 8 bits of nl_pid be
the number of CPUS (ie: TASKSTATS_CPUS_PER_SET) and the top 24 bits is the
starting CPU.  




It'd be better to use a cpumask, of course..


 

All these options mean each listener gets to pick a "custom" range of 
cpus to listen on, 
rather than choose one of pre-defined ranges (even if the pre-defined 
ranges can change
by a configurable TASKSTATS_CPUS_PER_SET). Which means the kernel side 
has to
figure out which of the listeners' cpu ranges includes the currently 
exiting task's cpu. To do
this, we'll need a callback from the binding of the netlink socket (so 
taskstats can maintain

the cpu -> nl_pid mappings at any exit).
The current genetlink interface doesn't have that kind of flexibility 
(though it can be added

I'm sure).

Seems a bit involved if the primary aim is to restrict the number of 
cpus that one listener

wants to listen to, rather than being able to pick which ones.

A configurable range won't suffice ?

   



Set aside the implementation details and ask "what is a good design"?

A kernel-wide constant, whether determined at build-time or by a /proc poke
isn't a nice design.

Can we permit userspace to send in a netlink message describing a cpumask? 
That's back-compatible.
 

Yes, that should be doable. And passing in a cpumask is much better 
since we no longer

have to maintain mappings.

So the strawman is:
Listener bind()s to genetlink using its real pid.
Sends a separate "registration" message with cpumask to listen to. 
Kernel stores (real) pid and cpumask.
During task exit, kernel goes through each registered listener (small 
list) and decides which
one needs to get this exit data and calls a genetlink_unicast to each 
one that does need it.


If number of listeners is small, the lookups should be swift enough. If 
it grows large, we
can consider a fancier lookup (but there I go again, delving into 
implementation too early :-)



Sounds good to me !

--Shailabh









Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-03 Thread Shailabh Nagar

Paul Jackson wrote:


Shailabh wrote:
 

Sends a separate "registration" message with cpumask to listen to. 
Kernel stores (real) pid and cpumask.
   



Question:
=

Ah - good.

So this means that I could configure a system with a fork/exit
intensive, performance critical job on some dedicated CPUs, and be able
to collect taskstat data from tasks exiting on the -other- CPUS, while
avoiding collecting data from this special job, thus avoiding any
taskstat collection performance impact on said job.

If I'm understanding this correctly, excellent.
 

Yes. If no one registers to listen on a particular CPU, data from tasks 
exiting on that cpu is

not sent out at all.


Caveat:
===

Passing cpumasks across the kernel-user boundary can be tricky.

Historically, Unix has a long tradition of boloxing up the passing
of variable length data types across the kernel-user boundary.

We've got perhaps a half dozen ways of getting these masks out of the
kernel, and three ways of getting them (or the similar nodemasks) back
into the kernel.  The three ways being used in the sched_setaffinity
system call, the mbind and set_mempolicy system calls, and the cpuset
file system.

All three of these ways have their controversial details:
* The kernel cpumask mask size needed for sched_setaffinity calls is
  not trivially available to userland.
* The nodemask bit size is off by one in the mbind and set_mempolicy
  calls.
* The CPU and Node masks are ascii, not binary, in the cpuset calls.

One option that might make sense for these task stat registrations
would be to:
1) make the kernel/sched.c get_user_cpu_mask() routine generic,
   moving it to non-static lib/*.c code, and
2) provide a sensible way for user space to query the size of
   the kernel cpumask (and perhaps nodemask while you're at it.)

Currently, the best way I know for user space to query the kernel's
cpumask and nodemask size is to examine the length of the ascii
string values labeled "Cpus_allowed:" and "Mems_allowed:" in the file
/proc/self/status.  These ascii strings always require exactly nine
ascii chars to express each 32 bits of kernel mask code, if you include
in the count the trailing ',' comma or '\n' newline after each eight
ascii character word.

Probing /proc/self/status fields for these mask sizes is rather
unobvious and indirect, and requires caching the result if you care at
all about performance.  Userland code in support of your taskstat
facility might be better served by a more obvious way to size cpumasks.

... unless of course you're inclined to pass cpumasks formatted as
   ascii strings, in which case speak up, as I'd be delighted to
   throw in my 2 cents on how to do that ;).
 


Thanks for the size info. I did hit it while coding this up.

So I chose to use the "cpulist" ascii format that has been helpfully 
provided in include/linux/cpumask.h (by whom I wonder :-)


User specified the cpumask as an ascii string containing comma separated 
cpu ranges.
Kernel parses the same and stores it as a cpumask_t after which we can 
iterate over the

mask using standard helpers.

Since registration/deregistration is not a common operation, the 
overhead of parsing
ascii strings should be acceptable and avoids the hassles of trying to 
determine kernel cpumask size. I don't know if there are buffer overflow 
issues in passing a string (though I'm using the

standard netlink way of passing it up using NLA_STRING).

Will post the patch shortly.

--Shailabh


Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-03 Thread Shailabh Nagar

Andrew Morton wrote:


On Fri, 30 Jun 2006 23:37:10 -0400
Shailabh Nagar <[EMAIL PROTECTED]> wrote:

 


Set aside the implementation details and ask "what is a good design"?

A kernel-wide constant, whether determined at build-time or by a /proc poke
isn't a nice design.

Can we permit userspace to send in a netlink message describing a cpumask? 
That's back-compatible.



 

Yes, that should be doable. And passing in a cpumask is much better 
since we no longer

have to maintain mappings.

So the strawman is:
Listener bind()s to genetlink using its real pid.
Sends a separate "registration" message with cpumask to listen to. 
Kernel stores (real) pid and cpumask.
During task exit, kernel goes through each registered listener (small 
list) and decides which
one needs to get this exit data and calls a genetlink_unicast to each 
one that does need it.


If number of listeners is small, the lookups should be swift enough. If 
it grows large, we
can consider a fancier lookup (but there I go again, delving into 
implementation too early :-)
   



We'll need a map.

1024 CPUs, 1024 listeners, 1000 exits/sec/CPU and we're up to a million
operations per second per CPU.  Meltdown.

But it's a pretty simple map.  A per-cpu array of pointers to the head of a
linked list.  One lock for each CPU's list.
 


Here's a patch that implements the above ideas.

A listener registers interest by specifying a cpumask in the
cpulist format (comma separated ranges of cpus). The listener's pid
is entered into per-cpu lists for those cpus and exit events from those
cpus go to the listeners using netlink unicasts.

Please comment.

Andrew, this is not being proposed for inclusion yet since there is 
at least one more issue that needs to be resolved:


What happens when a listener exits without doing deregistration
(or if the listener attempts to register another cpumask while a current
registration is still active).

More on that in a separate thread.

--Shailabh



On systems with a large number of cpus, with even a modest rate of
tasks exiting per cpu, the volume of taskstats data sent on thread exit
can overflow a userspace listener's buffers.

One approach to avoiding overflow is to allow listeners to get data for
a limited and specific set of cpus. By scaling the number of listeners
and/or the cpus they monitor, userspace can handle the statistical data
overload more gracefully.

In this patch, each listener registers to listen to a specific set of
cpus by specifying a cpumask.  The interest is recorded per-cpu. When
a task exits on a cpu, its taskstats data is unicast to each listener
interested in that cpu.

Thanks to Andrew Morton for pointing out the various scalability and
general concerns of previous attempts and for suggesting this design.

Signed-Off-By: Shailabh Nagar <[EMAIL PROTECTED]>

include/linux/taskstats.h  |4 -
include/linux/taskstats_kern.h |   12 ---
kernel/taskstats.c |  136 +++--
3 files changed, 135 insertions(+), 17 deletions(-)

Index: linux-2.6.17-mm3equiv/include/linux/taskstats.h
===
--- linux-2.6.17-mm3equiv.orig/include/linux/taskstats.h2006-06-30 
19:03:40.0 -0400
+++ linux-2.6.17-mm3equiv/include/linux/taskstats.h 2006-07-01 
23:53:01.0 -0400
@@ -87,8 +87,6 @@ struct taskstats {
};


-#define TASKSTATS_LISTEN_GROUP 0x1
-
/*
 * Commands sent from userspace
 * Not versioned. New commands should only be inserted at the enum's end
@@ -120,6 +118,8 @@ enum {
TASKSTATS_CMD_ATTR_UNSPEC = 0,
TASKSTATS_CMD_ATTR_PID,
TASKSTATS_CMD_ATTR_TGID,
+   TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
+   TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
__TASKSTATS_CMD_ATTR_MAX,
};

Index: linux-2.6.17-mm3equiv/include/linux/taskstats_kern.h
===
--- linux-2.6.17-mm3equiv.orig/include/linux/taskstats_kern.h   2006-06-30 
11:57:14.0 -0400
+++ linux-2.6.17-mm3equiv/include/linux/taskstats_kern.h2006-07-01 
23:53:01.0 -0400
@@ -19,20 +19,14 @@ enum {
#ifdef CONFIG_TASKSTATS
extern kmem_cache_t *taskstats_cache;
extern struct mutex taskstats_exit_mutex;
-
-static inline int taskstats_has_listeners(void)
-{
-   if (!genl_sock)
-   return 0;
-   return netlink_has_listeners(genl_sock, TASKSTATS_LISTEN_GROUP);
-}
-
+DECLARE_PER_CPU(struct list_head, listener_list);

static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
{
*ptidstats = NULL;
-   if (taskstats_has_listeners())
+   if (!list_empty(&get_cpu_var(listener_list)))
*ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
+   put_cpu_var(listener_list);
}

static inline void task

Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-03 Thread Shailabh Nagar

Paul Jackson wrote:


Shailabh wrote:
 

I don't know if there are buffer overflow 
issues in passing a string
   



I don't know if this comment applies to "the standard netlink way of
passing it up using NLA_STRING", but the way I deal with buffer length
issues in the cpuset code is to insist that the user code express the
list in no more than 100 + 6 * NR_CPUS bytes:

From kernel/cpuset.c:

   /* Crude upper limit on largest legitimate cpulist user might write. */
   if (nbytes > 100 + 6 * NR_CPUS)
   return -E2BIG;

This lets the user specify the buffer size passed in, but prevents
them from trying a denial of service attack on the kernel by trying
to pass in a huge buffer.

If the user can't figure out how to write the desired cpulist in
that size, then tough toenails.
 


Paul,

Perhaps I should use the other ascii format for specifying cpumasks 
since it's more amenable
to specifying an upper bound for the length of the ascii string and is 
more compact ?


That format (the one used in lib/bitmap.c:bitmap_parse) is comma 
separated chunks of hex digits

with each chunk specifying 32 bits of the desired cpumask.

So
((NR_CPUS + 32) / 32) * 8 + 1
(8 hex characters for each 32 cpus, and 1 extra character for null 
terminator)

would be an upper bound that would accommodate all the cpus for sure.

Thoughts ?

--Shailabh


Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-03 Thread Shailabh Nagar

Andrew Morton wrote:


On Mon, 03 Jul 2006 17:11:59 -0400
Shailabh Nagar <[EMAIL PROTECTED]> wrote:
 



static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
{
*ptidstats = NULL;
-   if (taskstats_has_listeners())
+   if (!list_empty(&get_cpu_var(listener_list)))
*ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
+   put_cpu_var(listener_list);
}
   



It's time to uninline this function..

 


static inline void taskstats_exit_free(struct taskstats *tidstats)
Index: linux-2.6.17-mm3equiv/kernel/taskstats.c
===
--- linux-2.6.17-mm3equiv.orig/kernel/taskstats.c   2006-06-30 
23:38:39.0 -0400
+++ linux-2.6.17-mm3equiv/kernel/taskstats.c2006-07-02 00:16:18.0 
-0400
@@ -19,6 +19,8 @@
#include 
#include 
#include 
+#include 
+#include 
#include 
#include 

@@ -26,6 +28,9 @@ static DEFINE_PER_CPU(__u32, taskstats_s
static int family_registered = 0;
kmem_cache_t *taskstats_cache;

+DEFINE_PER_CPU(struct list_head, listener_list);
+static DEFINE_PER_CPU(struct rw_semaphore, listener_list_sem);
   



Which will permit listener_list to become static - it wasn't a good name
for a global anyway.

I suggest you implement a new

struct whatever {
struct rw_semaphore sem;
struct list_head list;
};
 

Ok. The listener_list was a global to allow taskstats_exit_alloc to 
access it, but this is better.



static DEFINE_PER_CPU(struct whatever, listener_aray);


 


static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
void **replyp, size_t size)
{
@@ -77,6 +92,8 @@ static int prepare_reply(struct genl_inf
static int send_reply(struct sk_buff *skb, pid_t pid, int event)
{
struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+   struct rw_semaphore *sem;
+   struct list_head *p, *head;
void *reply;
int rc;

@@ -88,9 +105,30 @@ static int send_reply(struct sk_buff *sk
return rc;
}

-   if (event == TASKSTATS_MSG_MULTICAST)
-   return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
-   return genlmsg_unicast(skb, pid);
+   if (event == TASKSTATS_MSG_UNICAST)
+   return genlmsg_unicast(skb, pid);
+
+   /*
+* Taskstats multicast is unicasts to listeners who have registered
+* interest in this cpu
+*/
+   sem = &get_cpu_var(listener_list_sem);
+   head = &get_cpu_var(listener_list);
   



This has a double preempt_disable(), but the above will fix that.

 


+   down_read(sem);
+   list_for_each(p, head) {
+   int ret;
+   struct listener *s = list_entry(p, struct listener, list);
+   ret = genlmsg_unicast(skb, s->pid);
+   if (ret)
+   rc = ret;
+   }
+   up_read(sem);
+
+   put_cpu_var(listener_list);
+   put_cpu_var(listener_list_sem);
+
+   return rc;
}

static int fill_pid(pid_t pid, struct task_struct *pidtsk,
@@ -201,8 +239,73 @@ ret:
return;
}

+static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)
+{
+   struct listener *s;
+   unsigned int cpu, mycpu;
+   cpumask_t mask;
+   struct rw_semaphore *sem;
+   struct list_head *head, *p;

-static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+   memcpy(&mask, maskp, sizeof(cpumask_t));
+   if (cpus_empty(mask))
+   return -EINVAL;
+
+   mycpu = get_cpu();
+   put_cpu();
   



This is effectively raw_smp_processor_id().  And after the put_cpu(),
`mycpu' is meaningless.
 


Hmm.

 


+   if (isadd == REGISTER) {
+   for_each_cpu_mask(cpu, mask) {
+   if (!cpu_possible(cpu))
+   continue;
+   if (cpu == mycpu)
+   preempt_disable();
+
+   sem = &per_cpu(listener_list_sem, cpu);
+   head = &per_cpu(listener_list, cpu);
+
+   s = kmalloc(sizeof(struct listener), GFP_KERNEL);
   



Cannot do GFP_KERNEL inside preempt_disable().

There's no easy solution to this problem.  GFP_ATOMIC is not a good fix at
all.  One approach would be to run lock_cpu_hotplug(), then allocate (with
GFP_KERNEL) all the memory which will be needed within the locked region,
then take the lock, then use that preallocated memory.
 


You should use kmalloc_node() here, to ensure that the memory on each CPU's
list resides with that CPU's local memory (not _this_ CPU's local memory).
 


Ok.

 


+   if (!s)
+   return -ENOMEM;
+   s->pid = pid;
+   INIT_LIST_HEAD(&s->list);
+
+   down_write(sem);
+

Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-03 Thread Shailabh Nagar

Shailabh Nagar wrote:


Andrew Morton wrote:


On Fri, 30 Jun 2006 23:37:10 -0400
Shailabh Nagar <[EMAIL PROTECTED]> wrote:

 


Set aside the implementation details and ask "what is a good design"?

A kernel-wide constant, whether determined at build-time or by a 
/proc poke

isn't a nice design.

Can we permit userspace to send in a netlink message describing a 
cpumask? That's back-compatible.






Yes, that should be doable. And passing in a cpumask is much better 
since we no longer

have to maintain mappings.

So the strawman is:
Listener bind()s to genetlink using its real pid.
Sends a separate "registration" message with cpumask to listen to. 
Kernel stores (real) pid and cpumask.
During task exit, kernel goes through each registered listener 
(small list) and decides which
one needs to get this exit data and calls a genetlink_unicast to 
each one that does need it.


If number of listeners is small, the lookups should be swift enough. 
If it grows large, we
can consider a fancier lookup (but there I go again, delving into 
implementation too early :-)
  



We'll need a map.

1024 CPUs, 1024 listeners, 1000 exits/sec/CPU and we're up to a million
operations per second per CPU.  Meltdown.

But it's a pretty simple map.  A per-cpu array of pointers to the 
head of a

linked list.  One lock for each CPU's list.
 


Here's a patch that implements the above ideas.

A listener registers interest by specifying a cpumask in the
cpulist format (comma separated ranges of cpus). The listener's pid
is entered into per-cpu lists for those cpus and exit events from those
cpus go to the listeners using netlink unicasts.

Please comment.

Andrew, this is not being proposed for inclusion yet since there is 
at least one more issue that needs to be resolved:


What happens when a listener exits without doing deregistration
(or if the listener attempts to register another cpumask while a current
registration is still active).


( Jamal, your thoughts on this problem would be appreciated)

Problem is that we have a listener task which has "registered" with 
taskstats and caused
its pid to be stored in various per-cpu lists of listeners. Later, when 
some other task exits on a given cpu, its exit data is sent using 
genlmsg_unicast on each pid present on that cpu's list.


If the listener exits without doing a "deregister", its pid continues to 
be kept around, obviously not a good thing. So we need some way of 
detecting the situation (task is no longer listening on

these cpus events) that is efficient.

Two solutions come to mind:

1. During the exit of every task, check to see if it is already  
"registered" with taskstats. If so, do a cleanup of its pid on various 
per-cpu lists.


2. Before doing a genlmsg_unicast to a pid on one of the per-cpu lists 
(or if genlmsg_unicast
fails with a -ECONNREFUSED, a result of netlink_lookup failing for that 
pid), then just delete

it from that cpu's list and continue.

1 is more desirable because its the right place to catch this and 
happens relatively rarely
(few listener exits compared to all exits). However, how can we check 
whether a task/pid

has registered with taskstats earlier ? Again, two possibilities
- Maintain a list of registered listeners within taskstats and check that.
- try to leverage netlink's nl_pid_hash which maintains the same kind of 
info for each protocol.

Thus a netlink_lookup of the pid would save a lot of work.
However, the netlink layer's hashtable appears to be for the entire 
NETLINK_GENERIC
protocol and not just for the taskstats client of NETLINK_GENERIC. So 
even if a task has
deregistered with taskstats, as long as it has some other 
NETLINK_GENERIC socket open,

it will still show up as "connected" as far as netlink is concerned.

Jamal - is my interpretation correct ? Do I need to essentially 
replicate the pidhash at the
taskstats layer ? Thoughts on whether there's any way genetlink can 
provide support for this or
whether its desirable etc. (we appear to be the second user of genetlink 
- this may not be a

common need going forward).

1 has the disadvantage that if such a situation is detected, one has to 
iterate over all cpus in

the system, deleting that pid from any per-cpu list it happens to be in.
One could store the cpumask that the listener originally used to 
optimize this search. The usual tradeoff of storage vs. time.


2 avoids the problem just mentioned since it delegates the task of 
cleanup to each cpu at the cost

of incurring an extra check for each listener for each exit on that cpu.
By storing the task_struct instead of the pid in the per-cpu lists, the 
check can be made quite

cheap.
But one problem with 2 is the issue of recycled task_structs and pids. 
Since the stale task on the
per-cpu listener list could have exited a while back, it's possible it's 
alive at the time o

Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-04 Thread Shailabh Nagar

jamal wrote:

On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote:


On Mon, 03 Jul 2006 20:54:37 -0400
Shailabh Nagar <[EMAIL PROTECTED]> wrote:



What happens when a listener exits without doing deregistration
(or if the listener attempts to register another cpumask while a current
registration is still active).



( Jamal, your thoughts on this problem would be appreciated)

Problem is that we have a listener task which has "registered" with 
taskstats and caused
its pid to be stored in various per-cpu lists of listeners. Later, when 
some other task exits on a given cpu, its exit data is sent using 
genlmsg_unicast on each pid present on that cpu's list.


If the listener exits without doing a "deregister", its pid continues to 
be kept around, obviously not a good thing. So we need some way of 
detecting the situation (task is no longer listening on

these cpus events) that is efficient.


Also need to address the case where the listener has closed off his file
descriptor but continues to run.

So hooking into listener's exit() isn't appropriate - the teardown is
associated with the lifetime of the fd, not of the process.  If we do that,
exit() gets handled for free.  



If you are always going to send unicast messages, then  -ECONNREFUSED
will tell you the listener has closed their fd - this doesn't mean it
has exited. 


That's good. So we have at least one way of detecting the "closed fd without
deregistering" within taskstats itself.


Besides that one process could open several sockets. I know
that would not be the app you would write - but it doesn't stop other
people from doing it.


As far as API is concerned, even a taskstats listener is not being
prevented from opening multiple sockets. As Andrew also pointed out,
everything needs to be done per-socket.


I think i may not follow what you are doing - for some reason i thought
you may have many listeners in user space and these messages get
multicast to them?


That was the design earlier. In the past week, the design has changed to
one where there are still many listeners in user space but messages
get unicast to each of them. Earlier listeners would get messages generated
on task exit from every cpu, now they get it only from cpus for which
they have explicitly registered interest (via a cpumask passed in through
another genetlink command).


Does the user space program somehow communicate its pid to the kernel?


Yes. When the listener registers interest in a set of cpus, as described
above, its (genl_info->pid) is being stored in the per-cpu list of
listeners for those cpus. When a task exits on one of those cpus, the
exit data is only sent via genetlink_unicast to those pids
(really, nl_pids) who are on that cpu's listener list.


Now that I think more about it, netlink is really maintaining a pidhash
of nl_pids, not process pids, right ? So if one userapp were to open
multiple sockets using NETLINK_GENERIC protocol (regardless of how many
of those are for the taskstats), each of them would have to use a
different nl_pid. Hence, it would be valid for the taskstats layer to use 
netlink_lookup() at any time to see if the corresponding socket were

closed ?


--Shailabh






Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-04 Thread Shailabh Nagar

Shailabh Nagar wrote:

jamal wrote:


On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote:


On Mon, 03 Jul 2006 20:54:37 -0400
Shailabh Nagar <[EMAIL PROTECTED]> wrote:



What happens when a listener exits without doing deregistration
(or if the listener attempts to register another cpumask while a 
current

registration is still active).



( Jamal, your thoughts on this problem would be appreciated)

Problem is that we have a listener task which has "registered" with 
taskstats and caused
its pid to be stored in various per-cpu lists of listeners. Later, 
when some other task exits on a given cpu, its exit data is sent 
using genlmsg_unicast on each pid present on that cpu's list.


If the listener exits without doing a "deregister", its pid 
continues to be kept around, obviously not a good thing. So we need 
some way of detecting the situation (task is no longer listening on

these cpus events) that is efficient.



Also need to address the case where the listener has closed off his file
descriptor but continues to run.

So hooking into listener's exit() isn't appropriate - the teardown is
associated with the lifetime of the fd, not of the process.  If we do 
that,
exit() gets handled for free.  




If you are always going to send unicast messages, then  -ECONNREFUSED
will tell you the listener has closed their fd - this doesn't mean it
has exited. 



That's good. So we have at least one way of detecting the "closed fd without
deregistering" within taskstats itself.


Besides that one process could open several sockets. I know
that would not be the app you would write - but it doesn't stop other
people from doing it.



As far as API is concerned, even a taskstats listener is not being
prevented from opening multiple sockets. As Andrew also pointed out,
everything needs to be done per-socket.


I think I may not follow what you are doing - for some reason I thought
you had many listeners in user space and these messages get
multicast to them?



That was the design earlier. In the past week, the design has changed to
one where there are still many listeners in user space, but messages
get unicast to each of them. Earlier, listeners would get messages
generated on task exit from every cpu; now they get them only from cpus
for which they have explicitly registered interest (via a cpumask passed
in through another genetlink command).


Does the user space program somehow communicate its pid to the kernel?



Yes. When the listener registers interest in a set of cpus, as described
above, its genl_info->pid is stored in the per-cpu list of
listeners for those cpus. When a task exits on one of those cpus, the
exit data is sent via genlmsg_unicast only to those pids
(really, nl_pids) that are on that cpu's listener list.


Now that I think more about it, netlink is really maintaining a pidhash
of nl_pids, not process pids, right? So if one userapp were to open
multiple sockets using the NETLINK_GENERIC protocol (regardless of how
many of those are for taskstats), each of them would have to use a
different nl_pid. Hence, it would be valid for the taskstats layer to
use netlink_lookup() at any time to see if the corresponding socket were
closed?



Here's a strawman for the problem we're trying to solve: get
notification of the close of a NETLINK_GENERIC socket that had
been used to register interest for some cpus within taskstats.

From looking at the netlink code, the way to go seems to be:

- taskstats maintains a pidhash of nl_pids that are currently
registered to listen to at least one cpu. It also stores the
cpumask used.
- taskstats registers a notifier block within netlink_chain
and receives a callback on the NETLINK_URELEASE event, similar
to drivers/scsi/scsi_transport_iscsi.c:iscsi_rcv_nl_event()

- the callback checks that the protocol is NETLINK_GENERIC
and that the nl_pid for the socket is in taskstats' pidhash. If so, it
does a cleanup using the stored cpumask and releases the nl_pid
from the pidhash.

We can even do away with the deregister command altogether and
simply rely on this autocleanup.
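In kernel terms, the notifier piece of this strawman would look roughly
like the non-compilable sketch below, modeled on iscsi_rcv_nl_event().
The taskstats_pidhash_*() and taskstats_cleanup_mask() names are
hypothetical stand-ins for the pidhash lookup and per-cpu list cleanup
described above, not functions in the patch:

```c
/* Hypothetical sketch of NETLINK_URELEASE-driven autocleanup.
 * taskstats_pidhash_find/del() and taskstats_cleanup_mask() are
 * invented names for the bookkeeping described in the text. */
static int taskstats_cb(struct notifier_block *nb,
			unsigned long event, void *ptr)
{
	struct netlink_notify *n = ptr;
	cpumask_t *mask;

	if (event != NETLINK_URELEASE || n->protocol != NETLINK_GENERIC)
		return NOTIFY_DONE;

	/* was this nl_pid registered for some cpus? */
	mask = taskstats_pidhash_find(n->pid);		/* hypothetical */
	if (mask) {
		/* remove n->pid from each cpu's listener list */
		taskstats_cleanup_mask(n->pid, mask);	/* hypothetical */
		taskstats_pidhash_del(n->pid);		/* hypothetical */
	}
	return NOTIFY_DONE;
}

static struct notifier_block taskstats_nl_notifier = {
	.notifier_call = taskstats_cb,
};

/* at init time: netlink_register_notifier(&taskstats_nl_notifier); */
```

The callback fires on every netlink socket release, so the cheap
protocol check comes first; only sockets found in the pidhash pay for
the cpumask walk.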

--Shailabh


Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-05 Thread Shailabh Nagar
jamal wrote:
> Shailabh,
> 
> On Tue, 2006-04-07 at 12:37 -0400, Shailabh Nagar wrote:
> [..]
> 
>>Here's a strawman for the problem we're trying to solve: get
>>notification of the close of a NETLINK_GENERIC socket that had
>>been used to register interest for some cpus within taskstats.
>>
>> From looking at the netlink code, the way to go seems to be
>>
>>- it maintains a pidhash of nl_pids that are currently
>>registered to listen to at least one cpu. It also stores the
>>cpumask used.
>>- taskstats registers a notifier block within netlink_chain
>>and receives a callback on the NETLINK_URELEASE event, similar
>>to drivers/scsi/scsi_transport_iscsi.c:iscsi_rcv_nl_event()
>>
>>- the callback checks that the protocol is NETLINK_GENERIC
>>and that the nl_pid for the socket is in taskstats' pidhash. If so, it
>>does a cleanup using the stored cpumask and releases the nl_pid
>>from the pidhash.
>>
> 
> 
> Sounds quite reasonable.  I am beginning to wonder whether we should
> do the NETLINK_URELEASE in general for NETLINK_GENERIC

I'd initially thought that might be useful but since NETLINK_GENERIC
is only "virtually" multiplexing the sockfd amongst each of its users,
I don't know what benefits a generic notifier at NETLINK_GENERIC layer
would bring (as opposed to each NETLINK_GENERIC user directly registering
its callback with netlink). Perhaps simplicity ?


>>We can even do away with the deregister command altogether and
>>simply rely on this autocleanup.
> 
> 
> I think you may still need the register if you are going to allow
> multiple sockets per listener process, no?

The register command, yes. But an explicit deregister, as opposed to
auto cleanup on fd close, may not be used all that much :-)

> The other question is how do you correlate pid -> fd?

For the notifier callback, I thought netlink_release() will
provide the nl_pid corresponding to the fd being closed?
I can just do a search for that nl_pid in the taskstats-private pidhash.

The nl_pid gets into the pidhash via the genl_info->pid field
when the listener issues the register command.

Will that be correct?

So here's the sequence of pids being used/hashed etc. Please let
me know if my assumptions are correct.

1. Same listener thread opens 2 sockets

On sockfd1, does a bind() using
sockaddr_nl.nl_pid = my_pid1
On sockfd2, does a bind() using
sockaddr_nl.nl_pid = my_pid2

(one of the my_pid values could be its process pid but doesn't have to be)

2. Listener supplies a cpumask through a register command sent on
sockfd1.

In the kernel, when the command is received,
the genl_info->pid field contains my_pid1

my_pid1 is stored in a pidhash along with the corresponding cpumask.

cpumask is used to store the my_pid1 into per-cpu lists for each
cpu in the mask.

3. When an exit event happens on one of those cpus in the mask,
it is sent to this listener using
genlmsg_unicast(, my_pid1)


4. When the listener closes sockfd1, netlink_release() gets called
and that calls a taskstats notifier callback (say taskstats_cb) with
struct netlink_notify n =
{ .protocol =  NETLINK_GENERIC, .pid = my_pid1 }

and using the .pid within, taskstats_cb can do a lookup within its
pidhash. If it's present, it uses the cpumask stored alongside to go
clean up my_pid1 stored in the listener list of each cpu in the mask.


--Shailabh






Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-05 Thread Shailabh Nagar

Jay Lan wrote:

Shailabh Nagar wrote:



Yes. If no one registers to listen on a particular CPU, data from tasks
exiting on that cpu is not sent out at all.



Shailabh also wrote:


During task exit, the kernel goes through each registered listener
(a small list), decides which ones need to get this exit data, and
calls genlmsg_unicast to each one that does.




Are we eliminating multicast taskstats data at exit time? 


Yes. Only unicasts to each listener now.


Unicast exit data with a cpumask will do for me, but I'd just like to
be sure where we are.




Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-05 Thread Shailabh Nagar
Chris Sturtivant wrote:
> Shailabh Nagar wrote:
> 
>> So here's the sequence of pids being used/hashed etc. Please let
>> me know if my assumptions are correct ?
>>
>> 1. Same listener thread opens 2 sockets
>>
>> On sockfd1, does a bind() using
>> sockaddr_nl.nl_pid = my_pid1
>> On sockfd2, does a bind() using
>> sockaddr_nl.nl_pid = my_pid2
>>
>> (one of the my_pid values could be its process pid but doesn't have to be)
>>   
> 
> 
> For CSA, we are proposing to use a single (multi-threaded) demon that
> combines both the userland components for job and CSA that used to be in
> the kernel.  In this case, the pid will be the same for two connections
> along with the cpu range.  Does what you're saying here mean that we
> should choose distinct values for my_pid1 and my_pid2 to avoid the two
> sockets looking the same? 

Yes, that is my understanding and also what's mentioned in the bind()
section of
http://www.linuxjournal.com/article/7356

though I've yet to try it out myself (will do so shortly after
making the other suggested changes to the basic patch)

--Shailabh

> I'm not too familiar with netlink, yet.
> 
> Best regards,
> 
> 
> --Chris
> 



[PATCH] per-task delay accounting taskstats interface: control exit data through cpumasks]

2006-07-06 Thread Shailabh Nagar

On systems with a large number of cpus, with even a modest rate of
tasks exiting per cpu, the volume of taskstats data sent on thread exit
can overflow a userspace listener's buffers.

One approach to avoiding overflow is to allow listeners to get data for
a limited and specific set of cpus. By scaling the number of listeners
and/or the cpus they monitor, userspace can handle the statistical data
overload more gracefully.

In this patch, each listener registers to listen to a specific set of
cpus by specifying a cpumask.  The interest is recorded per-cpu. When
a task exits on a cpu, its taskstats data is unicast to each listener
interested in that cpu.

Thanks to Andrew Morton for pointing out the various scalability and
general concerns of previous attempts and for suggesting this design.

Signed-Off-By: Shailabh Nagar <[EMAIL PROTECTED]>

---

Fixes comments by akpm:
- uninline taskstats_exit_alloc
- remove all preempt_disable
- combine per-cpu listener list and its locking into a single struct
- check cpumask as a whole against cpu_possible_map
- missing kfree(struct listener)

Suggested by Paul Jackson
- limit on length of cpumask passed in as a cpulist

Addresses the problem of "close fd without deregistration" though it
can be done better.


 include/linux/taskstats.h  |4 -
 include/linux/taskstats_kern.h |   22 +
 kernel/exit.c  |5 -
 kernel/taskstats.c |  163 ++---
 4 files changed, 163 insertions(+), 31 deletions(-)

Index: linux-2.6.17-mm3equiv/include/linux/taskstats.h
===
--- linux-2.6.17-mm3equiv.orig/include/linux/taskstats.h2006-06-30 
19:03:40.0 -0400
+++ linux-2.6.17-mm3equiv/include/linux/taskstats.h 2006-07-06 
02:38:28.0 -0400
@@ -87,8 +87,6 @@ struct taskstats {
 };


-#define TASKSTATS_LISTEN_GROUP 0x1
-
 /*
  * Commands sent from userspace
  * Not versioned. New commands should only be inserted at the enum's end
@@ -120,6 +118,8 @@ enum {
TASKSTATS_CMD_ATTR_UNSPEC = 0,
TASKSTATS_CMD_ATTR_PID,
TASKSTATS_CMD_ATTR_TGID,
+   TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
+   TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
__TASKSTATS_CMD_ATTR_MAX,
 };

Index: linux-2.6.17-mm3equiv/include/linux/taskstats_kern.h
===
--- linux-2.6.17-mm3equiv.orig/include/linux/taskstats_kern.h   2006-06-30 
11:57:14.0 -0400
+++ linux-2.6.17-mm3equiv/include/linux/taskstats_kern.h2006-07-05 
16:51:52.0 -0400
@@ -20,21 +20,6 @@ enum {
 extern kmem_cache_t *taskstats_cache;
 extern struct mutex taskstats_exit_mutex;

-static inline int taskstats_has_listeners(void)
-{
-   if (!genl_sock)
-   return 0;
-   return netlink_has_listeners(genl_sock, TASKSTATS_LISTEN_GROUP);
-}
-
-
-static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
-{
-   *ptidstats = NULL;
-   if (taskstats_has_listeners())
-   *ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
-}
-
 static inline void taskstats_exit_free(struct taskstats *tidstats)
 {
if (tidstats)
@@ -79,17 +64,18 @@ static inline void taskstats_tgid_free(s
kmem_cache_free(taskstats_cache, stats);
 }

-extern void taskstats_exit_send(struct task_struct *, struct taskstats *, int);
+extern void taskstats_exit_alloc(struct taskstats **, unsigned int *);
+extern void taskstats_exit_send(struct task_struct *, struct taskstats *, int, 
unsigned int);
 extern void taskstats_init_early(void);
 extern void taskstats_tgid_alloc(struct signal_struct *);
 #else
-static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
+static inline void taskstats_exit_alloc(struct taskstats **ptidstats, unsigned 
int *mycpu)
 {}
 static inline void taskstats_exit_free(struct taskstats *ptidstats)
 {}
 static inline void taskstats_exit_send(struct task_struct *tsk,
   struct taskstats *tidstats,
-  int group_dead)
+  int group_dead, unsigned int cpu)
 {}
 static inline void taskstats_tgid_init(struct signal_struct *sig)
 {}
Index: linux-2.6.17-mm3equiv/kernel/taskstats.c
===
--- linux-2.6.17-mm3equiv.orig/kernel/taskstats.c   2006-06-30 
23:38:39.0 -0400
+++ linux-2.6.17-mm3equiv/kernel/taskstats.c2006-07-06 02:41:15.0 
-0400
@@ -19,9 +19,17 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 

+/*
+ * Maximum length of a cpumask that can be specified in
+ * the TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK attribute
+ */
+#define TASKSTATS_CPUMASK_MAXLEN   (100+6*NR_CPUS)
+
 static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
 static int family_registered = 0;
kmem_cache_t *taskstats_cache;

Re: [PATCH] per-task delay accounting taskstats interface: control exit data through cpumasks]

2006-07-06 Thread Shailabh Nagar
On Thu, 2006-07-06 at 02:56 -0700, Andrew Morton wrote:
> On Thu, 06 Jul 2006 05:28:35 -0400
> Shailabh Nagar <[EMAIL PROTECTED]> wrote:
> 
> > On systems with a large number of cpus, with even a modest rate of
> > tasks exiting per cpu, the volume of taskstats data sent on thread exit
> > can overflow a userspace listener's buffers.
> > 
> > One approach to avoiding overflow is to allow listeners to get data for
> > a limited and specific set of cpus. By scaling the number of listeners
> > and/or the cpus they monitor, userspace can handle the statistical data
> > overload more gracefully.
> > 
> > In this patch, each listener registers to listen to a specific set of
> > cpus by specifying a cpumask.  The interest is recorded per-cpu. When
> > a task exits on a cpu, its taskstats data is unicast to each listener
> > interested in that cpu.
> > 
> > Thanks to Andrew Morton for pointing out the various scalability and
> > general concerns of previous attempts and for suggesting this design.
> > 
> > ...
> >
> > --- linux-2.6.17-mm3equiv.orig/include/linux/taskstats.h2006-06-30 
> > 19:03:40.0 -0400
> > +++ linux-2.6.17-mm3equiv/include/linux/taskstats.h 2006-07-06 
> > 02:38:28.0 -0400
> 
> Your email client performs space-stuffing.  Fortunately "sed -e 's/^  / /'"
> is easy.
> 
> >   #include 
> >   #include 
> > +#include 
> > +#include 
> >   #include 
> >   #include 
> 
> Like that.

Sorry. First the text breaking and now patch mangling... I'll switch
mail clients.
> 
> > 
> > +static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)
> > +{
> > +   struct listener *s;
> > +   struct listener_list *listeners;
> > +   unsigned int cpu;
> > +   cpumask_t mask;
> > +   struct list_head *p;
> > +
> > +   memcpy(&mask, maskp, sizeof(cpumask_t));
> 
>   mask = *maskp; ?

Yes.

> 
> > +   if (!cpus_subset(mask, cpu_possible_map))
> > +   return -EINVAL;
> > +
> > +   if (isadd == REGISTER) {
> > +   for_each_cpu_mask(cpu, mask) {
> > +   s = kmalloc_node(sizeof(struct listener), GFP_KERNEL,
> > +cpu_to_node(cpu));
> > +   if (!s)
> > +   return -ENOMEM;
> > +   s->pid = pid;
> > +   INIT_LIST_HEAD(&s->list);
> > +
> > +   listeners = &per_cpu(listener_array, cpu);
> > +   down_write(&listeners->sem);
> > +   list_add(&s->list, &listeners->list);
> > +   up_write(&listeners->sem);
> > +   }
> > +   } else {
> > +   for_each_cpu_mask(cpu, mask) {
> > +   struct list_head *tmp;
> > +
> > +   listeners = &per_cpu(listener_array, cpu);
> > +   down_write(&listeners->sem);
> > +   list_for_each_safe(p, tmp, &listeners->list) {
> > +   s = list_entry(p, struct listener, list);
> > +   if (s->pid == pid) {
> > +   list_del(&s->list);
> > +   kfree(s);
> > +   break;
> > +   }
> > +   }
> > +   up_write(&listeners->sem);
> > +   }
> > +   }
> > +   return 0;
> > +}
> 
> You might choose to handle the ENOMEM situation here by backing out and not
> leaving things half-installed.  I suspect that's just a simple `goto'.

Will do. 

> 
> > -static int taskstats_send_stats(struct sk_buff *skb, struct genl_info 
> > *info)
> > +static int taskstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
> >   {
> > int rc = 0;
> > struct sk_buff *rep_skb;
> > @@ -210,6 +302,25 @@ static int taskstats_send_stats(struct s
> > void *reply;
> > size_t size;
> > struct nlattr *na;
> > +   cpumask_t mask;
> 
> When counting add_del_listener(), that's two cpumasks on the stack.  How
> big can these get?  256 bytes?  Is it possible to get by with just the one?

For 1024 cpus, 128 bytes. All the cpumask functions seem to need the
mask (though they internally use the address). Will try to see if I can
reduce.
> 
> > +
> > +   if (info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK]

Re: [PATCH] per-task delay accounting taskstats interface: control exit data through cpumasks]

2006-07-06 Thread Shailabh Nagar
On systems with a large number of cpus, with even a modest rate of 
tasks exiting per cpu, the volume of taskstats data sent on thread exit
can overflow a userspace listener's buffers.

One approach to avoiding overflow is to allow listeners to get data for
a limited and specific set of cpus. By scaling the number of listeners 
and/or the cpus they monitor, userspace can handle the statistical data
overload more gracefully.

In this patch, each listener registers to listen to a specific set of 
cpus by specifying a cpumask.  The interest is recorded per-cpu. When 
a task exits on a cpu, its taskstats data is unicast to each listener 
interested in that cpu.

Thanks to Andrew Morton for pointing out the various scalability and 
general concerns of previous attempts and for suggesting this design.

Signed-Off-By: Shailabh Nagar <[EMAIL PROTECTED]>
---

Addresses various comments by akpm:
- fix your email client !
- avoid memcpy of cpumask
- handle ENOMEM within add_del_listener properly
- check return value of cpulist_parse
- exploit kfree(NULL)

Does not address (can't see an easy way to do it):
- try to avoid allocating two cpumasks on the stack in taskstats_user_cmd

Tested on a 1-way and works ok. Testing on 2/4 way being done.


 include/linux/taskstats.h  |4 
 include/linux/taskstats_kern.h |   22 -
 kernel/exit.c  |5 -
 kernel/taskstats.c |  168 ++---
 4 files changed, 168 insertions(+), 31 deletions(-)

Index: linux-2.6.17-mm3equiv/include/linux/taskstats.h
===
--- linux-2.6.17-mm3equiv.orig/include/linux/taskstats.h2006-06-30 
19:03:40.0 -0400
+++ linux-2.6.17-mm3equiv/include/linux/taskstats.h 2006-07-06 
06:48:10.0 -0400
@@ -87,8 +87,6 @@ struct taskstats {
 };
 
 
-#define TASKSTATS_LISTEN_GROUP 0x1
-
 /*
  * Commands sent from userspace
  * Not versioned. New commands should only be inserted at the enum's end
@@ -120,6 +118,8 @@ enum {
TASKSTATS_CMD_ATTR_UNSPEC = 0,
TASKSTATS_CMD_ATTR_PID,
TASKSTATS_CMD_ATTR_TGID,
+   TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
+   TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
__TASKSTATS_CMD_ATTR_MAX,
 };
 
Index: linux-2.6.17-mm3equiv/include/linux/taskstats_kern.h
===
--- linux-2.6.17-mm3equiv.orig/include/linux/taskstats_kern.h   2006-06-30 
11:57:14.0 -0400
+++ linux-2.6.17-mm3equiv/include/linux/taskstats_kern.h2006-07-06 
06:48:10.0 -0400
@@ -20,21 +20,6 @@ enum {
 extern kmem_cache_t *taskstats_cache;
 extern struct mutex taskstats_exit_mutex;
 
-static inline int taskstats_has_listeners(void)
-{
-   if (!genl_sock)
-   return 0;
-   return netlink_has_listeners(genl_sock, TASKSTATS_LISTEN_GROUP);
-}
-
-
-static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
-{
-   *ptidstats = NULL;
-   if (taskstats_has_listeners())
-   *ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
-}
-
 static inline void taskstats_exit_free(struct taskstats *tidstats)
 {
if (tidstats)
@@ -79,17 +64,18 @@ static inline void taskstats_tgid_free(s
kmem_cache_free(taskstats_cache, stats);
 }
 
-extern void taskstats_exit_send(struct task_struct *, struct taskstats *, int);
+extern void taskstats_exit_alloc(struct taskstats **, unsigned int *);
+extern void taskstats_exit_send(struct task_struct *, struct taskstats *, int, 
unsigned int);
 extern void taskstats_init_early(void);
 extern void taskstats_tgid_alloc(struct signal_struct *);
 #else
-static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
+static inline void taskstats_exit_alloc(struct taskstats **ptidstats, unsigned 
int *mycpu)
 {}
 static inline void taskstats_exit_free(struct taskstats *ptidstats)
 {}
 static inline void taskstats_exit_send(struct task_struct *tsk,
   struct taskstats *tidstats,
-  int group_dead)
+  int group_dead, unsigned int cpu)
 {}
 static inline void taskstats_tgid_init(struct signal_struct *sig)
 {}
Index: linux-2.6.17-mm3equiv/kernel/taskstats.c
===
--- linux-2.6.17-mm3equiv.orig/kernel/taskstats.c   2006-06-30 
23:38:39.0 -0400
+++ linux-2.6.17-mm3equiv/kernel/taskstats.c2006-07-06 07:02:54.0 
-0400
@@ -19,9 +19,17 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 
+/*
+ * Maximum length of a cpumask that can be specified in
+ * the TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK attribute
+ */
+#define TASKSTATS_CPUMASK_MAXLEN   (100+6*NR_CPUS)
+
 static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
 static int family_registered = 0;
kmem_cache_t *taskstats_cache;

Re: [PATCH] per-task delay accounting taskstats interface: control exit data through cpumasks]

2006-07-06 Thread Shailabh Nagar
On Thu, 2006-07-06 at 14:08 +0200, Thomas Graf wrote:
> * Shailabh Nagar <[EMAIL PROTECTED]> 2006-07-06 07:37
> > @@ -37,9 +45,26 @@ static struct nla_policy taskstats_cmd_g
> >  __read_mostly = {
> > [TASKSTATS_CMD_ATTR_PID]  = { .type = NLA_U32 },
> > [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
> > +   [TASKSTATS_CMD_ATTR_REGISTER_CPUMASK] = { .type = NLA_STRING },
> > +   [TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK] = { .type = NLA_STRING },};
> 
> > +   na = info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK];
> > +   if (nla_len(na) > TASKSTATS_CPUMASK_MAXLEN)
> > +   return -E2BIG;
> > +   rc = cpulist_parse((char *)nla_data(na), mask);
> 
> This isn't safe, the data in the attribute is not guaranteed to be
> NUL terminated. Still it's probably me to blame for not making
> this more obvious in the API.
> 
> I've attached a patch below extending the API to make it easier
> for interfaces using NUL terminated strings, so you'd do:
> 
> [TASKSTATS_CMD_ATTR_REGISTER_CPUMASK] = { .type = NLA_NUL_STRING,
>   .len = TASKSTATS_CPUMASK_MAXLEN }
> 
> ... and then use (char *) nla_data(str_attr) without any further
> sanity checks.


Thanks. That makes it much clearer. I was looking around for a "maxlen"
attribute helper yesterday :-)

--Shailabh


> 
> [NETLINK]: Improve string attribute validation
> 
> Introduces a new attribute type NLA_NUL_STRING to support NUL
> terminated strings. Attributes of this kind are required to carry
> a terminating NUL within the maximum specified in the policy.
> 
> The `old' NLA_STRING which is not required to be NUL terminated
> is extended to provide means to specify a maximum length of the
> string.
> 
> Aims at easing the pain with using nla_strlcpy() on temporary
> buffers.
> 
> The old `minlen' field is renamed to `len' for cosmetic purposes
> which is ok since nobody was using it at this point.
> 
> Signed-off-by: Thomas Graf <[EMAIL PROTECTED]>



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] per-task delay accounting taskstats interface: control exit data through cpumasks]

2006-07-06 Thread Shailabh Nagar
Andrew Morton wrote:
> Thomas Graf <[EMAIL PROTECTED]> wrote:
> 
>>* Shailabh Nagar <[EMAIL PROTECTED]> 2006-07-06 07:37
>>
>>>@@ -37,9 +45,26 @@ static struct nla_policy taskstats_cmd_g
>>> __read_mostly = {
>>> [TASKSTATS_CMD_ATTR_PID]  = { .type = NLA_U32 },
>>> [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
>>>+[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK] = { .type = NLA_STRING },
>>>+[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK] = { .type = NLA_STRING },};
>>
>>>+na = info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK];
>>>+if (nla_len(na) > TASKSTATS_CPUMASK_MAXLEN)
>>>+return -E2BIG;
>>>+rc = cpulist_parse((char *)nla_data(na), mask);
>>
>>This isn't safe, the data in the attribute is not guaranteed to be
>>NUL terminated. Still it's probably me to blame for not making
>>this more obvious in the API.
>>
> 
> 
> Thanks, that was an unpleasant bug.
> 
> 
>>I've attached a patch below extending the API to make it easier
>>for interfaces using NUL terminated strings,
> 
> 
> In the interests of keeping this work decoupled from netlink enhancements
> I'd propose the below.  

The patch looks good. I was also thinking of simply modifying the input
to have the NUL termination, and switching later when netlink provides
generic support.


> Is it bad to modify the data at nla_data()?
> 
> 
> --- 
> a/kernel/taskstats.c~per-task-delay-accounting-taskstats-interface-control-exit-data-through-cpumasks-fix
> +++ a/kernel/taskstats.c
> @@ -299,6 +299,23 @@ cleanup:
>   return 0;
>  }
>  
> +static int parse(struct nlattr *na, cpumask_t *mask)
> +{
> + char *data;
> + int len;
> +
> + if (na == NULL)
> + return 1;
> + len = nla_len(na);
> + if (len > TASKSTATS_CPUMASK_MAXLEN)
> + return -E2BIG;
> + if (len < 1)
> + return -EINVAL;
> + data = nla_data(na);
> + data[len - 1] = '\0';
> + return cpulist_parse(data, *mask);
> +}
> +
>  static int taskstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
>  {
>   int rc = 0;
> @@ -309,27 +326,17 @@ static int taskstats_user_cmd(struct sk_
>   struct nlattr *na;
>   cpumask_t mask;
>  
> - if (info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK]) {
> - na = info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK];
> - if (nla_len(na) > TASKSTATS_CPUMASK_MAXLEN)
> - return -E2BIG;
> - rc = cpulist_parse((char *)nla_data(na), mask);
> - if (rc)
> - return rc;
> - rc = add_del_listener(info->snd_pid, &mask, REGISTER);
> + rc = parse(info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK], &mask);
> + if (rc < 0)
>   return rc;
> - }
> + if (rc == 0)
> + return add_del_listener(info->snd_pid, &mask, REGISTER);
>  
> - if (info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK]) {
> - na = info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK];
> - if (nla_len(na) > TASKSTATS_CPUMASK_MAXLEN)
> - return -E2BIG;
> - rc = cpulist_parse((char *)nla_data(na), mask);
> - if (rc)
> - return rc;
> - rc = add_del_listener(info->snd_pid, &mask, DEREGISTER);
> + rc = parse(info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK], &mask);
> + if (rc < 0)
>   return rc;
> - }
> + if (rc == 0)
> + return add_del_listener(info->snd_pid, &mask, DEREGISTER);
>  
>   /*
>* Size includes space for nested attributes
> _
> 



[Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener

2006-07-10 Thread Shailabh Nagar
Use a cloned sk_buff for each netlink message sent to multiple listeners.

Earlier, the same skb, representing a netlink message, was being
erroneously reused for genlmsg_unicast() (effectively
netlink_unicast()) calls to each listener on the per-cpu list of
listeners. Since netlink_unicast() frees the skb passed to it
regardless of the status of the send, reuse is bad.

Thanks to Chandra Seetharaman for discovering this bug.

Signed-Off-By: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-Off-By: Chandra Seetharaman <[EMAIL PROTECTED]>

 kernel/taskstats.c |   13 -
 1 files changed, 12 insertions(+), 1 deletion(-)

Index: linux-2.6.18-rc1/kernel/taskstats.c
===
--- linux-2.6.18-rc1.orig/kernel/taskstats.c2006-07-10 23:44:56.0 
-0400
+++ linux-2.6.18-rc1/kernel/taskstats.c 2006-07-10 23:45:15.0 -0400
@@ -125,6 +125,7 @@ static int send_cpu_listeners(struct sk_
struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
struct listener_list *listeners;
struct listener *s, *tmp;
+   struct sk_buff *skb_next, *skb_cur = skb;
void *reply = genlmsg_data(genlhdr);
int rc, ret;
 
@@ -138,12 +139,22 @@ static int send_cpu_listeners(struct sk_
listeners = &per_cpu(listener_array, cpu);
down_write(&listeners->sem);
list_for_each_entry_safe(s, tmp, &listeners->list, list) {
-   ret = genlmsg_unicast(skb, s->pid);
+   skb_next = NULL;
+   if (!list_islast(&s->list, &listeners->list)) {
+   skb_next = skb_clone(skb_cur, GFP_KERNEL);
+   if (!skb_next) {
+   nlmsg_free(skb_cur);
+   rc = -ENOMEM;
+   break;
+   }
+   }
+   ret = genlmsg_unicast(skb_cur, s->pid);
if (ret == -ECONNREFUSED) {
list_del(&s->list);
kfree(s);
rc = ret;
}
+   skb_cur = skb_next;
}
up_write(&listeners->sem);
 




Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener

2006-07-11 Thread Shailabh Nagar
Herbert Xu wrote:
> On Tue, Jul 11, 2006 at 03:57:31AM -0700, Andrew Morton wrote:
> 
>  down_write(&listeners->sem);
>  list_for_each_entry_safe(s, tmp, &listeners->list, list) {
>- ret = genlmsg_unicast(skb, s->pid);
>+ skb_next = NULL;
>+ if (!list_islast(&s->list, &listeners->list)) {
>+ skb_next = skb_clone(skb_cur, GFP_KERNEL);

If we do a GFP_KERNEL allocation with this semaphore held, and the
oom-killer tries to kill something to satisfy the allocation, and the
killed task gets stuck on that semaphore, I wonder if the box locks up.

Hmm... doesn't look very safe, does it.
There's no real need for us to skb_clone() within the sem. Keeping a
count of listeners and doing the clone outside should let us avoid this
problem.

I was trying to avoid doing the above because of the potential for
listeners getting added continuously to the list
(and having to repeat the allocation loop outside the down_write).

But on second thoughts, we're under no obligation to send the data to all the
listeners who add themselves in the short time between our taking a snapshot
of the listener count and when the send is done (within the down_write). So
it should be ok.

>>>
>>>We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
>>>can deadlock with the oom-killer we probably should fix that, preferably
>>>by having GFP_KERNEL fail in that case.
>>
>>This lock is special, in that it's taken on the exit() path (I think).  So
>>it can block tasks which are trying to exit.
> 
> 
> Sorry, missed the context.
> 
> If there is a deadlock then it's not just this allocation that you
> need to worry about.  There is also an allocation within
> genlmsg_unicast() that would be GFP_KERNEL.
> 

That's true. The GFP_KERNEL allocation potentially made in
netlink_trim() as part of genlmsg_unicast()/netlink_unicast() is again
a problem.

So perhaps we should switch to using RCU for protecting the listener list.

--Shailabh

> Cheers,



Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener

2006-07-11 Thread Shailabh Nagar
On Tue, 2006-07-11 at 07:15, Herbert Xu wrote:
> On Tue, Jul 11, 2006 at 03:57:31AM -0700, Andrew Morton wrote:
> >
> > > >>   down_write(&listeners->sem);
> > > >>   list_for_each_entry_safe(s, tmp, &listeners->list, list) {
> > > >> - ret = genlmsg_unicast(skb, s->pid);
> > > >> + skb_next = NULL;
> > > >> + if (!list_islast(&s->list, &listeners->list)) {
> > > >> + skb_next = skb_clone(skb_cur, GFP_KERNEL);
> > > > 
> > > > If we do a GFP_KERNEL allocation with this semaphore held, and the
> > > > oom-killer tries to kill something to satisfy the allocation, and the
> > > > killed task gets stuck on that semaphore, I wonder of the box locks up.
> > > 
> > > We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
> > > can deadlock with the oom-killer we probably should fix that, preferably
> > > by having GFP_KERNEL fail in that case.
> > 
> > This lock is special, in that it's taken on the exit() path (I think).  So
> > it can block tasks which are trying to exit.
> 
> Sorry, missed the context.
> 
> If there is a deadlock then it's not just this allocation that you
> need to worry about.  There is also an allocation within genlmsg_unicast
> that would be GFP_KERNEL.


Remove down_write() from taskstats code invoked on the exit() path.

In send_cpu_listeners(), which is called on the exit path,
a down_write() was protecting operations like skb_clone() and
genlmsg_unicast() that do GFP_KERNEL allocations. If the oom-killer
decides to kill tasks to satisfy the allocations, the exit of those
tasks could block on the same semaphore.

The down_write() was only needed to allow removal of invalid 
listeners from the listener list. The patch converts the down_write 
to a down_read and defers the removal to a separate critical region. 
This ensures that even if the oom-killer is called, no other task's 
exit is blocked as it can still acquire another down_read.

Thanks to Andrew Morton & Herbert Xu for pointing out the oom 
related pitfalls, and to Chandra Seetharaman for suggesting this 
fix instead of using something more complex like RCU.

Signed-Off-By: Chandra Seetharaman <[EMAIL PROTECTED]>
Signed-Off-By: Shailabh Nagar <[EMAIL PROTECTED]>

---

 kernel/taskstats.c |   24 +++-
 1 files changed, 19 insertions(+), 5 deletions(-)

Index: linux-2.6.18-rc1/kernel/taskstats.c
===
--- linux-2.6.18-rc1.orig/kernel/taskstats.c2006-07-11 00:13:00.0 
-0400
+++ linux-2.6.18-rc1/kernel/taskstats.c 2006-07-12 00:07:53.0 -0400
@@ -51,6 +51,7 @@ __read_mostly = {
 struct listener {
struct list_head list;
pid_t pid;
+   char valid;
 };
 
 struct listener_list {
@@ -127,7 +128,7 @@ static int send_cpu_listeners(struct sk_
struct listener *s, *tmp;
struct sk_buff *skb_next, *skb_cur = skb;
void *reply = genlmsg_data(genlhdr);
-   int rc, ret;
+   int rc, ret, delcount = 0;
 
rc = genlmsg_end(skb, reply);
if (rc < 0) {
@@ -137,7 +138,7 @@ static int send_cpu_listeners(struct sk_
 
rc = 0;
listeners = &per_cpu(listener_array, cpu);
-   down_write(&listeners->sem);
+   down_read(&listeners->sem);
list_for_each_entry_safe(s, tmp, &listeners->list, list) {
skb_next = NULL;
if (!list_islast(&s->list, &listeners->list)) {
@@ -150,14 +151,26 @@ static int send_cpu_listeners(struct sk_
}
ret = genlmsg_unicast(skb_cur, s->pid);
if (ret == -ECONNREFUSED) {
-   list_del(&s->list);
-   kfree(s);
+   s->valid = 0;
+   delcount++;
rc = ret;
}
skb_cur = skb_next;
}
-   up_write(&listeners->sem);
+   up_read(&listeners->sem);
+
+   if (!delcount)
+   return rc;
 
+   /* Delete invalidated entries */
+   down_write(&listeners->sem);
+   list_for_each_entry_safe(s, tmp, &listeners->list, list) {
+   if (!s->valid) {
+   list_del(&s->list);
+   kfree(s);
+   }
+   }
+   up_write(&listeners->sem);
return rc;
 }
 
@@ -290,6 +303,7 @@ static int add_del_listener(pid_t pid, c
goto cleanup;
s->pid = pid;
INIT_LIST_HEAD(&s->list);
+   s->valid = 1;
 
listeners = &per_cpu(listener_array, cpu);
down_write(&listeners->sem);




Re: [DOC]: generic netlink

2006-07-12 Thread Shailabh Nagar

Jamal Hadi Salim wrote:

On Tue, 2006-11-07 at 16:57 -0700, Randy.Dunlap wrote:



so make it a patch to Documentation/networking/...




I was going to when it got in better shape. Good suggestion, I will do
this soon and put it there as a patch.



I have some doc corrections, Jamal.  Do I send them against
the 2006-june-19 doc posting?  and as email comments or as a patch?




There have been some small changes; last time I punted it to Shailabh
for additional changes.


Sorry, haven't had time to finish up the discussions and changes on account of
the flurry of stuff going on for delay accounting patches.


You can extend the attached version (from June 20)
or send me a patch - whichever is convenient.



cheers,
jamal






1.0 Problem Statement
---

Netlink is a robust wire-format IPC typically used for kernel-user
communication, although it can also serve as a carrier for user-user
and kernel-kernel communication.

A typical netlink connection setup is of the form:

netlink_socket = socket(PF_NETLINK, socket_type, netlink_family);

where netlink_family selects the netlink "bus" to communicate
on. Example of a family would be NETLINK_ROUTE which is 0x0 or
NETLINK_XFRM which is 0x6. [Refer to RFC 3549 for a high level view
and look at include/linux/netlink.h for some of the allocated families].

Over the years, due to its robust design, netlink has become very popular.
This has resulted in the danger of running out of family numbers to issue.

In netconf 2005 in Montreal it was decided to find ways to work around
the allocation challenge and as a result NETLINK_GENERIC "bus" was born.

This document gives a mid-level view of NETLINK_GENERIC and how to use it.
The reader does not necessarily have to know what netlink is, but needs
to know at least the encapsulation used - which is described in the next
section. There are some implicit assumptions about what netlink is
or what structures like TLVs are etc. I apologize I don't have much
time to give a tutorial - invite me to some odd conference and I will
be forced to do better than this doc. Better still, send patches to this doc.

2.0 Overview
-

In order to illustrate the way different components talk to each
other, the diagram below is used to provide an abstraction of
how the operations happen.


1) The generic netlink connection which for illustration is referred
to as a "bus". The generic netlink bus is shown as split between user
and kernel domains: this means programs can connect to the bus from
either kernel or user space.

2) Users : who use the connection to get information or set variables.
These are typically programs in user space but don't have to be.

3) Providers: who supply the information sent through the connection or
execute kernel functions in response to user commands. This is
always some kernel subsystem, typically but not necessarily a module.

4) Commands: which typically define what is sent by the user and acted upon
by the provider. Commands are registered with the generic netlink bus by
providers.

In the diagram, controller, foobar and googah are providers, user1 through
user-n users in userspace and kuser-1 a user in kernel space. For brevity,
kernel space users are not discussed further.

All boxes have kernel-wide unique identifiers that can be used to
address them.

Any user can communicate with one or more providers. The interface to a
provider is defined primarily by the commands it exports as well as the
optional provider specific headers that it mandates in messages exchanged
with users, explained further below.


+--+  +--+
|  user1   |  ..  |  user-n  |
+--+---+  +---+--+
   |  |
   /  |
  |   |User
+-++-+ Space/domain
 user   ||
+   Generic Netlink Bus  +---
 kernel ||   Kernel
+--+--+--+   Space/domain
  ||  |   \
  ||  |\   +-+
  ||  | \_ | kuser-1 |
  ||  |+-+
   +--+---++---+-+ +--+-+
   |controller|| foobar  | | googah |
   +--++-+ ++

The controller is a special built-in provider. It is the repository
of info on other providers attached to the bus.  It has
a reserved address identifier of 0x10. By querying the controller,
one could find out that both foobar and googah are registered and

[Patch 4/8] Utilities for genetlink usage

2006-04-22 Thread Shailabh Nagar
genetlink-utils.patch

Two utilities for simplifying usage of NETLINK_GENERIC
interface.

Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>

 include/net/genetlink.h |   20 
 1 files changed, 20 insertions(+)

Index: linux-2.6.17-rc1/include/net/genetlink.h
===
--- linux-2.6.17-rc1.orig/include/net/genetlink.h   2006-04-21 
19:39:29.0 -0400
+++ linux-2.6.17-rc1/include/net/genetlink.h2006-04-21 20:29:19.0 
-0400
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
return nlmsg_unicast(genl_sock, skb, pid);
 }

+/**
+ * genlmsg_data - head of message payload
+ * @gnlh: genetlink message header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+   return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+   struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+   NLMSG_HDRLEN);
+   return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
 #endif /* __NET_GENERIC_NETLINK_H */


[Patch 7/7] Generic netlink interface (delay accounting)

2006-02-27 Thread Shailabh Nagar
delayacct-genetlink.patch

Create a generic netlink interface (NETLINK_GENERIC family), 
called "taskstats", for getting delay and cpu statistics of 
tasks and thread groups during their lifetime and when they exit. 

The cpu stats are available only if CONFIG_SCHEDSTATS is enabled.

When a task is alive, userspace can get its stats by sending a 
command containing its pid. Sending a tgid returns the sum of stats 
of the tasks belonging to that tgid (where such a sum makes sense). 
Together, the command interface allows stats for a large number of 
tasks to be collected more efficiently than would be possible 
through /proc or any per-pid interface. 

The netlink interface also sends the stats for each task to userspace 
when the task is exiting. This permits fine-grain accounting for 
short-lived tasks, which is important if userspace is doing its own 
aggregation of statistics based on some grouping of tasks 
(e.g. CSA jobs, ELSA banks or CKRM classes).

If the exiting task belongs to a thread group (with more members than
itself), the latter's delay stats are also sent out on the task's exit.
This allows userspace to get accurate data at a per-tgid level while
the tids of a tgid are exiting one by one.

The interface has been deliberately kept distinct from the delay 
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to 
userspace is versioned and the command interface easily extensible to 
facilitate reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay 
accounting alone.

Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>

 include/linux/delayacct.h |2 
 include/linux/taskstats.h |  125 +
 init/Kconfig  |   16 ++
 kernel/Makefile   |1 
 kernel/delayacct.c|   49 
 kernel/taskstats.c|  266 ++
 6 files changed, 456 insertions(+), 3 deletions(-)

Index: linux-2.6.16-rc4/include/linux/delayacct.h
===
--- linux-2.6.16-rc4.orig/include/linux/delayacct.h 2006-02-27 
01:53:01.0 -0500
+++ linux-2.6.16-rc4/include/linux/delayacct.h  2006-02-27 01:53:03.0 
-0500
@@ -16,6 +16,7 @@
 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;   /* Delay accounting turned on/off */
@@ -27,6 +28,7 @@ extern void __delayacct_tsk_init(struct 
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio(void);
 extern void __delayacct_swapin(void);
+extern int delayacct_add_tsk(struct taskstats_reply *, struct task_struct *);
 
 static inline void delayacct_tsk_early_init(struct task_struct *tsk)
 {
Index: linux-2.6.16-rc4/include/linux/taskstats.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.16-rc4/include/linux/taskstats.h  2006-02-27 01:53:03.0 
-0500
@@ -0,0 +1,125 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *   (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION  1
+
+struct taskstats {
+
+   /* Version 1 */
+#define TASKSTATS_NOPID	-1
+   pid_t   pid;
+   pid_t   tgid;
+
+   /* XXX_count is number of delay values recorded.
+* XXX_total is corresponding cumulative delay in nanoseconds
+*/
+
+#define TASKSTATS_NOCPUSTATS   1
+   __u64   cpu_count;
+   __u64   cpu_delay_total;/* wait, while runnable, for cpu */
+   __u64   blkio_count;
+   __u64   blkio_delay_total;  /* sync,block io completion wait*/
+   __u64   swapin_count;
+   __u64   swapin_delay_total; /* swapin page fault wait*/
+
+   __u64   cpu_run_tot

Re: [Patch 7/7] Generic netlink interface (delay accounting)

2006-02-27 Thread Shailabh Nagar

jamal wrote:


On Mon, 2006-27-02 at 03:31 -0500, Shailabh Nagar wrote:


 


+#define TASKSTATS_LISTEN_GROUP 0x1
   



You do multicast to this group - does this mean there could be multiple
listeners subscribed for this event? 

Yes, the current intent is to allow multiple listeners to receive the 
responses sent by the kernel.

In our earlier incarnation of the interface (which used connectors),
Andrew suggested that it could be multiplexed for use by other
components such as Comprehensive System Accounting, ELSA etc. that
might also be interested in exporting per-task/process stats.

Since this interface (taskstats) is currently designed for that
possibility, having multiple listeners, one for each "component" such
as delay accounting, is the model we're using.
We expect each component to have a pair of userspace programs, one for
sending commands and the other to "listen" to all replies + data
generated on task exits. The listener is expected to
register/deregister interest through TASKSTATS_CMD_LISTEN and IGNORE.


How does this correlate to TASKSTATS_CMD_LISTEN/IGNORE?
 

See above. It's mainly an optimization so that if no listener is
present, there's no need to generate the data.



Typically, an equivalent of listen is when the first user space app
joins this group; and when it leaves, you have equivalency to ignore.
Unless i misunderstood what you are trying to do with this.
 

Yes, that's what we're trying to do, except that the "first" listener
has no special connotation.





+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+   TASKSTATS_CMD_UNSPEC,   /* Reserved */
+   TASKSTATS_CMD_NONE, /* Not a valid cmd to send
+* Marks data sent on task/tgid exit */
   



you should not call it NONE just because it is unidirectional
(kernel->user). You should check in the kernel if someone sends it from
user space and/or set a command flag etc. 
 

Good point. Should check for users sending it as a cmd and treat it as
a noop. I'm just using this as a placeholder for data that's returned
without being requested.

Come to think of it, there's no real reason to have a genlmsghdr for
returned data, is there? Other than to copy the genlmsghdr that was
sent so the user can identify which command was sent (and I'm doing
that through the reply type, perhaps redundantly).



+   TASKSTATS_CMD_LISTEN,   /* Start listening */
+   TASKSTATS_CMD_IGNORE,   /* Stop listening */
+   TASKSTATS_CMD_PID,  /* Send stats for a pid */
+   TASKSTATS_CMD_TGID, /* Send stats for a tgid */
+};
+
+/* Parameters for commands
+ * New parameters should only be inserted at the struct's end
+ */
+
+struct taskstats_cmd_param {
+   union {
+   pid_t   pid;
+   pid_t   tgid;
+   } id;
+};
+
+/*
+ * Reply sent from kernel
+ * Version number affects size/format of struct taskstats only
+ */
+
+struct taskstats_reply {
+   enum outtype {
+   TASKSTATS_REPLY_NONE = 1,   /* Control cmd response */
+   TASKSTATS_REPLY_PID,/* per-pid data cmd response*/
+   TASKSTATS_REPLY_TGID,   /* per-tgid data cmd response*/
+   TASKSTATS_REPLY_EXIT_PID,   /* Exiting task's stats */
+   TASKSTATS_REPLY_EXIT_TGID,  /* Exiting tgid's stats
+* (sent on each tid's exit) */
+   } outtype;
+   __u32 version;
+   __u32 err;
+   struct taskstats stats; /* Invalid if err != 0 */
+};
+
   



Make sure you have proper alignment of above for things like x86-64
 


Will do.


+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME"TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+/* Following must be > NLMSG_MIN_TYPE */
+#define TASKSTATS_GENL_ID  0x123
   



You do not check whether you got this id on registration or not.
It is not guaranteed for you.

 

Actually, the next iteration of the code will move to a dynamically
generated ID. But yes, will need to check for that.

Thanks for the review.
Couple of questions about generic netlink:
Is it intended to remain a size that will always be aligned to
NLMSG_ALIGNTO so that (NLMSG_DATA(nlhdr) + GENL_HDRLEN) can always be
used as a pointer to the genlmsghdr?

Adding some macros like genlmsg_data(nlh) would be handy (currently I
just define and use it locally).



--Shailabh


cheers,
jamal

 





Re: [Patch 7/7] Generic netlink interface (delay accounting)

2006-02-27 Thread Shailabh Nagar

jamal wrote:


+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+   TASKSTATS_CMD_UNSPEC,   /* Reserved */
+   TASKSTATS_CMD_NONE, /* Not a valid cmd to send
+* Marks data sent on task/tgid exit */
   



you should not call it NONE just because it is unidirectional
(kernel->user). You should check in the kernel if someone sends it from
user space and/or set a command flag etc. 
 


Jamal,

Was just getting down to fixing this when I noticed that genl_rcv_msg()
will return -EOPNOTSUPP for TASKSTATS_CMD_NONE since taskstats doesn't
register a .doit for that command.

So AFAIC, there's nothing that needs to be done by way of processing.
Or did you mean just change the name of the command?

--Shailabh


Re: [Patch 7/7] Generic netlink interface (delay accounting)

2006-03-06 Thread Shailabh Nagar

Jamal,

Pls keep lkml and lse-tech on cc since some of this affects the usage
of delay accounting.


jamal wrote:


Hi Shailabh,
Apologies for taking a week to respond ..

On Mon, 2006-27-02 at 15:26 -0500, Shailabh Nagar wrote: 
 


jamal wrote:
   



 

Yes, the current intent is to allow multiple listeners to receive the 
responses sent by the kernel.
   



Responses or events? There is a difference:
Response implies the program in user space requested (ex a GET) for that
information and is receiving such info.
Event implies the program in user space asked to be informed of changes
in the kernel. Example an exit would be considered an event. 
Events are received by virtue of registering to a multicast group.
[..] 
 

My design was to have the listener get both responses (what I call
replies in the code) as well as events (data sent on exit of pid).

Since this interface (taskstats) is currently designed for that 
possibility, having multiple listeners, one for

each "component" such as delay accounting, is the model we're using.
We expect each component to have a pair of userspace programs, one for 
sending commands and the other
to "listen" to all replies + data generated on task exits. 
   



You need to have a sender of GETs essentially and a listener of events.
Those are two connections. The replies of a get from user1 will not be
sent to user2 as well - unless ... that's what you are trying to achieve;
the question is why?
 

Yes, I was trying to have an asymmetric model where the userspace
sender of GETs doesn't receive the reply as a unicast. Rather, the
reply is sent by multicast (along with all the event data).

Reason for this unintuitive design was to make it easier to process the
returned data.


The expected usage of delay accounting is to periodically "sample" the
delays for all tasks (or tgids) in the system. Also, get the delays
from exiting pids (let's forget how tgid exit is handled for
now... irrelevant to this discussion).

Using the above two pieces of data, userspace can aggregate the
"delays" seen by any grouping of tasks that it chooses to implement.

In this usage scenario, it's more efficient to have one receiver get
both response and event data and process in a loop.

However, we could switch to the model you suggest and use a
multithreaded send/receive userspace utility.

The listener 
is expected to register/deregister interest through

TASKSTATS_CMD_LISTEN and IGNORE.

   



It is not necessary if you follow the model i described.

 


How does this correlate to TASKSTATS_CMD_LISTEN/IGNORE?


 

See above. It's mainly an optimization so that if no listener is
present, there's no need to generate the data.


   



Also not necessary - there is a recent netlink addition to make sure
that events don't get sent if no listeners exist.
genetlink needs to be extended. For now assume such a thing exists.
 


Ok. Will this addition work for both unicast and multicast modes ?



 


+
   



 

Good point. Should check for users sending it as a cmd and treat it as a 
noop. 
   



More like return an -EINVAL
 


Will this be necessary ? Isn't genl_rcv_msg() going to return a -EOPNOTSUPP
automatically for us since we've not registered the command ?


 


I'm just using
this as a placeholder for data thats returned without being requested.

   



So it is unconditional?
 


Yes.

 

Come to think of it, there's no real reason to have a genlmsghdr for 
returned data, is there ?
   



All messages should be consistent whether they are sent from user
or kernel.
 


Ok. will retain genetlink header.

Other than to copy the genlmsghdr that was sent so user can identify 
which command was sent

(and I'm doing that through the reply type, perhaps redundantly).

   



yes, that is a useful trick. Just make sure they are reflected
correctly.

 

Actually, the next iteration of the code will move to a dynamically
generated ID. But yes, will need to check for that.


   



Also if you can provide feedback whether the doc I sent was any use
and what wasn't clear etc.
 


Will do.


Thanks for the review.
Couple of questions about generic netlink:
is it intended to remain a size that will always be aligned to the 
NLMSG_ALIGNTO so that (NLMSG_DATA(nlhdr) + GENL_HDRLEN) can always 
be used as a pointer to the genlmsghdr ?


   



I am not sure I followed.
The whole message (nlhdr, genlhdr, optionalhdr, TLVs) has to be in
the end 32-bit aligned.

Ok, so separate padding isn't needed to make the genlhdr, optionalhdr
and TLV parts aligned too.

Adding some macros like genlmsg_data(nlh) would be handy (currently I 
just define and use it locally).


   



Send a patch.
 


will do.


Thanks,
Shailabh


cheers,
jamal

 





Re: [Lse-tech] Re: [Patch 7/7] Generic netlink interface (delay accounting)

2006-03-08 Thread Shailabh Nagar

jamal wrote:


On Mon, 2006-06-03 at 12:00 -0500, Shailabh Nagar wrote:

 

My design was to have the listener get both responses (what I call 
replies in the code) as well as events (data sent on exit of pid)


   



I think I may not be doing justice explaining this, so let me be more
elaborate so we can be in sync.
Here is the classical way of doing things:

- Assume several apps in user space and a target in the kernel (this
could be reversed or combined in many ways, but for the sake of
simplicity/clarity make the above assumption).
- suppose we have five user space apps A, B, C, D, E; these processes
would typically do one of the following class of activities:

a) configure (ADD/NEW/DEL etc). This is issued towards the kernel to
set/create/delete/flush some scalar attribute or vector. These sorts of
commands are synchronous. i.e you issue them, you expect a response
(which may indicate success/failure etc). The response is unicast; the
effect of what they affected may cause an event which may be multicast.

b) query(GET). This is issued towards the kernel to query state of
configured items. These class of commands are also synchronous. There
are special cases of the query which dump everything in the target -
literally called "dumps". The response is unicast.

c) events. These are _asynchronous_ messages issued by the kernel to
indicate some happening in the kernel. The event may be caused by #a
above or any other activity in the kernel. Events are multicast.
To receive them you have to register for the multicast group. You do so
via sockets. You can register to many multicast groups.

For clarity again assume we have a multicast group where announcements
of exiting pids are seen, and C and D are registered to such a
multicast group.
Suppose process A exits. That would fall under #c above. C and D will be
notified.
Suppose B configures something in the kernel that forces the kernel to
have process E exit and that such an operation is successful. B will get
acknowledgement it succeeded (unicast). C and D will get notified
(multicast). 
Suppose C issued a GET to find details about a specific pid, then only
C will get that info back (message is unicast).
[A response message to a GET is typically designed to be the same as an
ADD message i.e one should be able to take exactly the same message,
change one or two things and shove back into the kernel to configure].
Suppose D issued a GET with dump flag, then D will get the details of
all pids (message is unicast).

Is this clear? Is there more than the above you need?
 

Thanks for the clarification of the usage model. While our needs are
certainly much less complex, it is useful to know the range of options.


There are no hard rules on what you need to be multicasting and as an
example you could send periodic (aka time-based) samples from the
kernel on a multicast channel and that would be received by all. It did
seem odd that you want to have a semi-promiscuous mode where a response
to a GET is multicast. If that is still what you want to achieve, then
you should.
 

Ok, we'll probably switch to the classical mode you suggest, since the
"efficient processing" rationale for choosing to operate in the
semi-promiscuous mode earlier can be overcome by writing a
multi-threaded userspace utility.

However, we could switch to the model you suggest and use a
multithreaded send/receive userspace utility.


   



This is more of the classical way of doing things. 



 


There is a recent netlink addition to make sure
that events don't get sent if no listeners exist.
genetlink needs to be extended. For now assume such a thing exists.


 


Ok. Will this addition work for both unicast and multicast modes ?

   



If you never open a connection to the kernel, nothing will be generated
towards user space.
There are other techniques to rate-limit event generation as well (one
such technique is a nagle-like algorithm used by xfrm).

 


Will this be necessary ? Isn't genl_rcv_msg() going to return a -EOPNOTSUPP
automatically for us since we've not registered the command ?

   



Yes, please in your doc feedback remind me of this,

 


Also if you can provide feedback whether the doc I sent was any use
and what wasn't clear etc.


 


Will do.

   



also take a look at the excellent documentation Thomas Graf has put in
the kernel for all the utilities for manipulating netlink messages and
tell me if that should also be put in this doc (It is listed as a TODO).


 


Ok.

Thanks,
Shailabh


cheers,
jamal

 





Re: [UPDATED PATCH] Re: [Lse-tech] Re: [Patch 7/7] Generic netlink interface (delay accounting)

2006-03-09 Thread Shailabh Nagar

Balbir Singh wrote:




Hello, Jamal,

Please find the latest version of the patch for review. The genetlink
code has been updated as per your review comments. The changelog is provided
below

1. Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
2. Provide generic functions called genlmsg_data() and genlmsg_len()
  in linux/net/genetlink.h
 


Balbir,
it might be a good idea to split 2. out separately, since it has
generic value beyond the delay accounting patches (just like we did for
the timespec_diff_ns change)


Thanks,
Shailabh


3. Do not multicast all replies, multicast only events generated due
  to task exit.
4. The taskstats and taskstats_reply structures are now 64 bit aligned.
5. Family id is dynamically generated.

Please let us know if we missed something out.

Thanks,
Balbir


Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>

---

include/linux/delayacct.h |2 
include/linux/taskstats.h |  128 
include/net/genetlink.h   |   20 +++
init/Kconfig  |   16 ++-
kernel/Makefile   |1 
kernel/delayacct.c|   56 ++
kernel/taskstats.c|  244 ++
7 files changed, 464 insertions(+), 3 deletions(-)

 






[Patch 8/9] generic netlink utility functions

2006-03-13 Thread Shailabh Nagar
genetlink-utils.patch

Two utilities for simplifying usage of NETLINK_GENERIC
interface.

Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>

 include/net/genetlink.h |   20 
 1 files changed, 20 insertions(+)

Index: linux-2.6.16-rc5/include/net/genetlink.h
===
--- linux-2.6.16-rc5.orig/include/net/genetlink.h   2006-03-11 
07:41:32.0 -0500
+++ linux-2.6.16-rc5/include/net/genetlink.h2006-03-11 07:41:41.0 
-0500
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
return nlmsg_unicast(genl_sock, skb, pid);
 }
 
+/**
+ * genlmsg_data - head of message payload
+ * @gnlh: genetlink message header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+   return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+   struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+   NLMSG_HDRLEN);
+   return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
 #endif /* __NET_GENERIC_NETLINK_H */




[Patch 9/9] Generic netlink interface for delay accounting

2006-03-13 Thread Shailabh Nagar
delayacct-genetlink.patch

Create a generic netlink interface (NETLINK_GENERIC family), 
called "taskstats", for getting delay and cpu statistics of 
tasks and thread groups during their lifetime and when they exit. 

Comments addressed (all in response to Jamal)

- Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
- Use multicast only for events generated due to task exit, not for 
replies to commands
- Align the taskstats and taskstats_reply structures on 64-bit boundaries
- Use dynamic ID generation for genetlink family

More changes expected. Following comments will go into a 
Documentation file:

When a task is alive, userspace can get its stats by sending a 
command containing its pid. Sending a tgid returns the sum of stats 
of the tasks belonging to that tgid (where such a sum makes sense). 
Together, the command interface allows stats for a large number of 
tasks to be collected more efficiently than would be possible 
through /proc or any per-pid interface. 

The netlink interface also sends the stats for each task to userspace 
when the task is exiting. This permits fine-grain accounting for 
short-lived tasks, which is important if userspace is doing its own 
aggregation of statistics based on some grouping of tasks 
(e.g. CSA jobs, ELSA banks or CKRM classes).

If the exiting task belongs to a thread group (with more members than itself),
the latter's delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tids of a tgid
are exiting one by one.

The interface has been deliberately kept distinct from the delay 
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to 
userspace is versioned and the command interface easily extensible to 
facilitate reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay 
accounting alone.

Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>

 include/linux/taskstats.h |  128 
 init/Kconfig  |   16 ++-
 kernel/taskstats.c|  244 ++
 3 files changed, 385 insertions(+), 3 deletions(-)

Index: linux-2.6.16-rc5/include/linux/taskstats.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.16-rc5/include/linux/taskstats.h  2006-03-13 19:01:35.0 
-0500
@@ -0,0 +1,128 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *   (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION  1
+
+struct taskstats {
+   /* Maintain 64-bit alignment while extending */
+
+   /* Version 1 */
+#define TASKSTATS_NOPID-1
+   __s64   pid;
+   __s64   tgid;
+
+   /* XXX_count is number of delay values recorded.
+* XXX_total is corresponding cumulative delay in nanoseconds
+*/
+
+#define TASKSTATS_NOCPUSTATS   1
+   __u64   cpu_count;
+   __u64   cpu_delay_total;/* wait, while runnable, for cpu */
+   __u64   blkio_count;
+   __u64   blkio_delay_total;  /* sync,block io completion wait*/
+   __u64   swapin_count;
+   __u64   swapin_delay_total; /* swapin page fault wait*/
+
+   __u64   cpu_run_total;  /* cpu running time
+* no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+   TASKSTATS_CMD_UNSPEC,   /* Reserved */
+   TASKSTATS_CMD_NONE, /* Not a valid cmd to send
+* Marks data sent on task/tgid exit */
+   TASKSTATS_CMD_LISTEN,   /* Start list

Re: [Patch 9/9] Generic netlink interface for delay accounting

2006-03-13 Thread Shailabh Nagar

jamal wrote:
> On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
>> On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
>>
>> Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply:
>>
>>> Note, you are still not following the standard scheme of doing things.
>>> Example: using command = GET and the message carrying the TGID to note
>>> which TGID is of interest. Instead you have command = TGID.
>>
>> meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to
>> TASKSTATS_CMD_GET_(P|TG)ID? Is that sufficient? Or am I
>> misunderstanding?
>
> I had a long description in an earlier email feedback; but the summary
> of it is that the GET command is generic, like TASKSTATS_CMD_GET; the
> message itself carries TLVs of what needs to be gotten, which are
> either PID and/or TGID etc. Anyways, there's a long spill of what I am
> saying in that earlier email. Perhaps the current patch is a transition
> towards that?
>
> cheers,
> jamal

Yes, the comments you'd made in the previous mail have not been
incorporated and this is still the older version of the patch.
We'd been avoiding TLV usage so far :-)




Re: [Patch 9/9] Generic netlink interface for delay accounting

2006-03-13 Thread Shailabh Nagar

Matt Helsley wrote:
> On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
>> Comments addressed (all in response to Jamal)
>>
>> - Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
>
> The enums for these are still in the patch. See below.
>
>> +/*
>> + * Commands sent from userspace
>> + * Not versioned. New commands should only be inserted at the enum's end
>> + */
>> +
>> +enum {
>> +   TASKSTATS_CMD_UNSPEC,   /* Reserved */
>> +   TASKSTATS_CMD_NONE, /* Not a valid cmd to send
>> +* Marks data sent on task/tgid exit */
>> +   TASKSTATS_CMD_LISTEN,   /* Start listening */
>> +   TASKSTATS_CMD_IGNORE,   /* Stop listening */
>
> From the description I thought you had eliminated these.

Yup, typo. The commands aren't registered but the enums hang on.
Will fix.

--Shailabh



[Patch 4/8] generic netlink utility functions

2006-03-29 Thread Shailabh Nagar
genetlink-utils.patch

Two utilities for simplifying usage of NETLINK_GENERIC
interface.

Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>

 include/net/genetlink.h |   20 
 1 files changed, 20 insertions(+)

Index: linux-2.6.16/include/net/genetlink.h
===
--- linux-2.6.16.orig/include/net/genetlink.h   2006-03-29 18:12:54.0 
-0500
+++ linux-2.6.16/include/net/genetlink.h2006-03-29 18:13:17.0 
-0500
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
return nlmsg_unicast(genl_sock, skb, pid);
 }

+/**
+ * genlmsg_data - head of message payload
+ * @gnlh: genetlink message header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+   return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+   struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+   NLMSG_HDRLEN);
+   return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
 #endif /* __NET_GENERIC_NETLINK_H */



[Patch 5/8] generic netlink interface for delay accounting

2006-03-29 Thread Shailabh Nagar
delayacct-genetlink.patch

Create a generic netlink interface (NETLINK_GENERIC family),
called "taskstats", for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.


Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>

 include/linux/delayacct.h |   11 +
 include/linux/taskstats.h |  113 +
 init/Kconfig  |   16 ++
 kernel/Makefile   |1
 kernel/delayacct.c|   44 ++
 kernel/taskstats.c|  292 ++
 6 files changed, 474 insertions(+), 3 deletions(-)

Index: linux-2.6.16/include/linux/delayacct.h
===
--- linux-2.6.16.orig/include/linux/delayacct.h 2006-03-29 18:13:13.0 
-0500
+++ linux-2.6.16/include/linux/delayacct.h  2006-03-29 18:13:18.0 
-0500
@@ -18,6 +18,7 @@
 #define _LINUX_TASKDELAYS_H

 #include 
+#include 

 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;   /* Delay accounting turned on/off */
@@ -27,6 +28,7 @@ extern void __delayacct_tsk_init(struct
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);

 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -64,4 +66,13 @@ static inline void delayacct_blkio_start
 static inline void delayacct_blkio_end(void)
 {}
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+   struct task_struct *tsk)
+{
+   if (!tsk->delays)
+   return -EINVAL;
+   return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16/include/linux/taskstats.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.16/include/linux/taskstats.h  2006-03-29 18:13:18.0 
-0500
@@ -0,0 +1,113 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *   (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION  1
+#define TASKSTATS_NOPID-1
+
+struct taskstats {
+   /* Maintain 64-bit alignment while extending */
+   /* Version 1 */
+
+   /* XXX_count is number of delay values recorded.
+* XXX_total is corresponding cumulative delay in nanoseconds
+*/
+
+#define TASKSTATS_NOCPUSTATS   1
+   __u64   cpu_count;
+   __u64   cpu_delay_total;/* wait, while runnable, for cpu */
+   __u64   blkio_count;
+   __u64   blkio_delay_total;  /* sync,block io completion wait*/
+   __u64   swapin_count;
+   __u64   swapin_delay_total; /* swapin page fault wait*/
+
+   __u64   cpu_run_total;  /* cpu running time
+* no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+   TASKSTATS_CMD_UNSPEC = 0,   /* Reserved */
+   TASKSTATS_CMD_GET,  /* user->kernel request/get-response */
+   TASKSTATS_CMD_NEW,  /* kernel->user event */
+   __TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+   TASKSTATS_TYPE_UNSPEC = 0,  /* Reserved */
+   TASKSTATS_TYPE_PID, /* Process id */
+   TASKSTATS_TYPE_TGID,/* Thread group id */
+   TASKSTATS_TYPE_STATS,   /* taskstats structure */
+   TASKSTATS_TYPE_AGGR_PID,/* contains pid + stats */
+   TASKSTATS_TYPE_AGGR_TGID,   /* contains tgid + stats */
+   __TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+

Re: [Patch 5/8] generic netlink interface for delay accounting

2006-03-30 Thread Shailabh Nagar

Balbir Singh wrote:
> Hi, Andrew
>
> On Wed, Mar 29, 2006 at 09:04:06PM -0800, Andrew Morton wrote:
>> Shailabh Nagar <[EMAIL PROTECTED]> wrote:
>>> delayacct-genetlink.patch
>>>
>>> Create a generic netlink interface (NETLINK_GENERIC family),
>>> called "taskstats", for getting delay and cpu statistics of
>>> tasks and thread groups during their lifetime and when they exit.
>>>
>>> +
>>> +static int __init taskstats_init(void)
>>> +{
>>> +   if (genl_register_family(&family))
>>> +   return -EFAULT;
>>
>> EFAULT?
>
> It shouldn't be (Shailabh please comment). We will fix it.

Sorry, it should return the value returned by genl_register_family.

I copied the code from net/tipc/netlink.c,
where a similarly erroneous errno is being used. We'll submit a fix
for that as well.

--Shailabh