Re: [PATCH] bpf: Fix backport of "bpf: restrict unknown scalars of mixed signed bounds for unprivileged"

2021-04-19 Thread Balbir Singh
On Mon, Apr 19, 2021 at 04:56:41PM -0700, Samuel Mendoza-Jonas wrote:
> The 4.14 backport of 9d7eceede ("bpf: restrict unknown scalars of mixed
> signed bounds for unprivileged") adds the PTR_TO_MAP_VALUE check to the
> wrong location in adjust_ptr_min_max_vals(), most likely because 4.14
> doesn't include the commit that updates the if-statement to a
> switch-statement (aad2eeaf4 "bpf: Simplify ptr_min_max_vals adjustment").
> 
> Move the check to the proper location in adjust_ptr_min_max_vals().
> 
> Fixes: 17efa65350c5a ("bpf: restrict unknown scalars of mixed signed bounds 
> for unprivileged")
> Signed-off-by: Samuel Mendoza-Jonas 
> Reviewed-by: Frank van der Linden 
> Reviewed-by: Ethan Chen 
> ---

Thanks for catching it :)

Reviewed-by: Balbir Singh 


Re: [PATCH v3] tcp: verify the checksum of the first data segment in a new connection

2018-06-12 Thread Balbir Singh
On Wed, Jun 13, 2018 at 9:09 AM, Frank van der Linden
 wrote:
> commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
> table") introduced an optimization for the handling of child sockets
> created for a new TCP connection.
>
> But this optimization passes any data associated with the last ACK of the
> connection handshake up the stack without verifying its checksum, because it
> calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
> directly.  These lower-level processing functions do not do any checksum
> verification.
>
> Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
> fix this.
>
> Signed-off-by: Frank van der Linden 
> ---
>  net/ipv4/tcp_ipv4.c | 4 
>  net/ipv6/tcp_ipv6.c | 4 
>  2 files changed, 8 insertions(+)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index f70586b..ef8cd0f 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1689,6 +1689,10 @@ int tcp_v4_rcv(struct sk_buff *skb)
> reqsk_put(req);
> goto discard_it;
> }
> +   if (tcp_checksum_complete(skb)) {
> +   reqsk_put(req);
> +   goto csum_error;
> +   }
> if (unlikely(sk->sk_state != TCP_LISTEN)) {
> inet_csk_reqsk_queue_drop_and_put(sk, req);
> goto lookup;
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 6d664d8..5d4eb9d 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1475,6 +1475,10 @@ static int tcp_v6_rcv(struct sk_buff *skb)
> reqsk_put(req);
> goto discard_it;
> }
> +   if (tcp_checksum_complete(skb)) {
> +   reqsk_put(req);
> +   goto csum_error;
> +   }
> if (unlikely(sk->sk_state != TCP_LISTEN)) {
>     inet_csk_reqsk_queue_drop_and_put(sk, req);
> goto lookup;


I've tested the IPv4 variant with some changes

Tested-by: Balbir Singh 
Reviewed-by: Balbir Singh 

Balbir


Re: [PATCH v2] tcp: verify the checksum of the first data segment in a new connection

2018-06-12 Thread Balbir Singh
On Wed, Jun 13, 2018 at 7:50 AM, Eric Dumazet  wrote:
>
>
> On 06/12/2018 02:41 PM, Frank van der Linden wrote:
>> commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
>> table") introduced an optimization for the handling of child sockets
>> created for a new TCP connection.
>>
>> But this optimization passes any data associated with the last ACK of the
>> connection handshake up the stack without verifying its checksum, because it
>> calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
>> directly.  These lower-level processing functions do not do any checksum
>> verification.
>>
>> Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
>> fix this.
>>
>> Signed-off-by: Frank van der Linden 
>
>
> This is way too complicated.
>
> You should call tcp_checksum_complete() earlier and avoid all this mess.
>
>
> IPV4 part shown here :
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 
> fed3f1c6616708997f621535efe9412e4afa0a50..7b5f32aa3835b0124b0a9bd342c371df7b46f471
>  100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1730,6 +1730,10 @@ int tcp_v4_rcv(struct sk_buff *skb)
> reqsk_put(req);
> goto discard_it;
> }
> +   if (unlikely(tcp_checksum_complete(skb))) {
> +   reqsk_put(req);
> +   goto csum_error;
> +   }

I like this variant, it is not under the sock_lock, but it skips
tcp_filter() as Frank pointed out.

I tested a variant of this patch with an increment of MIB/CSUM errors
and jump to discard_and_relse and it seemed to pass my testing


Balbir Singh.


Re: [PATCH net-next 9/9] taskstats: use the libnl API to align nlattr on 64-bit

2016-04-27 Thread Balbir Singh


On 27/04/16 17:29, Nicolas Dichtel wrote:
> Le 27/04/2016 03:14, Balbir Singh a écrit :
>>
>>
>> On 23/04/16 01:31, Nicolas Dichtel wrote:
>>> Goal of this patch is to use the new libnl API to align netlink attribute
>>> when needed.
>>> The layout of the netlink message will be a bit different after the patch,
>>> because the padattr (TASKSTATS_TYPE_STATS) will be inside the nested
>>> attribute instead of before it.
>>>
>>> Signed-off-by: Nicolas Dichtel 
>>
>> The layout will change/break user space -- I've not tested the patch though..
> Sigh.
> 
> I quote David:
> "All userspace components using netlink should always ignore attributes
> they do not recognize in dumps.
> 
> This is one of the most basic principles of netlink"
> 
> Do you have some pointers so I can made some tests?
> 

Please try

https://www.kernel.org/doc/Documentation/accounting/getdelays.c

iotop uses it as well. My concern is ABI breakage of user space.

Balbir Singh


Re: [PATCH net-next 9/9] taskstats: use the libnl API to align nlattr on 64-bit

2016-04-26 Thread Balbir Singh


On 23/04/16 01:31, Nicolas Dichtel wrote:
> Goal of this patch is to use the new libnl API to align netlink attribute
> when needed.
> The layout of the netlink message will be a bit different after the patch,
> because the padattr (TASKSTATS_TYPE_STATS) will be inside the nested
> attribute instead of before it.
> 
> Signed-off-by: Nicolas Dichtel 

The layout will change/break user space -- I've not tested the patch though..

Balbir Singh.
 


[RFC][PATCH -mm] Fix getdelays.c - cpumask length and error reporting.

2006-09-11 Thread Balbir Singh


Fix the length passed while (un)registering cpumask. We were passing
sizeof the array, make it strlen().

Error value printed in fatal errors should be derived from the message.
The message contains an nlmsgerr embedded with an error value. We must
report that value to the user.

Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
---

 Documentation/accounting/getdelays.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff -puN Documentation/accounting/getdelays.c~taskstats-fix-delayacct-utility 
Documentation/accounting/getdelays.c
--- 
linux-2.6.18-rc6/Documentation/accounting/getdelays.c~taskstats-fix-delayacct-utility
   2006-09-11 12:43:50.0 +0530
+++ linux-2.6.18-rc6-balbir/Documentation/accounting/getdelays.c
2006-09-11 12:46:57.0 +0530
@@ -285,7 +285,7 @@ int main(int argc, char *argv[])
if (maskset) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
  TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
- &cpumask, sizeof(cpumask));
+ &cpumask, strlen(cpumask) + 1);
PRINTF("Sent register cpumask, retval %d\n", rc);
if (rc < 0) {
printf("error sending register cpumask\n");
@@ -315,7 +315,8 @@ int main(int argc, char *argv[])
}
if (msg.n.nlmsg_type == NLMSG_ERROR ||
!NLMSG_OK((&msg.n), rep_len)) {
-   printf("fatal reply error,  errno %d\n", errno);
+   struct nlmsgerr *err = NLMSG_DATA(&msg);
+   printf("fatal reply error,  errno %d\n", err->error);
goto done;
}
 
@@ -383,7 +384,7 @@ done:
if (maskset) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
  TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
- &cpumask, sizeof(cpumask));
+ &cpumask, strlen(cpumask) + 1);
printf("Sent deregister mask, retval %d\n", rc);
if (rc < 0)
    err(rc, "error sending deregister cpumask\n");
_

-- 

Balbir Singh,
Linux Technology Center,
IBM Software Labs
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH -mm] Fix taskstats size calculation (use the new genetlink utility functions)

2006-09-11 Thread Balbir Singh


The addition of the CSA patch pushed the size of struct taskstats to 256
bytes. This exposed a problem with prepare_reply(), we were not allocating
space for the netlink and genetlink header. It worked earlier because
alloc_skb() would align the skb to SMP_CACHE_BYTES, which added some additonal
bytes.

Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
---

 kernel/taskstats.c |2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff -puN kernel/taskstats.c~taskstats-fix-msg-size kernel/taskstats.c
--- linux-2.6.18-rc6/kernel/taskstats.c~taskstats-fix-msg-size  2006-09-11 
11:42:40.0 +0530
+++ linux-2.6.18-rc6-balbir/kernel/taskstats.c  2006-09-11 11:42:55.0 
+0530
@@ -77,7 +77,7 @@ static int prepare_reply(struct genl_inf
/*
 * If new attributes are added, please revisit this allocation
 */
-   skb = nlmsg_new(size, GFP_KERNEL);
+   skb = nlmsg_new(genlmsg_total_size(size), GFP_KERNEL);
if (!skb)
return -ENOMEM;
 
_

-- 

    Balbir Singh,
Linux Technology Center,
IBM Software Labs
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH -mm] Add genetlink utilities for payload length calculation

2006-09-11 Thread Balbir Singh


Add two utility helper functions genlmsg_msg_size() and genlmsg_total_size().
These functions are derived from their netlink counterparts.

Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
---

 include/net/genetlink.h |   18 ++
 1 files changed, 18 insertions(+)

diff -puN include/net/genetlink.h~genetlink-payload-size-helpers 
include/net/genetlink.h
--- linux-2.6.18-rc6/include/net/genetlink.h~genetlink-payload-size-helpers 
2006-09-11 10:34:56.0 +0530
+++ linux-2.6.18-rc6-balbir/include/net/genetlink.h 2006-09-11 
11:42:37.0 +0530
@@ -171,4 +171,22 @@ static inline int genlmsg_len(const stru
return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
 }
 
+/**
+ * genlmsg_msg_size - length of genetlink message not including padding
+ * @payload: length of message payload
+ */
+static inline int genlmsg_msg_size(int payload)
+{
+   return GENL_HDRLEN + payload;
+}
+
+/**
+ * genlmsg_total_size - length of genetlink message including padding
+ * @payload: length of message payload
+ */
+static inline int genlmsg_total_size(int payload)
+{
+   return NLMSG_ALIGN(genlmsg_msg_size(payload));
+}
+
 #endif /* __NET_GENERIC_NETLINK_H */
_

-- 

    Balbir Singh,
Linux Technology Center,
IBM Software Labs
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: netlink vs. debugfs (was Re: [Patch 0/6] statistics infrastructure)

2006-05-22 Thread Balbir Singh
On Mon, May 22, 2006 at 11:09:22AM -0700, Tim Bird wrote:
> Andrew Morton wrote:
> > Martin Peschke <[EMAIL PROTECTED]> wrote:
> >> My patch series is a proposal for a generic implementation of statistics.
> > 
> > This uses debugfs for the user interface, but the
> > per-task-delay-accounting-*.patch series from Balbir creates an extensible
> > netlink-based system for passing instrumentation results back to userspace.
> > 
> > Can this code be converted to use those netlink interfaces, or is Balbir's
> > approach unsuitable, or hasn't it even been considered, or what?
> 
> Can someone give me the 20-second elevator pitch on why
> netlink is preferred over debugfs?  I've heard of a
> number of debugfs/procfs users requested to switch over.
> 
> Thanks,
>  -- Tim
> 
> =
> Tim Bird
> Architecture Group Chair, CE Linux Forum
> Senior Staff Engineer, Sony Electronics
> =

Hi, Tim,

I am no debugfs expert, I hope I can do justice to the comparison.

Debugfs Netlink/Genetlink

1. Filesystem based - requires creating Several types of data can
   files for each type of data passed   be multiplexed over one netlink
   down socket.
2. Hard to determine record format/data Contains metadata including
type of data and length
with each record
3. Notifications are hard   Notifications are very easy
   I think they can be done using inotify   good library support for
notifications. Data can
either be broadcast or
selectively mulitcast
4. Requires several open/read/write/close   A single socket can be
   operations   opened, data from kernel
space can be multiplexed
over it.

I don't think I did any justice to the advantages of debugfs. The only
one I can think of is that it uses relayfs. Relayfs is efficient in the
sense that it uses per-cpu buffers.

Anybody else want to take a shot in comparing the two?

Balbir Singh,
Linux Technology Center,
IBM Software Labs
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Patch 4/8] Utilities for genetlink usage

2006-05-01 Thread Balbir Singh

genetlink-utils.patch

Two utilities for simplifying usage of NETLINK_GENERIC
interface.

Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
---

 include/net/genetlink.h |   20 
 1 files changed, 20 insertions(+)

diff -puN include/net/genetlink.h~genetlink-utils include/net/genetlink.h
--- linux-2.6.17-rc3/include/net/genetlink.h~genetlink-utils2006-05-02 
07:35:15.0 +0530
+++ linux-2.6.17-rc3-balbir/include/net/genetlink.h 2006-05-02 
07:35:52.0 +0530
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
return nlmsg_unicast(genl_sock, skb, pid);
 }
 
+/**
+ * gennlmsg_data - head of message payload
+ * @gnlh: genetlink messsage header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+   return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+   struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+   NLMSG_HDRLEN);
+   return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
 #endif /* __NET_GENERIC_NETLINK_H */
_
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: dcache leak in 2.6.16-git8 II

2006-03-29 Thread Balbir Singh
On Thu, Mar 30, 2006 at 12:53:24AM +0200, Andi Kleen wrote:
> On Thursday 30 March 2006 00:50, Andrew Morton wrote:
> 
> > It looks that way.  Didn't someone else report a sock_inode_cache leak?
> 
> Didn't see it.
>  
> > > I still got a copy of the /proc in case anybody wants more information.
> > 
> > We have this fancy new /proc/slab_allocators now, it might show something
> > interesting.  It needs CONFIG_DEBUG_SLAB_LEAK.
> 
> I didn't have that enabled unfortunately. I can try it on the next round.
> 
> -Andi
>

There is also a new sysctl to drop caches. It is called vm.drop_caches.
It will be interesting to see if it is able to free up some dcache memory
for you.

Balbir
 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch 5/8] generic netlink interface for delay accounting

2006-03-29 Thread Balbir Singh
On Wed, Mar 29, 2006 at 10:26:29PM -0800, Andrew Morton wrote:
> Balbir Singh <[EMAIL PROTECTED]> wrote:
> >
> > > The kmem_cache_free() can happen outside the lock.
> > 
> > 
> > kmem_cache_free() and setting to NULL outside the lock is prone to
> > race conditions. Consider the following scenario
> > 
> > A thread group T1 has exiting processes P1 and P2
> > 
> > P1 is exiting, finishes the delay accounting by calling taskstats_exit_pid()
> > and gives up the mutex and calls kmem_cache_free(), but before it can set
> > tsk->delays to NULL, we try to get statistics for the entire thread group.
> > This task will show up in the thread group with a dangling tsk->delays.
> 
> Yes, the `tsk->delays = NULL;' needs to happen inside the lock.  But the
> kmem_cache_free() does not.  It pointlessly increases the lock hold time.

Understood will fix it

> 
> > > > +   if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > > > +   u32 pid = 
> > > > nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > > > +   rc = fill_pid((pid_t)pid, NULL, &stats);
> > > 
> > > We shouldn't have a typecast here.  If it generates a warning then we need
> > > to get in there and find out why.
> > 
> > The reason for a typecast is that pid is passed as a u32 from userspace.
> > genetlink currently supports most unsigned types with little or no
> > support for signed types. We exchange data as u32 and do the correct
> > thing in the kernel. Would you like us to move away from this?
> > 
> 
> I think it's best to avoid the cast unless it's actually needed to avoid a
> warning or compile error, or to do special things with sign extension. 
> Because casts clutter up the code and can hide real bugs.  In this case the
> compiler should silently perform the conversion.

Yep, the compiler was doing it for me, but I tried to be smart and cast
things around. Will fix it.

Thanks,
Balbir
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch 5/8] generic netlink interface for delay accounting

2006-03-29 Thread Balbir Singh
Hi, Andrew


On Wed, Mar 29, 2006 at 09:04:06PM -0800, Andrew Morton wrote:
> Shailabh Nagar <[EMAIL PROTECTED]> wrote:
> >
> > delayacct-genetlink.patch
> > 
> > Create a generic netlink interface (NETLINK_GENERIC family),
> > called "taskstats", for getting delay and cpu statistics of
> > tasks and thread groups during their lifetime and when they exit.
> > 
> > 
> 
> It's be best to have a netlink person review the netlinkisms here.

Jamal did review this code, his comments are available at
http://lkml.org/lkml/2006/3/26/71. His comments were very helpful in
doing the correct thing and understanding the design and usage of
genetlink.

> 
> > +static inline int delayacct_add_tsk(struct taskstats *d,
> > +   struct task_struct *tsk)
> > +{
> > +   if (!tsk->delays)
> > +   return -EINVAL;
> > +   return __delayacct_add_tsk(d, tsk);
> > +}
> 
> hm.  It's a worry that this can return an error if delay accounting simply
> isn't enabled.

Yes, if CONFIG_TASKSTATS is enabled and the kernel is booted without
delayacct. A user space utility trying to extract statistics will get
an error.

> 
> > +struct taskstats {
> > +   /* Maintain 64-bit alignment while extending */
> > +   /* Version 1 */
> > +
> > +   /* XXX_count is number of delay values recorded.
> > +* XXX_total is corresponding cumulative delay in nanoseconds
> > +*/
> > +
> > +#define TASKSTATS_NOCPUSTATS   1
> > +   __u64   cpu_count;
> > +   __u64   cpu_delay_total;/* wait, while runnable, for cpu */
> > +   __u64   blkio_count;
> > +   __u64   blkio_delay_total;  /* sync,block io completion wait*/
> > +   __u64   swapin_count;
> > +   __u64   swapin_delay_total; /* swapin page fault wait*/
> > +
> > +   __u64   cpu_run_total;  /* cpu running time
> > +* no count available/provided */
> > +};
> 
> What locking is used to make updates to these u64's appear to be atomic? 
> Maybe it's deliberately nonatomic.  Either way, it needs a comment.

These fields are protected by tsk->delays->lock. We will add comments
to indicate the same.

> 
> >  void __delayacct_tsk_exit(struct task_struct *tsk)
> >  {
> > +   /*
> > +* Protect against racing thread group exits
> > +*/
> > +   mutex_lock(&delayacct_exit_mutex);
> > +   taskstats_exit_pid(tsk);
> > if (tsk->delays) {
> > kmem_cache_free(delayacct_cache, tsk->delays);
> > tsk->delays = NULL;
> > }
> > +   mutex_unlock(&delayacct_exit_mutex);
> >  }
> 
> hm, I wonder how contended that lock is likely to be.
>

It is taken for every exiting task. We did not measure the contention
using any tool.
 
> The kmem_cache_free() can happen outside the lock.


kmem_cache_free() and setting to NULL outside the lock is prone to
race conditions. Consider the following scenario

A thread group T1 has exiting processes P1 and P2

P1 is exiting, finishes the delay accounting by calling taskstats_exit_pid()
and gives up the mutex and calls kmem_cache_free(), but before it can set
tsk->delays to NULL, we try to get statistics for the entire thread group.
This task will show up in the thread group with a dangling tsk->delays.

> 
> > +#ifdef CONFIG_TASKSTATS
> > +int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
> > +{
> > +   nsec_t tmp;
> > +   struct timespec ts;
> > +   unsigned long t1,t2;
> > +
> > +   /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
> > +
> > +   tmp = (nsec_t)d->cpu_run_total ;
> 
> stray space before semicolon.

Will fix it.

> 
> > +   tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
> > +   d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
> 
> Missing space before ?, missing space before :

Will fix it.

> 
> > +   d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
> > +   d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
> 
> dittos.

Will fix it.

> 
> > ===
> > --- /dev/null   1970-01-01 00:00:00.0 +
> > +++ linux-2.6.16/kernel/taskstats.c 2006-03-29 18:13:18.0 -0500
> > @@ -0,0 +1,292 @@
> > +/*
> > + * taskstats.c - Export per-task statistics to userland
> > + *
> > + * Copyright (C) Shailabh Nagar, IBM Corp. 2006
> > + *   (C) Balbir Singh,   IBM Corp. 2006
> > + *
> > + * 

Re: [Patch 8/9] generic netlink utility functions

2006-03-26 Thread Balbir Singh
On Mon, Mar 13, 2006 at 07:55:05PM -0500, Shailabh Nagar wrote:
> genetlink-utils.patch
> 
> Two utilities for simplifying usage of NETLINK_GENERIC
> interface.
> 
> Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
> Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
> 
>  include/net/genetlink.h |   20 
>  1 files changed, 20 insertions(+)
> 
> Index: linux-2.6.16-rc5/include/net/genetlink.h
> ===
> --- linux-2.6.16-rc5.orig/include/net/genetlink.h 2006-03-11 
> 07:41:32.0 -0500
> +++ linux-2.6.16-rc5/include/net/genetlink.h  2006-03-11 07:41:41.0 
> -0500
> @@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
>   return nlmsg_unicast(genl_sock, skb, pid);
>  }
>  
> +/**
> + * gennlmsg_data - head of message payload
> + * @gnlh: genetlink messsage header
> + */
> +static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
> +{
> +   return ((unsigned char *) gnlh + GENL_HDRLEN);
> +}
> +
> +/**
> + * genlmsg_len - length of message payload
> + * @gnlh: genetlink message header
> + */
> +static inline int genlmsg_len(const struct genlmsghdr *gnlh)
> +{
> +   struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
> +   NLMSG_HDRLEN);
> +   return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
> +}
> +
>  #endif   /* __NET_GENERIC_NETLINK_H */
> 
>

Jamal,

Does the implementation of these utilities look ok? We use genlmsg_data()
in the delay accounting code but not genlmsg_len(), hence it might not
be very well tested (just reviewed).

Thanks,
Balbir 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

2006-03-26 Thread Balbir Singh
On Sun, Mar 26, 2006 at 09:05:18AM -0500, jamal wrote:
> On Sat, 2006-25-03 at 23:52 +0530, Balbir Singh wrote:
> 
> 
> > No, we cannot have both passed. If we pass both a PID and a TGID and
> > then the code returns just the stats for the PID.
> > 
> 
> ok, that clears it then; i think you are ready to go.

Cool! Thanks for all your help and review.

> 
> > >
> > > Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> > > case of failures to add TLVs?
> > >
> > 
> > I thought about it, but when I looked at the code of genlmsg_cancel()
> > and nla_nest_cancel().  It seemed that genlmsg_cancel() should
> > suffice.
> > 
> 
> If your policy is to never send a message if anything fails, then you
> are fine.
> 
> What would be really useful now that you understand this, is if you can
> help extending/cleaning the document i sent you. Or send me a table of
> contents of how it would have flowed better for you.

I will start looking at the document and see what changes can be made.
I will try and update the document from a genetlink programmers perspective
i.e. things to know, avoid, etc.

> 
> cheers,
> jamal
> 
>

Thanks,
Balbir 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

2006-03-25 Thread Balbir Singh
On 3/25/06, jamal <[EMAIL PROTECTED]> wrote:
> On Sat, 2006-25-03 at 21:06 +0530, Balbir Singh wrote:
> > On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:
>
>
> I didnt pay attention to failure paths etc; i suppose your testing
> should catch those. Getting there, a couple more comments:
>

Yes, I have tried several negative test cases.

>
> > +enum {
> > + TASKSTATS_CMD_UNSPEC = 0,   /* Reserved */
> > + TASKSTATS_CMD_GET,  /* user->kernel request */
> > + TASKSTATS_CMD_NEW,  /* kernel->user event */
>
> Should the comment read "kernel->user event/get-response"
>

Yes, good catch. I will update the comment.

>
> > +
> > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info 
> > *info)
> > +{
>
>
> > +
> > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > + rc = fill_pid((pid_t)pid, NULL, &stats);
> > + if (rc < 0)
> > + goto err;
> > +
> > + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> > + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
>
> in regards to the elseif above:
> Could you not have both PID and TGID passed? From my earlier
> understanding it seemed legit, no? if answer is yes, then you will have
> to do your sizes + reply TLVs at the end.

No, we cannot have both passed. If we pass both a PID and a TGID and
then the code returns just the stats for the PID.

>
> Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> case of failures to add TLVs?
>

I thought about it, but when I looked at the code of genlmsg_cancel()
and nla_nest_cancel().  It seemed that genlmsg_cancel() should
suffice.


static inline int genlmsg_cancel(struct sk_buff *skb, void *hdr)
{
return nlmsg_cancel(skb, hdr - GENL_HDRLEN - NLMSG_HDRLEN);
}

static inline int nlmsg_cancel(struct sk_buff *skb, struct nlmsghdr *nlh)
{
skb_trim(skb, (unsigned char *) nlh - skb->data);

return -1;
}

static inline int nla_nest_cancel(struct sk_buff *skb, struct nlattr *start)
{
if (start)
skb_trim(skb, (unsigned char *) start - skb->data);

return -1;
}



genlmsg_cancel() seemed more generic, since it handles skb_trim from
the nlmsghdr down to skb->data, where as nla_test_cancel() does it
only from the start of the nested attributes to skb->data.

Is my understanding correct?


> cheers,
> jamal
>

Thanks,
Balbir
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

2006-03-25 Thread Balbir Singh
On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:
> On Sat, 2006-25-03 at 15:11 +0530, Balbir Singh wrote:
> 
> > 
> > Thanks for the advice, I will dive into nesting. I could not find any 
> > in tree users who use nesting, so I have a few questions
> > 
> 
> Hrm - I have to say i am suprised theres nothing; i could have sworn
> Thomas had done some conversions already.

I used cscope to check for global references and callers of nla_nest_.*.
I could not find anything.

> 
> > nla_nest_start() accepts two parameters an skb and an attribute type.
> > Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
> > contain the nested attributes 
> > 
> > TASKSTATS_TYPE_AGGR
> >TASKSTATS_TYPE_PID/TGID
> >TASKSTATS_TYPE_STATS
> > 
> >
> > but this will lead to
> > 
> > TASKSTATS_TYPE_AGGR
> >TASKSTATS_TYPE_PID
> >TASKSTATS_TYPE_STATS
> > TASKSTATS_TYPE_AGGR
> >TASKSTATS_TYPE_TGID
> >TASKSTATS_TYPE_STATS
> > 
> > being returned from taskstats_exit_pid().
> > 
> 
> no this is wrong by virtue of having TASKSTATS_TYPE_AGGR twice.
> Again invoke the rule i cited earlier.

Yes, thats why I wanted to point it out to you. Thanks for explaining the
rule.

> What you could do instead is a second AGGR; and your nesting would be:
> 
> TASKSTATS_TYPE_AGGR1 <--- nest start with this type
>TASKSTATS_TYPE_PID <-- NLA_U32_PUT
>TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
>  <-- nest end of TASKSTATS_TYPE_AGGR1
> TASKSTATS_TYPE_AGGR2 <--- nest start with this type
>TASKSTATS_TYPE_TGID <-- NLA_U32_PUT
>TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
><-- nest end of TASKSTATS_TYPE_AGGR2
> 
> > The other option is to nest
> > 
> > TASKSTATS_TYPE_PID/TGID
> >TASKSTATS_TYPE_STATS
> > 
> 
> The advantage being you dont introduce another T.
> 
> > but the problem with this approach is, nla_len contains the length of
> > all attributes including the nested attribute. So it is hard to find
> > the offset of TASKSTATS_TYPE_STATS in the buffer.
> > 
> 
> So you would distinguish the two as have something like:
> 
> TASKSTATS_TYPE_PID
>u32 pid
>TASKSTATS_TYPE_STATS
> TASKSTATS_TYPE_TGID
>u32 tgid
>TASKSTATS_TYPE_STATS
> or
> TASKSTATS_TYPE_PID
>u32 pid
> TASKSTATS_TYPE_TGID
>u32 tgid
> 
> both should be fine. The difference between the two is the length in the
> second case will be 4 and in the other case will be larger. 
> 
> But come to think of it, this will introduce unneeded semantics; you
> have very few items to do, so forget it. Go with scheme #1 but change
> the names to TASKSTATS_TYPE_AGGR_PID and TASKSTATS_TYPE_AGGR_TGID.
>

I prefer #1 as well. The overloaded use of the same type with different lengths
can be confusing.
 
> cheers,
> jamal
> 

Here is another attempt (one more iteration) at trying to get it right.
Thank you for your patience and help in getting it right.

Changelog
-

As discussed in our email.

Thanks,
Balbir

Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>

---

 include/linux/delayacct.h |   11 +
 include/linux/taskstats.h |  113 +
 init/Kconfig  |   16 ++
 kernel/Makefile   |1 
 kernel/delayacct.c|   44 ++
 kernel/taskstats.c|  291 ++
 6 files changed, 473 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink 
include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink  2006-03-22 
11:56:03.0 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h   2006-03-22 
11:56:03.0 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 
 #include 
+#include 
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;   /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct 
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
 extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
return 0;
 }
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+   struct task_struct *tsk)
+{
+   if (!tsk->delays)
+   return -EINVAL;
+   return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* 

Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

2006-03-25 Thread Balbir Singh
On Fri, Mar 24, 2006 at 08:19:25PM -0500, jamal wrote:
> On Fri, 2006-24-03 at 20:24 +0530, Balbir Singh wrote:
> 
> > Hmm... Would it be ok to send one message with the following format
> > 
> > 1. TLV=TASKSTATS_TYPE_PID
> > 2. TLV=TASKSTATS_TYPE_STATS
> > 3. TLV=TASKSTATS_TYPE_TGID
> > 4. TLV=TASKSTATS_TYPE_STATS
> > 
> > It would still be one message, except that 3 and 4 would be optional.
> > What do you think?
> > 
> 
> No, that wont work since #2 and #4 are basically the same TLV. [Recall
> that "T" is used to index an array]. Your other alternative is to have
> #4 perhaps called TASKSTATS_TGID_STATS and #2 TASKSTATS_PID_STATS
> although that would smell a little.
> Dont be afraid to do the nest, it will be a little painful initially but
> i am sure once you figure it out you will appreciate it.
>

Thanks for the advice, I will dive into nesting. I could not find any 
in tree users who use nesting, so I have a few questions

nla_nest_start() accepts two parameters an skb and an attribute type.
Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
contain the nested attributes 

TASKSTATS_TYPE_AGGR
   TASKSTATS_TYPE_PID/TGID
   TASKSTATS_TYPE_STATS

but this will lead to


TASKSTATS_TYPE_AGGR
   TASKSTATS_TYPE_PID
   TASKSTATS_TYPE_STATS
TASKSTATS_TYPE_AGGR
   TASKSTATS_TYPE_TGID
   TASKSTATS_TYPE_STATS

being returned from taskstats_exit_pid().

The other option is to nest

TASKSTATS_TYPE_PID/TGID
   TASKSTATS_TYPE_STATS

but the problem with this approach is, nla_len contains the length of
all attributes including the nested attribute. So it is hard to find
the offset of TASKSTATS_TYPE_STATS in the buffer.

Do I understand NLA nesting at all? May be I am missing something obvious.

Thanks,
Balbir
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

2006-03-24 Thread Balbir Singh
On Fri, Mar 24, 2006 at 09:11:58AM -0500, jamal wrote:
> On Fri, 2006-24-03 at 07:02 +0530, Balbir Singh wrote:
> > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
> 
> > 3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)
> 
> Not the right size; the u32 covers the V part of TLV. The T = 16 bits
> and L = 16 bits. And if you nest TLVs, then it gets more interesting.
> 
> Look at using proper macros instead of hard coding like you did.
> grep for something like RTA_SPACE and perhaps send a patch to make it
> generic for netlink.h
> 
> cheers,
> jamal
>

My bad,  but I was wondering why my test case did not segfault until
I saw the implementation of nlmsg_new :-)

I will fix this and use nla_total_size() to calculate the correct sizes
including padding and TLV.

Thanks again,
Balbir

 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

2006-03-24 Thread Balbir Singh
On Fri, Mar 24, 2006 at 09:04:21AM -0500, jamal wrote:
> On Thu, 2006-23-03 at 21:11 +0530, Balbir Singh wrote:
> > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
> 
> > > Should there be at least either a pid or tgid? If yes, you need to
> > > validate here...
> > >
> > 
> > Yes, you are correct. One of my test cases caught it too.. But I did
> > not want to untidy the code with if-else's which will keep growing if
> > the attributes change in the future. I just followed the controller
> > example. I will change it and validate it. Currently if the attribute
> > is not valid, a stat of all zero's is returned back.
> >  
> 
> There are many ways to skin this cat.
> As an example: you could make pid and tgid global to the function and
> set them to 0. At the end of the if statements, you could check if at
> least one of them is set - if not you know none was passed and bail out.

The latest patch does fix it this issue. In the Changelog
6. taskstats_send_stats() now validates the command attributes and ensures
  that it either gets a PID or a TGID. If it gets both simultaneously
  the PID stats are sent.

Is this change ok with you?

> 
> > > As a general comment double check your logic for errors; if you already
> > > have stashed something in the skb, you need to remove it etc.
> > > 
> > 
> > Wouldn't genlmsg_cancel() take care of cleaning all attributes?
> > 
> 
> it would for attribute setting - but not for others. As a general
> comment this is one of those areas where cutnpasting aka TheLinuxWay(tm)
> could result in errors.

:-) I understand.
What I have done is moved all the NLA_PUT_U32 to after verifying the
return values of functions fill_*(). That way we do not stash anything into the
skb if there are pending errors.

> 
> 
> > > A single message with PID+TGID sounds reasonable. Why two messages with
> > > two stats? all you will need to do is get rid of the prepare_reply()
> > > above and NLA_PUT_U32() below (just like you do in a response to a GET.
> > > 
> > 
> > The reason for two stats is that for TGID, we return accumulated values
> > (of all threads in the group) and for PID we return the value just
> > for that pid. The return value is
> > 
> 
> Ok, I understand the dilemma now - but still not thrilled with having
> two messages.
> Perhaps you could have nesting of TLVs? This is widely used in the net
> code for example
> i.e:
> 
> TLV = TASKSTATS_TYPE_TGID/PID
> TLV = TASKSTATS_TYPE_STATS
> 
> Look at using nla_nest_start/end/cancel

Hmm... Would it be ok to send one message with the following format

1. TLV=TASKSTATS_TYPE_PID
2. TLV=TASKSTATS_TYPE_STATS
3. TLV=TASKSTATS_TYPE_TGID
4. TLV=TASKSTATS_TYPE_STATS

It would still be one message, except that 3 and 4 would be optional.
What do you think?

> 
> cheers,
> jamal

Thanks for your comments,
Balbir
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

2006-03-23 Thread Balbir Singh
On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
> 
> Hi Balbir,
> 
> Looking good.
> This is a quick scan, so i didnt look at little details.
> Some comments embedded.

Hi, Jamal,

I tried addressing your comments in this new version.

Changelog
-
1. Moved TASKSTATS_MSG_* to under #ifdef __KERNEL__
2. Got rid of .hdrsize = 0 in genl_family family
3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)
4. Got rid of NLM_F_REQUEST, all flags passed down to user space are now 0
5. The response to TASKSTATS_CMD_GET is TASKSTATS_CMD_NEW
6. taskstats_send_stats() now validates the command attributes and ensures
   that it either gets a PID or a TGID. If it gets both simultaneously
   the PID stats are sent.
7. Do not put the PID/TGID into the skb if there are errors in fill_pid() or
   fill_tgid().

Thanks,
Balbir


Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
---

 include/linux/delayacct.h |   11 +
 include/linux/taskstats.h |  111 
 init/Kconfig  |   16 ++
 kernel/Makefile   |1 
 kernel/delayacct.c|   44 +++
 kernel/taskstats.c|  255 ++
 6 files changed, 435 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink 
include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink  2006-03-22 
11:56:03.0 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h   2006-03-22 
11:56:03.0 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 
 #include 
+#include 
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;   /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct 
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
 extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
return 0;
 }
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+   struct task_struct *tsk)
+{
+   if (!tsk->delays)
+   return -EINVAL;
+   return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null   2004-06-24 23:34:38.0 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h   2006-03-24 
06:49:24.0 +0530
@@ -0,0 +1,111 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *   (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION  1
+#define TASKSTATS_NOPID-1
+
+struct taskstats {
+   /* Maintain 64-bit alignment while extending */
+   /* Version 1 */
+
+   /* XXX_count is number of delay values recorded.
+* XXX_total is corresponding cumulative delay in nanoseconds
+*/
+
+#define TASKSTATS_NOCPUSTATS   1
+   __u64   cpu_count;
+   __u64   cpu_delay_total;/* wait, while runnable, for cpu */
+   __u64   blkio_count;
+   __u64   blkio_delay_total;  /* sync,block io completion wait*/
+   __u64   swapin_count;
+   __u64   swapin_delay_total; /* swapin page fault wait*/
+
+   __u64   cpu_run_total;  /* cpu running time
+* no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+   TASKSTATS_CMD_UNSPEC = 0,   /* Reserved */
+   TASKSTATS_CMD_GET,  /* user->kernel request */
+   TASKSTATS_CMD_NEW,  /* kernel->user event */
+   __TASKSTA

Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

2006-03-23 Thread Balbir Singh
On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
> 
> Hi Balbir,
> 
> Looking good.
> This is a quick scan, so i didnt look at little details.

Thanks for your detailed feedback. Please find my response interspersed
along with your comments.

> Some comments embedded.
> 
> On Wed, 2006-22-03 at 13:19 +0530, Balbir Singh wrote:
> 
> 
> 
> 
> >  
> 
> > diff -puN /dev/null include/linux/taskstats.h
> 
> > + * The struct is versioned. Newer versions should only add fields to
> > + * the bottom of the struct to maintain backward compatibility.
> > + *
> > + * To create the next version, bump up the taskstats_version variable
> > + * and delineate the start of newly added fields with a comment indicating
> > + * the version number.
> > + */
> > +
> 
> 
> 
> > +enum {
> > +   TASKSTATS_MSG_UNICAST,  /* send data only to requester */
> > +   TASKSTATS_MSG_MULTICAST,/* send data to a group */
> > +};
> > +
> 
> Above will never be used outside of the kernel.
> Should it go under the ifdef kernel below?
>

Will do
 
> 
> > +#ifdef __KERNEL__
> > +
> 
> Note: some people will argue that you should probably have
> two header files. One for all kernel things and includes another
> which contains all the stuff above ifdef __KERNEL__
> 
> > +
> > +#endif /* __KERNEL__ */
> > +#endif /* _LINUX_TASKSTATS_H */
> 
> 
> 
> 
> > diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
> > --- linux-2.6.16/kernel/Makefile~delayacct-genetlink2006-03-22 
> > 11:56:03.0 +0530
> > +++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.0 
> > +0530
> 
> > +
> > +const int taskstats_version = TASKSTATS_VERSION;
> > +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
> > +static int family_registered = 0;
> > +
> > +static struct genl_family family = {
> > +   .id = GENL_ID_GENERATE,
> > +   .name   = TASKSTATS_GENL_NAME,
> > +   .version= TASKSTATS_GENL_VERSION,
> > +   .hdrsize= 0,
> 
> Do you need to specify hdrsize of 0?

I will remove it

> 
> 
> > +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff 
> > **skbp,
> > +void **replyp)
> > +{
> > +   struct sk_buff *skb;
> > +   void *reply;
> > +
> > +   skb = nlmsg_new(NLMSG_GOODSIZE);
> 
> Ok,  getting a size of NLMSG_GOODSIZE is not a good idea. 
> The max size youll ever get it seems to me is 2*32-bit-data TLVs +
> sizeof struct stats. Why dont you allocate that size?
> 

Will do

> > +   if (!skb)
> > +   return -ENOMEM;
> > +
> > +   if (!info) {
> > +   int seq = get_cpu_var(taskstats_seqnum)++;
> > +   put_cpu_var(taskstats_seqnum);
> > +
> > +   reply = genlmsg_put(skb, 0, seq,
> > +   family.id, 0, NLM_F_REQUEST,
> > +   cmd, family.version);
> 
> Double check if you need NLM_F_REQUEST
> 

It is not required, I will remove it

> > +   } else
> > +   reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
> > +   family.id, 0, info->nlhdr->nlmsg_flags,
> > +   info->genlhdr->cmd, family.version);
> 
> A Response to a GET is a NEW. So i dont think info->genlhdr->cmd is the
> right thing?
> 
> 

I will change it

> 
> 
> 
> > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info 
> > *info)
> > +{
> > +   int rc;
> > +   struct sk_buff *rep_skb;
> > +   struct taskstats stats;
> > +   void *reply;
> > +
> > +   memset(&stats, 0, sizeof(stats));
> > +   rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
> 
> Same comment as before: a response to a GET is a NEW; so
> info->genlhdr->cmd doesnt seem right.
> 

I will change it

> > +   if (rc < 0)
> > +   return rc;
> > +
> > +   if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > +   u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > +   NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> > +   rc = fill_pid((pid_t)pid, NULL, &stats);
> > +   if (rc < 0)
> > +   return rc;
> > +   }
> > +
> > +   if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> > +   u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> > +   NLA_PU

[RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

2006-03-21 Thread Balbir Singh
On Mon, Mar 13, 2006 at 09:48:26PM -0500, jamal wrote:
> On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
> > On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
> 
> 
> I had a long description in an earlier email feedback; but the summary
> of it is the GET command is generic like TASKSTATS_CMD_GET; the message
> itself carries TLVs of what needs to be gotten which are 
> either PID and/or TGID etc. Anyways, theres a long spill of what i am
> saying in that earlier email. Perhaps the current patch is a transition
> towards that?
> 

Hi, Jamal,

Please find the updated version of delayacct-genetlink.patch. We hope
this iteration is closer to your expectation. I have copied the enums
you suggested in your previous review comments and used them.

Comments addressed (in this patch)

- Changed the code to use TLV's for data exchange between kernel and
  user space

Thanks,
Balbir


Documentation for the patch

Create a generic netlink interface (NETLINK_GENERIC family),
called "taskstats", for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.

More changes expected. Following comments will go into a
Documentation file:

When a task is alive, userspace can get its stats by sending a
command containing its pid. Sending a tgid returns the sum of stats
of the tasks belonging to that tgid (where such a sum makes sense).
Together, the command interface allows stats for a large number of
tasks to be collected more efficiently than would be possible
through /proc or any per-pid interface.

The netlink interface also sends the stats for each task to userspace
when the task is exiting. This permits fine-grain accounting for
short-lived tasks, which is important if userspace is doing its own
aggregation of statistics based on some grouping of tasks
(e.g. CSA jobs, ELSA banks or CKRM classes).

If the exiting task belongs to a thread group (with more members than itself)
, the latters delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tid's of a tgid
are exiting one by one.

The interface has been deliberately kept distinct from the delay
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to
userspace is versioned and the command interface easily extensible to
facilitate reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay
accounting alone.

Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>

---

 include/linux/delayacct.h |   11 ++
 include/linux/taskstats.h |  112 
 init/Kconfig  |   16 ++
 kernel/Makefile   |1 
 kernel/delayacct.c|   44 
 kernel/taskstats.c|  251 ++
 6 files changed, 432 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink 
include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink  2006-03-22 
11:56:03.0 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h   2006-03-22 
11:56:03.0 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 
 #include 
+#include 
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;   /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct 
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
 extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
return 0;
 }
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+   struct task_struct *tsk)
+{
+   if (!tsk->delays)
+   return -EINVAL;
+   return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null   2004-06-24 23:34:38.0 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h   2006-03-22 
13:12:01.0 +0530
@@ -0,0 +1,112 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *   (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY W

Re: [UPDATED PATCH] Re: [Lse-tech] Re: [Patch 7/7] Generic netlink interface (delay accounting)

2006-03-13 Thread Balbir Singh
On Sat, Mar 11, 2006 at 08:30:49AM -0500, jamal wrote:
> On Fri, 2006-10-03 at 22:09 +0530, Balbir Singh wrote:
> > On Fri, Mar 10, 2006 at 09:53:53AM -0500, jamal wrote: 
> 
> > > On kernel->user (in the case of response to #a or async notifiation as
> > > in #b) you really dont need to specify the TG/PID since they appear in
> > > the STATS etc.
> > 
> > I see your point now. I am looking at other users of netlink like
> > rtnetlink and I see the classical usage.
> > 
> > We can implement TLV's in our code, but for the most part the data we 
> > exchange
> > between the user <-> kernel has all the TLV's listed in the enum above.
> >
> > The major differnece is the type (pid/tgid). Hence we created a structure
> > (taskstats) instead of using TLV's.
> 
> Something to remember:
> 
> 1) TLVs are essentially giving you the flexibility to send optionally
> appearing elements. It is up to the receiver (in the kernel or user
> space) to check for the presence of mandatory elements or execute things
> depending on the presence of certain TLVs. Example in your case:
> if the tgid TLV appears then the user is requesting for that TLV
> if the pid appears then they are requesting for that
> if both appear then it is the && of the two.
> You should always ignore TLVs you dont understand - to allow for forward
> compatibility.
> 
> 2)  The "T" part is essentially also encoding (semantically) what size
> the value is; the "L" part is useful for validation. So the receiver
> will always know what the size of the TLV is by definition and uses the
> L to make sure it is the right size. Reject what is of the wrong size.
> 
> cheers,
> jamal

Thanks for the clarification, I will try and adapt our genetlink to use
TLV's, I can see the benefits - we will work on this as an evolutionary change
to our code.

Warm Regards,
Balbir
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [UPDATED PATCH] Re: [Lse-tech] Re: [Patch 7/7] Generic netlink interface (delay accounting)

2006-03-10 Thread Balbir Singh
On Fri, Mar 10, 2006 at 09:53:53AM -0500, jamal wrote:
> On Thu, 2006-09-03 at 20:07 +0530, Balbir Singh wrote:
> > Please let us know if we missed something out.
> 
> Design still shaky IMO - now that i think i may understand what your end
> goal is. 
> Using the principles i described in earlier email, the problem you are
> trying to solve is:
> 
> a) shipping of the taskstats from kernel to user-space asynchronously to
> all listeners on multicast channel/group TASKSTATS_LISTEN_GRP
> at the point when some process exits.
> b) responding to queries issued by the user to the kernel for taskstats
> of a particular defined tgid and/or pid combination. 
> 
> Did i summarize your goals correctly?

Yes, you did.

> 
> So lets stat with #b:
> i) the message is multicast; there has to be a user space app registered
> to the multicast group otherwise nothing goes to user space.
> ii) user space issues a GET and it seems to me the appropriate naming
> for the response is a NEW.
> 
> Lets go to #a:
> The issued multicast messages are also NEW and no different from the
> ones sent in response to a GET.
> 
> Having said that then, you have the following commands:
> 
> enum {
>TASKSTATS_CMD_UNSPEC,   /* Reserved */
>TASKSTATS_CMD_GET,  /* user -> kernel query*/
>TASKSTATS_CMD_NEW, /* kernel -> user update */
> };
> 
> You also need the following TLVs
> 
> enum {
>TASKSTATS_TYPE_UNSPEC,   /* Reserved */
>TASKSTATS_TYPE_TGID,  /* The TGID */
>TASKSTATS_TYPE_PID, /* The PID */
>TASKSTATS_TYPE_STATS,  /* carries the taskstats */
>TASKSTATS_TYPE_VERS,   /* carries the version */
> };
> 
> Refer to the doc i passed you and provide feedback if how to use the
> above is not obvious.
>  

I will look at the document, just got hold of it.

> The use of TLVs above implies that any of these can be optionally
> appearing.
> So when you are going from user->kernel with a GET a in #a above, then
> you can specify the PID and/or TGID and you dont need to specify the
> STATS and this would be perfectly legal.
> 
> On kernel->user (in the case of response to #a or async notifiation as
> in #b) you really dont need to specify the TG/PID since they appear in
> the STATS etc.

I see your point now. I am looking at other users of netlink like
rtnetlink and I see the classical usage.

We can implement TLV's in our code, but for the most part the data we exchange
between the user <-> kernel has all the TLV's listed in the enum above.
The major differnece is the type (pid/tgid). Hence we created a structure
(taskstats) instead of using TLV's.

> I take it you dont want to configure the values of taskstats from user
> space, otherwise user->kernel will send a NEW as well.

Your understanding is correct.

> I also take it dumping doesnt apply to you, so you dont need a dump
> callback in your kernel code.

Yes, this is correct as well.

> >From what i described so far, you dont really need a header for yourself
> either (maybe you need one to just store the VERSION?)
> 

True, we do not need a header.

> I didnt understand the point to the err field you had in the reply.
> Netlink already does issue errors which can be read via perror. If this
> is different from what you need, it may be an excuse to have your own
> header.
> 

Hmm.. Will look into this.

> I hope this helps.
> 

Yes, it does immensely. Thanks for the detailed feedback.

> cheers,
> jamal
> 
> 

Warm Regards,
Balbir
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[UPDATED PATCH] Re: [Lse-tech] Re: [Patch 7/7] Generic netlink interface (delay accounting)

2006-03-09 Thread Balbir Singh
> Thanks for the clarification of the usage model. While our needs are 
> certainly much less complex,
> it is useful to know the range of options.
> 
> >There are no hard rules on what you need to be multicasting and as an
> >example you could send periodic(aka time based) samples from the kernel
> >on a multicast channel and that would be received by all. It did seem
> >odd that you want to have a semi-promiscous mode where a response to a
> >GET is multicast. If that is still what you want to achieve, then you
> >should.
> > 
> >>>Also if you can provide feedback whether the doc i sent was any use
> >>>and what wasnt clear etc.
> >also take a look at the excellent documentation Thomas Graf has put in
> >the kernel for all the utilities for manipulating netlink messages and
> >tell me if that should also be put in this doc (It is listed as a TODO).

Hello, Jamal,

Please find the latest version of the patch for review. The genetlink
code has been updated as per your review comments. The changelog is provided
below

1. Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
2. Provide generic functions called genlmsg_data() and genlmsg_len()
   in linux/net/genetlink.h
3. Do not multicast all replies, multicast only events generated due
   to task exit.
4. The taskstats and taskstats_reply structures are now 64 bit aligned.
5. Family id is dynamically generated.

Please let us know if we missed something out.

Thanks,
Balbir


Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>

---

 include/linux/delayacct.h |2 
 include/linux/taskstats.h |  128 
 include/net/genetlink.h   |   20 +++
 init/Kconfig  |   16 ++-
 kernel/Makefile   |1 
 kernel/delayacct.c|   56 ++
 kernel/taskstats.c|  244 ++
 7 files changed, 464 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink 
include/linux/delayacct.h
--- linux-2.6.16-rc5/include/linux/delayacct.h~delayacct-genetlink  
2006-03-09 17:15:31.0 +0530
+++ linux-2.6.16-rc5-balbir/include/linux/delayacct.h   2006-03-09 
17:15:31.0 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 
 #include 
+#include 
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;   /* Delay accounting turned on/off */
@@ -24,6 +25,7 @@ extern void __delayacct_tsk_init(struct 
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio(void);
 extern void __delayacct_swapin(void);
+extern int delayacct_add_tsk(struct taskstats_reply *, struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null   2004-06-24 23:34:38.0 +0530
+++ linux-2.6.16-rc5-balbir/include/linux/taskstats.h   2006-03-09 
19:28:54.0 +0530
@@ -0,0 +1,128 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *   (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION  1
+
+struct taskstats {
+   /* Maintain 64-bit alignment while extending */
+
+   /* Version 1 */
+#define TASKSTATS_NOPID-1
+   __s64   pid;
+   __s64   tgid;
+
+   /* XXX_count is number of delay values recorded.
+* XXX_total is corresponding cumulative delay in nanoseconds
+*/
+
+#define TASKSTATS_NOCPUSTATS   1
+   __u64   cpu_count;
+   __u64   cpu_delay_total;/* wait, while runnable, for cpu */
+   __u64   blkio_count;
+   __u64   blkio_delay_total;  /* sync,block io completion wait*/
+   __u64   swapin_count;
+   __u64   swapin_delay_total; /* swapin page fault wait*/
+
+   __u64   cpu_run_total;  /* cpu running time
+* no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1