Re: [PATCH] include listenq max backlog in /proc/net/tcp and include in tcp_info

2007-09-10 Thread Eric Dumazet

Sridhar Samudrala a écrit :

On Mon, 2007-09-10 at 16:13 -0700, Rick Jones wrote:

Return some useful information such as the maximum listen backlog and
the current listen backlog in the tcp_info structure and have that 
match what one can see in /proc/net/tcp and /proc/net/tcp6.


If we are also exporting max listen backlog, another place to
consider adding this is to tcp_diag_get_info() called via INET_DIAG_INFO.
Current listen backlog is returned in inet_diag_msg->idiag_rqueue.
max listen backlog can be returned in inet_diag_msg->idiag_wqueue.



I agree, /proc/net/tcp is deprecated nowadays...

Rick, could you add this part in your patch, and add my Sign-off-by ?

Thank you
Eric



diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index 57c5f0b..f5b6275 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -25,11 +25,13 @@ static void tcp_diag_get_info(struct sock *sk, struct 
inet_diag_msg *r,
const struct tcp_sock *tp = tcp_sk(sk);
struct tcp_info *info = _info;
 
-   if (sk->sk_state == TCP_LISTEN)
+   if (sk->sk_state == TCP_LISTEN) {
r->idiag_rqueue = sk->sk_ack_backlog;
-   else
+   r->idiag_wqueue = sk->sk_max_ack_backlog;
+   else {
r->idiag_rqueue = tp->rcv_nxt - tp->copied_seq;
-   r->idiag_wqueue = tp->write_seq - tp->snd_una;
+   r->idiag_wqueue = tp->write_seq - tp->snd_una;
+   }
if (info != NULL)
tcp_get_info(sk, info);
 }


[PATCH][2/2] Add ICMPMsgStats MIB (RFC 4293)

2007-09-10 Thread David Stevens
Background: RFC 4293 deprecates existing individual, named ICMP
type counters to be replaced with the ICMPMsgStatsTable. This table
includes entries for both IPv4 and IPv6, and requires counting of all
ICMP types, whether or not the machine implements the type.

These patches "remove" (but not really) the existing counters, and
replace them with the ICMPMsgStats tables for v4 and v6.
It includes the named counters in the /proc places they were, but gets the
values for them from the new tables. It also counts packets generated
from raw socket output (e.g., OutEchoes, MLD queries, RA's from
radvd, etc).

Changes:
1) create icmpmsg_statistics mib
2) create icmpv6msg_statistics mib
3) modify existing counters to use these
4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types
listed by number for easy SNMP parsing
5) modify /proc/net/snmp printing for "Icmp" to get the named data
from new counters.

IPv6 patch attached.

+-DLS

Signed-off-by: David L Stevens <[EMAIL PROTECTED]>

diff -ruNp linux-2.6.22.5/include/linux/snmp.h 
linux-2.6.22.5_ICMPv6MSG/include/linux/snmp.h
--- linux-2.6.22.5/include/linux/snmp.h 2007-08-22 16:23:54.0 
-0700
+++ linux-2.6.22.5_ICMPv6MSG/include/linux/snmp.h   2007-09-10 
15:02:43.0 -0700
@@ -91,35 +91,12 @@ enum
ICMP6_MIB_NUM = 0,
ICMP6_MIB_INMSGS,   /* InMsgs */
ICMP6_MIB_INERRORS, /* InErrors */
-   ICMP6_MIB_INDESTUNREACHS,   /* InDestUnreachs */
-   ICMP6_MIB_INPKTTOOBIGS, /* InPktTooBigs */
-   ICMP6_MIB_INTIMEEXCDS,  /* InTimeExcds */
-   ICMP6_MIB_INPARMPROBLEMS,   /* InParmProblems */
-   ICMP6_MIB_INECHOS,  /* InEchos */
-   ICMP6_MIB_INECHOREPLIES,/* InEchoReplies */
-   ICMP6_MIB_INGROUPMEMBQUERIES,   /* InGroupMembQueries */
-   ICMP6_MIB_INGROUPMEMBRESPONSES, /* InGroupMembResponses */
-   ICMP6_MIB_INGROUPMEMBREDUCTIONS,/* InGroupMembReductions 
*/
-   ICMP6_MIB_INROUTERSOLICITS, /* InRouterSolicits */
-   ICMP6_MIB_INROUTERADVERTISEMENTS,   /* InRouterAdvertisements 
*/
-   ICMP6_MIB_INNEIGHBORSOLICITS,   /* InNeighborSolicits */
-   ICMP6_MIB_INNEIGHBORADVERTISEMENTS, /* 
InNeighborAdvertisements */
-   ICMP6_MIB_INREDIRECTS,  /* InRedirects */
ICMP6_MIB_OUTMSGS,  /* OutMsgs */
-   ICMP6_MIB_OUTDESTUNREACHS,  /* OutDestUnreachs */
-   ICMP6_MIB_OUTPKTTOOBIGS,/* OutPktTooBigs */
-   ICMP6_MIB_OUTTIMEEXCDS, /* OutTimeExcds */
-   ICMP6_MIB_OUTPARMPROBLEMS,  /* OutParmProblems */
-   ICMP6_MIB_OUTECHOREPLIES,   /* OutEchoReplies */
-   ICMP6_MIB_OUTROUTERSOLICITS,/* OutRouterSolicits */
-   ICMP6_MIB_OUTNEIGHBORSOLICITS,  /* OutNeighborSolicits */
-   ICMP6_MIB_OUTNEIGHBORADVERTISEMENTS,/* 
OutNeighborAdvertisements */
-   ICMP6_MIB_OUTREDIRECTS, /* OutRedirects */
-   ICMP6_MIB_OUTGROUPMEMBRESPONSES,/* OutGroupMembResponses 
*/
-   ICMP6_MIB_OUTGROUPMEMBREDUCTIONS,   /* OutGroupMembReductions 
*/
__ICMP6_MIB_MAX
 };
 
+#define __ICMP6MSG_MIB_MAX 512 /* Out+In for all 8-bit ICMPv6 types */
+
 /* tcp mib definitions */
 /*
  * RFC 1213:  MIB-II TCP group
diff -ruNp linux-2.6.22.5/include/net/ipv6.h 
linux-2.6.22.5_ICMPv6MSG/include/net/ipv6.h
--- linux-2.6.22.5/include/net/ipv6.h   2007-08-22 16:23:54.0 
-0700
+++ linux-2.6.22.5_ICMPv6MSG/include/net/ipv6.h 2007-09-10 
15:41:19.0 -0700
@@ -132,6 +132,7 @@ DECLARE_SNMP_STAT(struct ipstats_mib, ip
SNMP_INC_STATS_USER(ipv6_statistics, field);\
 })
 DECLARE_SNMP_STAT(struct icmpv6_mib, icmpv6_statistics);
+DECLARE_SNMP_STAT(struct icmpv6msg_mib, icmpv6msg_statistics);
 #define ICMP6_INC_STATS(idev, field)   ({  \
struct inet6_dev *_idev = (idev);   \
if (likely(_idev != NULL))  \
@@ -157,6 +158,14 @@ DECLARE_SNMP_STAT(struct icmpv6_mib, icm
SNMP_INC_STATS_OFFSET_BH(_idev->stats.icmpv6, field, 
_offset);   \
SNMP_INC_STATS_OFFSET_BH(icmpv6_statistics, field, _offset); \
 })
+
+#define ICMP6MSGOUT_INC_STATS(field) SNMP_INC_STATS(icmpv6msg_statistics, 
field+256)
+#define ICMP6MSGOUT_INC_STATS_BH(field) 
SNMP_INC_STATS_BH(icmpv6msg_statistics, field+256)
+#define ICMP6MSGOUT_INC_STATS_USER(field) 
SNMP_INC_STATS_USER(icmpv6msg_statistics, field+256)
+#define ICMP6MSGIN_INC_STATS(field) SNMP_INC_STATS(icmpv6msg_statistics, 
field)
+#define ICMP6MSGIN_INC_STATS_BH(field) 
SNMP_INC_STATS_BH(icmpv6msg_statistics, field)
+#define ICMP6MSGIN_INC_STATS_USER(field) 
SNMP_INC_STATS_USER(icmpv6msg_statisti

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Segher Boessenkool

"volatile" has nothing to do with reordering.  atomic_dec() writes
to memory, so it _does_ have "volatile semantics", implicitly, as
long as the compiler cannot optimise the atomic variable away
completely -- any store counts as a side effect.


Stores can be reordered. Only x86 has (mostly) implicit write ordering.
So no atomic_dec has no volatile semantics


Read again: I said the C "volatile" construct has nothing to do
with CPU memory access reordering.


and may be reordered on a variety
of processors. Writes to memory may not follow code order on several
processors.


The _compiler_ isn't allowed to reorder things here.  Yes, of course
you do need stronger barriers for many purposes, volatile isn't all
that useful you know.


Segher

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][1/2] Add ICMPMsgStats MIB (RFC 4293)

2007-09-10 Thread David Stevens
Background: RFC 4293 deprecates existing individual, named ICMP
type counters to be replaced with the ICMPMsgStatsTable. This table
includes entries for both IPv4 and IPv6, and requires counting of all
ICMP types, whether or not the machine implements the type.

These patches "remove" (but not really) the existing counters, and
replace them with the ICMPMsgStats tables for v4 and v6.
It includes the named counters in the /proc places they were, but gets the
values for them from the new tables. It also counts packets generated
from raw socket output (e.g., OutEchoes, MLD queries, RA's from
radvd, etc).

Changes:
1) create icmpmsg_statistics mib
2) create icmpv6msg_statistics mib
3) modify existing counters to use these
4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types
listed by number for easy SNMP parsing
5) modify /proc/net/snmp printing for "Icmp" to get the named data
from new counters.

IPv4 patch attached, IPv6 patch to follow.

+-DLS

Signed-off-by: David L Stevens <[EMAIL PROTECTED]>

diff -ruNp linux-2.6.22.5/include/linux/snmp.h 
linux-2.6.22.5_ICMPMSG/include/linux/snmp.h
--- linux-2.6.22.5/include/linux/snmp.h 2007-08-22 16:23:54.0 
-0700
+++ linux-2.6.22.5_ICMPMSG/include/linux/snmp.h 2007-08-23 
15:32:29.0 -0700
@@ -82,6 +82,8 @@ enum
__ICMP_MIB_MAX
 };
 
+#define __ICMPMSG_MIB_MAX 512  /* Out+In for all 8-bit ICMP types */
+
 /* icmp6 mib definitions */
 /*
  * RFC 2466:  ICMPv6-MIB
diff -ruNp linux-2.6.22.5/include/net/icmp.h 
linux-2.6.22.5_ICMPMSG/include/net/icmp.h
--- linux-2.6.22.5/include/net/icmp.h   2007-08-22 16:23:54.0 
-0700
+++ linux-2.6.22.5_ICMPMSG/include/net/icmp.h   2007-08-23 
15:56:45.0 -0700
@@ -30,9 +30,16 @@ struct icmp_err {
 
 extern struct icmp_err icmp_err_convert[];
 DECLARE_SNMP_STAT(struct icmp_mib, icmp_statistics);
+DECLARE_SNMP_STAT(struct icmpmsg_mib, icmpmsg_statistics);
 #define ICMP_INC_STATS(field)  SNMP_INC_STATS(icmp_statistics, 
field)
 #define ICMP_INC_STATS_BH(field)   SNMP_INC_STATS_BH(icmp_statistics, 
field)
 #define ICMP_INC_STATS_USER(field) SNMP_INC_STATS_USER(icmp_statistics, 
field)
+#define ICMPMSGOUT_INC_STATS(field)SNMP_INC_STATS(icmpmsg_statistics, 
field+256)
+#define ICMPMSGOUT_INC_STATS_BH(field) 
SNMP_INC_STATS_BH(icmpmsg_statistics, field+256)
+#define ICMPMSGOUT_INC_STATS_USER(field) 
SNMP_INC_STATS_USER(icmpmsg_statistics, field+256)
+#define ICMPMSGIN_INC_STATS(field) SNMP_INC_STATS(icmpmsg_statistics, 
field)
+#define ICMPMSGIN_INC_STATS_BH(field) 
SNMP_INC_STATS_BH(icmpmsg_statistics, field)
+#define ICMPMSGIN_INC_STATS_USER(field) 
SNMP_INC_STATS_USER(icmpmsg_statistics, field)
 
 struct dst_entry;
 struct net_proto_family;
@@ -42,6 +49,7 @@ extern void   icmp_send(struct sk_buff *sk
 extern int icmp_rcv(struct sk_buff *skb);
 extern int icmp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 extern voidicmp_init(struct net_proto_family *ops);
+extern voidicmp_out_count(unsigned char type);
 
 /* Move into dst.h ? */
 extern int xrlim_allow(struct dst_entry *dst, int timeout);
diff -ruNp linux-2.6.22.5/include/net/snmp.h 
linux-2.6.22.5_ICMPMSG/include/net/snmp.h
--- linux-2.6.22.5/include/net/snmp.h   2007-08-22 16:23:54.0 
-0700
+++ linux-2.6.22.5_ICMPMSG/include/net/snmp.h   2007-08-23 
14:42:50.0 -0700
@@ -82,6 +82,11 @@ struct icmp_mib {
unsigned long   mibs[ICMP_MIB_MAX];
 } __SNMP_MIB_ALIGN__;
 
+#define ICMPMSG_MIB_MAX__ICMPMSG_MIB_MAX
+struct icmpmsg_mib {
+   unsigned long   mibs[ICMPMSG_MIB_MAX];
+} __SNMP_MIB_ALIGN__;
+
 /* ICMP6 (IPv6-ICMP) */
 #define ICMP6_MIB_MAX  __ICMP6_MIB_MAX
 struct icmpv6_mib {
diff -ruNp linux-2.6.22.5/net/ipv4/af_inet.c 
linux-2.6.22.5_ICMPMSG/net/ipv4/af_inet.c
--- linux-2.6.22.5/net/ipv4/af_inet.c   2007-08-22 16:23:54.0 
-0700
+++ linux-2.6.22.5_ICMPMSG/net/ipv4/af_inet.c   2007-08-23 
14:47:26.0 -0700
@@ -1296,6 +1296,10 @@ static int __init init_ipv4_mibs(void)
  sizeof(struct icmp_mib),
  __alignof__(struct icmp_mib)) < 0)
goto err_icmp_mib;
+   if (snmp_mib_init((void **)icmpmsg_statistics,
+ sizeof(struct icmpmsg_mib),
+ __alignof__(struct icmpmsg_mib)) < 0)
+   goto err_icmpmsg_mib;
if (snmp_mib_init((void **)tcp_statistics,
  sizeof(struct tcp_mib),
  __alignof__(struct tcp_mib)) < 0)
@@ -1318,6 +1322,8 @@ err_udplite_mib:
 err_udp_mib:
snmp_mib_free((void **)tcp_statistics);
 err_tcp_mib:
+   snmp_mib_free((void **)icmpmsg_statistics);
+err_icmpmsg_mib:
snmp_mib_free((void **)icmp_statistics);
 err_icmp_mib:
snmp_mib_free((void **)ip_statistics);
diff -ruNp linux-2.6.22.5/net/ipv4/icmp.c 
linux-2.6.22.5_ICMPMSG/net/ipv4/icmp.c
--- linux-2.6.22.5/net/ipv4/icmp.c  2007

Re: [PATCH resend] Fix a lock problem in generic phy code

2007-09-10 Thread Herbert Xu
On Mon, Sep 10, 2007 at 08:45:50PM +0200, Hans-Jürgen Koch wrote:
>
> > Could you please audit all instances of physdev->lock and add
> > _bh where necessary?  I can see that at least phys_stop also
> > needs the _bh.
> 
> I think the patch does all that's necessary. At least, there're no error
> messages in the logs anymore. I didn't check if there's an error on
> unload, though.

Sorry, but you can't rely on the non-existence of lockdep
messages as a proof of correctness :)

If we're going to fix the obvious bugs here, we should fix
the subtle ones too as otherwise they'll be much harder to
notice with this patch merged.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23-rc5: possible irq lock inversion dependency detected

2007-09-10 Thread Herbert Xu
On Mon, Sep 10, 2007 at 08:04:41PM -0400, jamal wrote:
>
> disabling BH would make it more symmetric to the way we handle
> egress. I couldnt reproduce the issue, but this should hopefully resolve
> it.
> Christian, can you test with this patch?

Jamal, it's the police_lock that we need to make _bh.  The
ingress_lock is already _bh because of the spin_lock_bh that
directly precedes it.

Oh and I think the same thing applies for the other actions
too.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] cfg80211: fix initialisation if built-in

2007-09-10 Thread Rob Hussey
On 9/10/07, Magnus Damm <[EMAIL PROTECTED]> wrote:
> -module_init(rate_control_simple_init);
> +//module_init(rate_control_simple_init);
> +postcore_initcall(rate_control_simple_init);
>  module_exit(rate_control_simple_exit);
>
>  MODULE_DESCRIPTION("Simple rate control algorithm for ieee80211");

Same problem here, except with the rt2x00 driver. I changed it to
subsys_initcall(rate_control_simple_init), which also worked. I also
found that without this change, it was failing at this point in
ieee80211_rate.c:

ieee80211_try_rate_control_ops_get(const char *name)
{
struct rate_control_alg *alg;
struct rate_control_ops *ops = NULL;

mutex_lock(&rate_ctrl_mutex);
list_for_each_entry(alg, &rate_ctrl_algs, list) {   <= Here
if (!name || !strcmp(alg->ops->name, name))
if (try_module_get(alg->ops->module)) {
ops = alg->ops;
break;
}
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] cfg80211: fix initialisation if built-in

2007-09-10 Thread Magnus Damm
On 9/10/07, Johannes Berg <[EMAIL PROTECTED]> wrote:
> When cfg80211 is built into the kernel it needs to init earlier
> so that device registrations are run after it has initialised.
>
> Signed-off-by: Johannes Berg <[EMAIL PROTECTED]>

Yep, I need this fix as well. Without it the ath5k driver built in
bombs out during module_init(). Something with kref and a struct
device pointing to an uninitialized ieee80211_class.

I need a similar fix for net/mac80211/rc80211_simple.c as well to get
ath5k working though, not sure why at the moment. There may be some
bug with request_module() not being called properly.
ieee80211_register_hw() calls ieee80211_init_rate_ctrl_alg() with NULL
as name which calls rate_control_alloc() with NULL which always seems
to fail when built in.

This hack works around that problem, not sure what the real fix is.

--- 0002/net/mac80211/rc80211_simple.c
+++ work/net/mac80211/rc80211_simple.c  2007-09-09 18:11:48.0 +0900
@@ -431,7 +431,8 @@ static void __exit rate_control_simple_e
 }


-module_init(rate_control_simple_init);
+//module_init(rate_control_simple_init);
+postcore_initcall(rate_control_simple_init);
 module_exit(rate_control_simple_exit);

 MODULE_DESCRIPTION("Simple rate control algorithm for ieee80211");

Thanks,

/ magnus
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/16] net: netlink support for moving devices between network namespaces.

2007-09-10 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> "Serge E. Hallyn" <[EMAIL PROTECTED]> writes:
> >> 
> >> +static struct net *get_net_ns_by_pid(pid_t pid)
> >> +{
> >> +  struct task_struct *tsk;
> >> +  struct net *net;
> >> +
> >> +  /* Lookup the network namespace */
> >> +  net = ERR_PTR(-ESRCH);
> >> +  rcu_read_lock();
> >> +  tsk = find_task_by_pid(pid);
> >> +  if (tsk) {
> >> +  task_lock(tsk);
> >> +  if (tsk->nsproxy)
> >> +  net = get_net(tsk->nsproxy->net_ns);
> >> +  task_unlock(tsk);
> >
> > Thinking...  Ok, I'm not sure this is 100% safe in the target tree, but
> > the long-term correct way probably isn't yet implemented in the net-
> > tree.  Eventually you will want to:
> >
> > net_ns = NULL;
> > rcu_read_lock();
> > tsk = find_task_by_pid();  /* or _pidns equiv? */
> > nsproxy = task_nsproxy(tsk);
> > if (nsproxy)
> > net_ns = get_net(nsproxy->net_ns);
> > rcu_read_unlock;
> >
> > What you have here is probably unsafe if tsk is the last task pointing
> > to it's nsproxy and it does an unshare, bc unshare isn't protected by
> > task_lock, and you're not rcu_dereferencing tsk->nsproxy (which
> > task_nsproxy does).  At one point we floated a patch to reuse the same
> > nsproxy in that case which would prevent you having to worry about it,
> > but that isn't being done in -mm now so i doubt it's in -net.
> 
> 
> That change isn't merged upstream yet, so it isn't in David's
> net-2.6.24 tree.  Currently task->nsproxy is protected but
> task_lock(current). So the code is fine.
> 
> I am aware that removing the task_lock(current) for the setting
> of current->nsproxy is currently in the works, and I have planned
> to revisit this later when all of these pieces come together.
> 
> For now the code is fine.
> 
> If need be we can drop this patch to remove the potential merge
> conflict.

No, no.  Like you say it's correct at the moment.  Just something we
need to watch out for when it does get merged with the newer changes.

> But I figured it was useful

Absolutely.

> for this part of the user space
> interface to be available for review.

Agreed.  And the rest of the patchset looks good to me.

Thanks.

-serge
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] include listenq max backlog in /proc/net/tcp and include in tcp_info

2007-09-10 Thread Sridhar Samudrala
On Mon, 2007-09-10 at 16:13 -0700, Rick Jones wrote:
> Return some useful information such as the maximum listen backlog and
> the current listen backlog in the tcp_info structure and have that 
> match what one can see in /proc/net/tcp and /proc/net/tcp6.

If we are also exporting max listen backlog, another place to
consider adding this is to tcp_diag_get_info() called via INET_DIAG_INFO.
Current listen backlog is returned in inet_diag_msg->idiag_rqueue.
max listen backlog can be returned in inet_diag_msg->idiag_wqueue.

Thanks
Sridhar
> 
> Signed-off-by: Rick Jones <[EMAIL PROTECTED]>
> ---
> 
> diff -r bdcdd0e1ee9d Documentation/networking/proc_net_tcp.txt
> --- a/Documentation/networking/proc_net_tcp.txt   Sat Sep 01 07:00:31 
> 2007 +
> +++ b/Documentation/networking/proc_net_tcp.txt   Mon Sep 10 16:09:46 
> 2007 -0700
> @@ -20,8 +20,8 @@ up into 3 parts because of the length of
>|| | |   |--> number of unrecovered RTO timeouts
>|| | |--> number of jiffies until timer expires
>|| |> timer_active (see below)
> -  ||--> receive-queue
> -  |---> transmit-queue
> +  ||--> receive-queue or connection backlog
> +  |---> transmit-queue or connection limit
> 
> 10000 54165785 4 cd1e6040 25 4 27 3 -1
>  |  || || |  | |  | |--> slow start size 
> threshold, 
> diff -r bdcdd0e1ee9d net/ipv4/tcp.c
> --- a/net/ipv4/tcp.c  Sat Sep 01 07:00:31 2007 +
> +++ b/net/ipv4/tcp.c  Mon Sep 10 16:09:46 2007 -0700
> @@ -2030,8 +2030,14 @@ void tcp_get_info(struct sock *sk, struc
>   info->tcpi_snd_mss = tp->mss_cache;
>   info->tcpi_rcv_mss = icsk->icsk_ack.rcv_mss;
> 
> - info->tcpi_unacked = tp->packets_out;
> - info->tcpi_sacked = tp->sacked_out;
> + if (sk->sk_state == TCP_LISTEN) {
> + info->tcpi_unacked = sk->sk_ack_backlog;
> + info->tcpi_sacked = sk->sk_max_ack_backlog;
> + }
> + else {
> + info->tcpi_unacked = tp->packets_out;
> + info->tcpi_sacked = tp->sacked_out;
> + }
>   info->tcpi_lost = tp->lost_out;
>   info->tcpi_retrans = tp->retrans_out;
>   info->tcpi_fackets = tp->fackets_out;
> diff -r bdcdd0e1ee9d net/ipv4/tcp_ipv4.c
> --- a/net/ipv4/tcp_ipv4.c Sat Sep 01 07:00:31 2007 +
> +++ b/net/ipv4/tcp_ipv4.c Mon Sep 10 16:09:46 2007 -0700
> @@ -2320,7 +2320,8 @@ static void get_tcp4_sock(struct sock *s
>   sprintf(tmpbuf, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
>   "%08X %5d %8d %lu %d %p %u %u %u %u %d",
>   i, src, srcp, dest, destp, sk->sk_state,
> - tp->write_seq - tp->snd_una,
> + sk->sk_state == TCP_LISTEN ? sk->sk_max_ack_backlog :
> +  (tp->write_seq - tp->snd_una),
>   sk->sk_state == TCP_LISTEN ? sk->sk_ack_backlog :
>(tp->rcv_nxt - tp->copied_seq),
>   timer_active,
> diff -r bdcdd0e1ee9d net/ipv6/tcp_ipv6.c
> --- a/net/ipv6/tcp_ipv6.c Sat Sep 01 07:00:31 2007 +
> +++ b/net/ipv6/tcp_ipv6.c Mon Sep 10 16:09:46 2007 -0700
> @@ -2005,8 +2005,10 @@ static void get_tcp6_sock(struct seq_fil
>  dest->s6_addr32[0], dest->s6_addr32[1],
>  dest->s6_addr32[2], dest->s6_addr32[3], destp,
>  sp->sk_state,
> -tp->write_seq-tp->snd_una,
> -(sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : 
> (tp->rcv_nxt - tp->copied_seq),
> +(sp->sk_state == TCP_LISTEN) ? sp->sk_max_ack_backlog:
> +   tp->write_seq-tp->snd_una,
> +(sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : 
> + (tp->rcv_nxt - tp->copied_seq),
>  timer_active,
>  jiffies_to_clock_t(timer_expires - jiffies),
>  icsk->icsk_retransmits,
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New NAPI API: Need for netif_napi_remove() ?!

2007-09-10 Thread Kok, Auke

Kok, Auke wrote:

David,

 From an old thread:

 > 5) Since, in the NETPOLL case, netif_napi_init() adds the NAPI struct
 >   to the per-device list I renamed it to netif_napi_add().  Currently
 >   no teardown is really necessary, anything that would need to be done
 >   would be driver internal, so I didn't create the corollary
 >   netif_napi_remove() for the time being.  Let's not add it unless it
 >   really becomes necessary.

while coding the NAPI API changes into the ixgbe driver, I notice that I'm in 
need for an implementation for netif_napi_remove(). The ixgbe driver itself 
already modifies it's polling routing on open() and close() based on whether it 
was able to acquire MSI-X vectors or not, and can thus logically change as the 
system suspends/resumes and new hardware is inserted that change the balance in 
the MSI-X vectors in the system. Or, even more bluntly, all MSI support is 
disabled and we want the driver to come up in legacy mode and use a completely 
different poll routine alltogether. We can't do this at probe time.


In any case I think we have a legitimate case for netif_napi_remove() to be 
implemented.



hm, I spoke too soon, I think I can get by for now by just modifying 
adapter->napi.poll when needed, and this would be clean enough for now. This 
might change as I enable multiqueue in this driver later though.


Auke
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


New NAPI API: Need for netif_napi_remove() ?!

2007-09-10 Thread Kok, Auke


David,

From an old thread:

> 5) Since, in the NETPOLL case, netif_napi_init() adds the NAPI struct
>   to the per-device list I renamed it to netif_napi_add().  Currently
>   no teardown is really necessary, anything that would need to be done
>   would be driver internal, so I didn't create the corollary
>   netif_napi_remove() for the time being.  Let's not add it unless it
>   really becomes necessary.

while coding the NAPI API changes into the ixgbe driver, I notice that I'm in 
need for an implementation for netif_napi_remove(). The ixgbe driver itself 
already modifies it's polling routing on open() and close() based on whether it 
was able to acquire MSI-X vectors or not, and can thus logically change as the 
system suspends/resumes and new hardware is inserted that change the balance in 
the MSI-X vectors in the system. Or, even more bluntly, all MSI support is 
disabled and we want the driver to come up in legacy mode and use a completely 
different poll routine alltogether. We can't do this at probe time.


In any case I think we have a legitimate case for netif_napi_remove() to be 
implemented.


Auke
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


noob dev question

2007-09-10 Thread DHAJOGLO
I'm new to the list and new to just about everything involving kernel 
development.  I'm working on a project where I have successfully written a LKM 
to handle IP packets for protocol type 253 (an experimental protocol number).  
Now, I'm working on sending my reply and I hit a road block.  I'm not sure if I 
should:

1) create a new skb with my protocol data and try to pass it to the IP handler
OR
2) use the inbound skb and just turn it around and use it to send back out 
manually via a NF Hook.

Ideally I would like to get my reply formatted and sent into ip_queue_xmit but 
I'm at the limit of what I can do regarding the manipulation of the socket 
buffers.  I'm currently working in 2.6.18.8 as the book I'm using does not 
reflect some of the newer changes to the sk_buff struct.  Any help or pointers 
would be nice.  I'm looking more for online tutorials at this point.

Regards,
-dave



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23-rc5: possible irq lock inversion dependency detected

2007-09-10 Thread jamal
On Mon, 2007-10-09 at 21:00 +0800, Herbert Xu wrote:


> The minimal fix would be to make sure that we disable BH on
> the first CPU. 

disabling BH would make it more symmetric to the way we handle
egress. I couldnt reproduce the issue, but this should hopefully resolve
it.
Christian, can you test with this patch?

cheers,
jamal




[NET_SCHED] make ingress qlock symmetric to egress

Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]>

--- a/net/sched/sch_generic.c   2007/09/10 23:19:45 1.1
+++ b/net/sched/sch_generic.c   2007/09/10 23:52:45
@@ -42,12 +42,12 @@
 void qdisc_lock_tree(struct net_device *dev)
 {
spin_lock_bh(&dev->queue_lock);
-   spin_lock(&dev->ingress_lock);
+   spin_lock_bh(&dev->ingress_lock);
 }
 
 void qdisc_unlock_tree(struct net_device *dev)
 {
-   spin_unlock(&dev->ingress_lock);
+   spin_unlock_bh(&dev->ingress_lock);
spin_unlock_bh(&dev->queue_lock);
 }
 


Re: [PATCH] Document non-semantics of atomic_read() and atomic_set()

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 07:19:44PM -0400, Chris Snook wrote:
> From: Chris Snook <[EMAIL PROTECTED]>
> 
> Unambiguously document the fact that atomic_read() and atomic_set()
> do not imply any ordering or memory access, and that callers are
> obligated to explicitly invoke barriers as needed to ensure that
> changes to atomic variables are visible in all contexts that need
> to see them.

Acked-by: Paul E. McKenney <[EMAIL PROTECTED]>

> Signed-off-by: Chris Snook <[EMAIL PROTECTED]>
> 
> --- a/Documentation/atomic_ops.txt2007-07-08 19:32:17.0 -0400
> +++ b/Documentation/atomic_ops.txt2007-09-10 19:02:50.0 -0400
> @@ -12,7 +12,11 @@
>  C integer type will fail.  Something like the following should
>  suffice:
> 
> - typedef struct { volatile int counter; } atomic_t;
> + typedef struct { int counter; } atomic_t;
> +
> + Historically, counter has been declared volatile.  This is now
> +discouraged.  See Documentation/volatile-considered-harmful.txt for the
> +complete rationale.
> 
>   The first operations to implement for atomic_t's are the
>  initializers and plain reads.
> @@ -42,6 +46,22 @@
> 
>  which simply reads the current value of the counter.
> 
> +*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***
> +
> +Some architectures may choose to use the volatile keyword, barriers, or
> +inline assembly to guarantee some degree of immediacy for atomic_read()
> +and atomic_set().  This is not uniformly guaranteed, and may change in
> +the future, so all users of atomic_t should treat atomic_read() and
> +atomic_set() as simple C assignment statements that may be reordered or
> +optimized away entirely by the compiler or processor, and explicitly
> +invoke the appropriate compiler and/or memory barrier for each use case.
> +Failure to do so will result in code that may suddenly break when used with
> +different architectures or compiler optimizations, or even changes in
> +unrelated code which changes how the compiler optimizes the section
> +accessing atomic_t variables.
> +
> +*** YOU HAVE BEEN WARNED! ***
> +
>  Now, we move onto the actual atomic operation interfaces.
> 
>   void atomic_add(int i, atomic_t *v);
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [-MM, FIX V4] e1000e: incorporate napi_struct changes from net-2.6.24.git

2007-09-10 Thread Auke Kok
This incorporates the new napi_struct changes into e1000e. Included
bugfix for ifdown hang from Krishna Kumar for e1000.

Disabling polling is no longer needed at init time, so remove
napi_disable() call from _probe().

This also fixes an endless polling loop where the driver signalled
"polling done" improperly back to the stack.

Signed-off-by: Auke Kok <[EMAIL PROTECTED]>
---

 drivers/net/e1000e/e1000.h  |2 ++
 drivers/net/e1000e/netdev.c |   40 
 2 files changed, 18 insertions(+), 24 deletions(-)

diff --git a/drivers/net/e1000e/e1000.h b/drivers/net/e1000e/e1000.h
index c57e35a..d2499bb 100644
--- a/drivers/net/e1000e/e1000.h
+++ b/drivers/net/e1000e/e1000.h
@@ -187,6 +187,8 @@ struct e1000_adapter {
struct e1000_ring *tx_ring /* One per active queue */
cacheline_aligned_in_smp;
 
+   struct napi_struct napi;
+
unsigned long tx_queue_len;
unsigned int restart_queue;
u32 txd_cmd;
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 372da46..eeb40cc 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1149,12 +1149,12 @@ static irqreturn_t e1000_intr_msi(int irq, void *data)
mod_timer(&adapter->watchdog_timer, jiffies + 1);
}
 
-   if (netif_rx_schedule_prep(netdev)) {
+   if (netif_rx_schedule_prep(netdev, &adapter->napi)) {
adapter->total_tx_bytes = 0;
adapter->total_tx_packets = 0;
adapter->total_rx_bytes = 0;
adapter->total_rx_packets = 0;
-   __netif_rx_schedule(netdev);
+   __netif_rx_schedule(netdev, &adapter->napi);
} else {
atomic_dec(&adapter->irq_sem);
}
@@ -1212,12 +1212,12 @@ static irqreturn_t e1000_intr(int irq, void *data)
mod_timer(&adapter->watchdog_timer, jiffies + 1);
}
 
-   if (netif_rx_schedule_prep(netdev)) {
+   if (netif_rx_schedule_prep(netdev, &adapter->napi)) {
adapter->total_tx_bytes = 0;
adapter->total_tx_packets = 0;
adapter->total_rx_bytes = 0;
adapter->total_rx_packets = 0;
-   __netif_rx_schedule(netdev);
+   __netif_rx_schedule(netdev, &adapter->napi);
} else {
atomic_dec(&adapter->irq_sem);
}
@@ -1662,10 +1662,10 @@ set_itr_now:
  * e1000_clean - NAPI Rx polling callback
  * @adapter: board private structure
  **/
-static int e1000_clean(struct net_device *poll_dev, int *budget)
+static int e1000_clean(struct napi_struct *napi, int budget)
 {
-   struct e1000_adapter *adapter;
-   int work_to_do = min(*budget, poll_dev->quota);
+   struct e1000_adapter *adapter = container_of(napi, struct 
e1000_adapter, napi);
+   struct net_device *poll_dev = adapter->netdev;
int tx_cleaned = 0, work_done = 0;
 
/* Must NOT use netdev_priv macro here. */
@@ -1684,25 +1684,19 @@ static int e1000_clean(struct net_device *poll_dev, int 
*budget)
spin_unlock(&adapter->tx_queue_lock);
}
 
-   adapter->clean_rx(adapter, &work_done, work_to_do);
-   *budget -= work_done;
-   poll_dev->quota -= work_done;
+   adapter->clean_rx(adapter, &work_done, budget);
 
/* If no Tx and not enough Rx work done, exit the polling mode */
-   if ((!tx_cleaned && (work_done == 0)) ||
+   if ((!tx_cleaned && (work_done < budget)) ||
   !netif_running(poll_dev)) {
 quit_polling:
if (adapter->itr_setting & 3)
e1000_set_itr(adapter);
-   netif_rx_complete(poll_dev);
-   if (test_bit(__E1000_DOWN, &adapter->state))
-   atomic_dec(&adapter->irq_sem);
-   else
-   e1000_irq_enable(adapter);
-   return 0;
+   netif_rx_complete(poll_dev, napi);
+   e1000_irq_enable(adapter);
}
 
-   return 1;
+   return work_done;
 }
 
 static void e1000_vlan_rx_add_vid(struct net_device *netdev, u16 vid)
@@ -2439,7 +2433,7 @@ int e1000e_up(struct e1000_adapter *adapter)
 
clear_bit(__E1000_DOWN, &adapter->state);
 
-   netif_poll_enable(adapter->netdev);
+   napi_enable(&adapter->napi);
e1000_irq_enable(adapter);
 
/* fire a link change interrupt to start the watchdog */
@@ -2472,7 +2466,7 @@ void e1000e_down(struct e1000_adapter *adapter)
e1e_flush();
msleep(10);
 
-   netif_poll_disable(netdev);
+   napi_disable(&adapter->napi);
e1000_irq_disable(adapter);
 
del_timer_sync(&adapter->watchdog_timer);
@@ -2605,7 +2599,7 @@ static int e1000_open(struct net_device *netdev)
/* From here on the code is the same as e1000e_up() */
clear_bit(__E1000_DOWN, &adapter->state);
 
-   netif_poll_enable

[PATCH] Document non-semantics of atomic_read() and atomic_set()

2007-09-10 Thread Chris Snook
From: Chris Snook <[EMAIL PROTECTED]>

Unambiguously document the fact that atomic_read() and atomic_set()
do not imply any ordering or memory access, and that callers are
obligated to explicitly invoke barriers as needed to ensure that
changes to atomic variables are visible in all contexts that need
to see them.

Signed-off-by: Chris Snook <[EMAIL PROTECTED]>

--- a/Documentation/atomic_ops.txt  2007-07-08 19:32:17.0 -0400
+++ b/Documentation/atomic_ops.txt  2007-09-10 19:02:50.0 -0400
@@ -12,7 +12,11 @@
 C integer type will fail.  Something like the following should
 suffice:
 
-   typedef struct { volatile int counter; } atomic_t;
+   typedef struct { int counter; } atomic_t;
+
+   Historically, counter has been declared volatile.  This is now
+discouraged.  See Documentation/volatile-considered-harmful.txt for the
+complete rationale.
 
The first operations to implement for atomic_t's are the
 initializers and plain reads.
@@ -42,6 +46,22 @@
 
 which simply reads the current value of the counter.
 
+*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***
+
+Some architectures may choose to use the volatile keyword, barriers, or
+inline assembly to guarantee some degree of immediacy for atomic_read()
+and atomic_set().  This is not uniformly guaranteed, and may change in
+the future, so all users of atomic_t should treat atomic_read() and
+atomic_set() as simple C assignment statements that may be reordered or
+optimized away entirely by the compiler or processor, and explicitly
+invoke the appropriate compiler and/or memory barrier for each use case.
+Failure to do so will result in code that may suddenly break when used with
+different architectures or compiler optimizations, or even changes in
+unrelated code which changes how the compiler optimizes the section
+accessing atomic_t variables.
+
+*** YOU HAVE BEEN WARNED! ***
+
 Now, we move onto the actual atomic operation interfaces.
 
void atomic_add(int i, atomic_t *v);
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] rfkill: Add rfkill documentation

2007-09-10 Thread Randy Dunlap
On Mon, 10 Sep 2007 19:56:03 +0200 Ivo van Doorn wrote:

> Add a documentation file which contains
> a short description about rfkill with some
> notes about drivers and the userspace interface.

Thanks.  I have noted a few typo/editorial changes below.


> Signed-off-by: Ivo van Doorn <[EMAIL PROTECTED]>
> Acked-by: Dmitry Torokhov <[EMAIL PROTECTED]>
> ---
>  Documentation/rfkill.txt |   88 
> ++
>  1 files changed, 88 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/rfkill.txt
> 
> diff --git a/Documentation/rfkill.txt b/Documentation/rfkill.txt
> new file mode 100644
> index 000..93c76fc
> --- /dev/null
> +++ b/Documentation/rfkill.txt
> @@ -0,0 +1,88 @@
> +rfkill - RF switch subsystem support
> +
> +
> +1 Implementation details
> +2 Driver support
> +3 Userspace support
> +
> +===
> +1: Implementation details
> +
> +The rfkill switch subsystem offers support for keys often found on laptops
> +to enable wireless devices like WiFi and Bluetooth.
> +
> +This is done by providing the user 3 possibilities:
> + - The rfkill system handles all events, userspace is not aware of events.
> + - The rfkill system handles all events, userspace is informed about the 
> event.
> + - The rfkill system does not handle events, userspace handles all events.

I would s/,/;/ in the 3 lines above.

> +The buttons to enable and disable the wireless radios are important in
> +situations where the user is for example using his laptop on a location where
> +wireless radios _must_ be disabled (e.g airplanes).
> +Because of this requirement, userspace support for the keys should not be
> +made mandatory. Because userspace might want to perform some additional 
> smarter
> +tasks when the key is pressed, rfkill still provides userspace the 
> possibility
> +to take over the task to handle the key events.
> +
> +The system inside the kernel has been split into 2 seperate sections:

  separate

> + 1 - RFKILL
> + 2 - RFKILL_INPUT
> +
> +The first option enables rfkill support and will make sure userspace will
> +be notified of any events through the input device. It also creates several
> +sysfs entries which can be used by userspace. See section "Userspace 
> support".
> +
> +The second option provides a rfkill input handler. This handler will

  an

> +listen to all rfkill key events and will toggle the radio accordingly,

end above with ; or .  If '.', s/with/With/ on next line.

> +with this option enabled userspace could either do nothing or simply
> +perform monitoring tasks.
> +
> +
> +2: Driver support
> +
> +Drivers who wish to build in rfkill subsystem support should

   Drivers that

But, drivers can't/don't wish, so it would be better to say something
like:

To build a driver with rfkill subsystem support, the driver should
depend on the Kconfig symbol RFKILL; it should _not_ depend on
RKFILL_INPUT.


> +make sure their driver depends of the Kconfig option RFKILL, it should
> +_not_ depend on RFKILL_INPUT.
> +
> +Unless key events trigger a interrupt to which the driver listens, polling

 an interrupt

> +will be required to determine the key state changes. For this the input
> +layer providers the input-polldev handler.
> +
> +A driver should implement a few steps to correctly make use of the
> +rfkill subsystem. First for non-polling drivers:
> +
> + - rfkill_allocate()
> + - input_allocate_device()
> + - rfkill_register()
> + - input_register_device()
> +
> +For polling drivers:
> +
> + - rfkill_allocate()
> + - input_allocate_polled_device()
> + - rfkill_register()
> + - input_register_polled_device()
> +
> +When a key event has been detected, the correct event should be
> +send over the input device which has been registered by the driver.

   sent

> +
> +
> +3: Userspace support
> +
> +For each key a input device will be created which will send out the correct

an

> +key event when the rfkill key has been pressed.
> +
> +The following sysfs entries will be created:
> +
> + name: Name assigned by driver to this key (interface or driver name).
> + type: Name of the key type ("wlan", "bluetooth", etc).
> + state: Current state of the key. 1: On, 0: Off.
> + claim: 1: Userspace handles events, 0: Kernel handles events
> +
> +Both the "state" and "claim" entries are also writable. For the "state" entry
> +this means that when 1 or 0 is written all radios will be toggled 
> accordingly.

will be written even if they are already in that state?

> +For the "claim" entry writing 1 to it will mean that the kernel will no 
> longer

s/will mean/means/
s/will no longer handle/no longer handles/

> +hand

[PATCH] include listenq max backlog in /proc/net/tcp and include in tcp_info

2007-09-10 Thread Rick Jones
Return some useful information such as the maximum listen backlog and
the current listen backlog in the tcp_info structure and have that 
match what one can see in /proc/net/tcp and /proc/net/tcp6.

Signed-off-by: Rick Jones <[EMAIL PROTECTED]>
---

diff -r bdcdd0e1ee9d Documentation/networking/proc_net_tcp.txt
--- a/Documentation/networking/proc_net_tcp.txt Sat Sep 01 07:00:31 2007 +
+++ b/Documentation/networking/proc_net_tcp.txt Mon Sep 10 16:09:46 2007 -0700
@@ -20,8 +20,8 @@ up into 3 parts because of the length of
   || | |   |--> number of unrecovered RTO timeouts
   || | |--> number of jiffies until timer expires
   || |> timer_active (see below)
-  ||--> receive-queue
-  |---> transmit-queue
+  ||--> receive-queue or connection backlog
+  |---> transmit-queue or connection limit
 
10000 54165785 4 cd1e6040 25 4 27 3 -1
 |  || || |  | |  | |--> slow start size threshold, 
diff -r bdcdd0e1ee9d net/ipv4/tcp.c
--- a/net/ipv4/tcp.cSat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp.cMon Sep 10 16:09:46 2007 -0700
@@ -2030,8 +2030,14 @@ void tcp_get_info(struct sock *sk, struc
info->tcpi_snd_mss = tp->mss_cache;
info->tcpi_rcv_mss = icsk->icsk_ack.rcv_mss;
 
-   info->tcpi_unacked = tp->packets_out;
-   info->tcpi_sacked = tp->sacked_out;
+   if (sk->sk_state == TCP_LISTEN) {
+   info->tcpi_unacked = sk->sk_ack_backlog;
+   info->tcpi_sacked = sk->sk_max_ack_backlog;
+   }
+   else {
+   info->tcpi_unacked = tp->packets_out;
+   info->tcpi_sacked = tp->sacked_out;
+   }
info->tcpi_lost = tp->lost_out;
info->tcpi_retrans = tp->retrans_out;
info->tcpi_fackets = tp->fackets_out;
diff -r bdcdd0e1ee9d net/ipv4/tcp_ipv4.c
--- a/net/ipv4/tcp_ipv4.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp_ipv4.c   Mon Sep 10 16:09:46 2007 -0700
@@ -2320,7 +2320,8 @@ static void get_tcp4_sock(struct sock *s
sprintf(tmpbuf, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
"%08X %5d %8d %lu %d %p %u %u %u %u %d",
i, src, srcp, dest, destp, sk->sk_state,
-   tp->write_seq - tp->snd_una,
+   sk->sk_state == TCP_LISTEN ? sk->sk_max_ack_backlog :
+(tp->write_seq - tp->snd_una),
sk->sk_state == TCP_LISTEN ? sk->sk_ack_backlog :
 (tp->rcv_nxt - tp->copied_seq),
timer_active,
diff -r bdcdd0e1ee9d net/ipv6/tcp_ipv6.c
--- a/net/ipv6/tcp_ipv6.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv6/tcp_ipv6.c   Mon Sep 10 16:09:46 2007 -0700
@@ -2005,8 +2005,10 @@ static void get_tcp6_sock(struct seq_fil
   dest->s6_addr32[0], dest->s6_addr32[1],
   dest->s6_addr32[2], dest->s6_addr32[3], destp,
   sp->sk_state,
-  tp->write_seq-tp->snd_una,
-  (sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : 
(tp->rcv_nxt - tp->copied_seq),
+  (sp->sk_state == TCP_LISTEN) ? sp->sk_max_ack_backlog:
+ tp->write_seq-tp->snd_una,
+  (sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : 
+   (tp->rcv_nxt - tp->copied_seq),
   timer_active,
   jiffies_to_clock_t(timer_expires - jiffies),
   icsk->icsk_retransmits,
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why does tcp_v[46]_conn_request not inc MIB stats

2007-09-10 Thread Rick Jones

Sridhar Samudrala wrote:

On Mon, 2007-09-10 at 11:42 -0700, Rick Jones wrote:

I've been digging around to see about inducing /proc/net/tcp to show 
some "interesting" things for listen sockets (eg backlog depth, its max, 
and dropped connection requests).



backlog depth(acceptq length) for a listening socket should be available
with the newer kernels. The following patch exports this value via the rx_queue
field in /proc/net/tcp.
 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=47da8ee681d04e68ca1b1812c10e28162150d453


Yep, I see it in the 2.6.23-rc5 tree I'm using.  At the risk of yet 
another "merely practice" patching excercise I'm putting together a 
patch which will also return that in a TCP_INFO and add the max backlog 
(to the tx_queue field)


While doing that, I've noticed that 
Documenation/networking/proc-net-tcp.txt (?) talks about a tcp6_get_info 
which I cannot find anywhere in the tree.  I'm not sure if that simply 
means that tcp_get_info is what is used for a "tcp6" connection and the 
text can simply be removed from the documentation, or if it is called 
something else.


 While there I've noticed that both 
tcp_v[46]_syn_recv_sock and tcp_v[46]conn_request both check that the 
listen queue is full, but only tcp_v[46]_syn_recv_sock increments some 
mib stats for dropped connection requests.


Is that deliberate, or is that a hole in the stats?



looks like it is a hole in the stats. I think we should increment
LISTENOVERFLOWS or LISTENDROPS in tcp_v[46]_conn_request too if the
SYN is dropped.


OK.  Now, can we get a third to Sridhar's second?-)

rick jones
struggling through the maze of twisty routines for connection establishment

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] SCTP: Convert bind_addr_list locking to RCU

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 03:46:30PM -0400, Vlad Yasevich wrote:
> Since the sctp_sockaddr_entry is now RCU enabled as part of
> the patch to synchronize sctp_localaddr_list, it makes sense to
> change all handling of these entries to RCU.  This includes the
> sctp_bind_addrs structure and it's list of bound addresses.
> 
> This list is currently protected by an external rw_lock and that
> looks like an overkill.  There are only 2 writers to the list:
> bind()/bindx() calls, and BH processing of ASCONF-ACK chunks.
> These are already seriealized via the socket lock, so they will
> not step on each other.  These are also relatively rare, so we
> should be good with RCU.
> 
> The readers are varied and they are easily converted to RCU.

Again, good start -- similar questions as for the other patch in this
series.  Also a few places where it looks like you are letting a pointer
to an RCU-protected data structure slip out of rcu_read_lock() protection,
and a case of mixing rcu_read_lock() and rcu_read_lock_bh() within the
same RCU-protected data structure.

Thanx, Paul

> Signed-off-by: Vlad Yasevich <[EMAIL PROTECTED]>
> ---
>  include/net/sctp/structs.h |3 -
>  net/sctp/associola.c   |   14 +--
>  net/sctp/bind_addr.c   |   59 ++--
>  net/sctp/endpointola.c |   26 -
>  net/sctp/ipv6.c|   12 ++---
>  net/sctp/protocol.c|   25 +---
>  net/sctp/sm_make_chunk.c   |   17 +++-
>  net/sctp/socket.c  |   92 
> ++--
>  8 files changed, 97 insertions(+), 151 deletions(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 2591c49..1d46f7d 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -1222,9 +1222,6 @@ struct sctp_ep_common {
>* bind_addr.address_list is our set of local IP addresses.
>*/
>   struct sctp_bind_addr bind_addr;
> -
> - /* Protection during address list comparisons. */
> - rwlock_t   addr_lock;
>  };
> 
> 
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 2ad1caf..9bad8ba 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -99,7 +99,6 @@ static struct sctp_association 
> *sctp_association_init(struct sctp_association *a
> 
>   /* Initialize the bind addr area.  */
>   sctp_bind_addr_init(&asoc->base.bind_addr, ep->base.bind_addr.port);
> - rwlock_init(&asoc->base.addr_lock);
> 
>   asoc->state = SCTP_STATE_CLOSED;
> 
> @@ -937,8 +936,6 @@ struct sctp_transport *sctp_assoc_is_match(struct 
> sctp_association *asoc,
>  {
>   struct sctp_transport *transport;
> 
> - sctp_read_lock(&asoc->base.addr_lock);
> -
>   if ((htons(asoc->base.bind_addr.port) == laddr->v4.sin_port) &&
>   (htons(asoc->peer.port) == paddr->v4.sin_port)) {
>   transport = sctp_assoc_lookup_paddr(asoc, paddr);
> @@ -952,7 +949,6 @@ struct sctp_transport *sctp_assoc_is_match(struct 
> sctp_association *asoc,
>   transport = NULL;
> 
>  out:
> - sctp_read_unlock(&asoc->base.addr_lock);
>   return transport;
>  }
> 
> @@ -1376,19 +1372,13 @@ int sctp_assoc_set_bind_addr_from_cookie(struct 
> sctp_association *asoc,
>  int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
>   const union sctp_addr *laddr)
>  {
> - int found;
> + int found = 0;
> 
> - sctp_read_lock(&asoc->base.addr_lock);
>   if ((asoc->base.bind_addr.port == ntohs(laddr->v4.sin_port)) &&
>   sctp_bind_addr_match(&asoc->base.bind_addr, laddr,
> -  sctp_sk(asoc->base.sk))) {
> +  sctp_sk(asoc->base.sk)))
>   found = 1;
> - goto out;
> - }
> 
> - found = 0;
> -out:
> - sctp_read_unlock(&asoc->base.addr_lock);
>   return found;
>  }
> 
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index 7fc369f..9c7db1f 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -167,7 +167,10 @@ int sctp_add_bind_addr(struct sctp_bind_addr *bp, union 
> sctp_addr *new,
> 
>   INIT_LIST_HEAD(&addr->list);
>   INIT_RCU_HEAD(&addr->rcu);
> - list_add_tail(&addr->list, &bp->address_list);
> +
> + rcu_read_lock();
> + list_add_tail_rcu(&addr->list, &bp->address_list);
> + rcu_read_unlock();

Given the original code, we presumably hold the update-side lock.  If so,
the rcu_read_lock() and rcu_read_unlock() are (harmlessly) redundant.

>   SCTP_DBG_OBJCNT_INC(addr);
> 
>   return 0;
> @@ -178,20 +181,23 @@ int sctp_add_bind_addr(struct sctp_bind_addr *bp, union 
> sctp_addr *new,
>   */
>  int sctp_del_bind_addr(struct sctp_bind_addr *bp, union sctp_addr *del_addr)
>  {
> - struct list_head *pos, *temp;
> - struct sctp_sockaddr_entry *addr;
> + struct sctp_sockaddr_entry *addr, *temp;
> 
> - list_for_each_safe(pos, tem

Re: [PATCH] ipconfig.c: De-clutter IP configuration report

2007-09-10 Thread Jan Engelhardt

On Sep 10 2007 13:09, Maciej W. Rozycki wrote:
> The new code builds fine; no semantic changes.
>
> Please apply,
>
>  Maciej
>
>patch-mips-2.6.23-rc5-20070904-ipconfig-printk-2
>diff -up --recursive --new-file 
>linux-mips-2.6.23-rc5-20070904.macro/net/ipv4/ipconfig.c 
>linux-mips-2.6.23-rc5-20070904/net/ipv4/ipconfig.c
>--- linux-mips-2.6.23-rc5-20070904.macro/net/ipv4/ipconfig.c   2007-09-04 
>04:56:22.0 +
>+++ linux-mips-2.6.23-rc5-20070904/net/ipv4/ipconfig.c 2007-09-10 
>11:53:19.0 +
>@@ -1364,17 +1364,17 @@ static int __init ip_auto_config(void)
>   /*
>* Clue in the operator.
>*/
>-  printk("IP-Config: Complete:");
>-  printk("\n  device=%s", ic_dev->name);
>-  printk(", addr=%u.%u.%u.%u", NIPQUAD(ic_myaddr));
>-  printk(", mask=%u.%u.%u.%u", NIPQUAD(ic_netmask));
>-  printk(", gw=%u.%u.%u.%u", NIPQUAD(ic_gateway));
>-  printk(",\n host=%s, domain=%s, nis-domain=%s",
>- utsname()->nodename, ic_domain, utsname()->domainname);
>-  printk(",\n bootserver=%u.%u.%u.%u", NIPQUAD(ic_servaddr));
>-  printk(", rootserver=%u.%u.%u.%u", NIPQUAD(root_server_addr));
>-  printk(", rootpath=%s", root_server_path);
>-  printk("\n");
>+  pr_info("IP-Config: Complete:\n");
>+  pr_info("  device=%s, addr=%u.%u.%u.%u, "
>+  "mask=%u.%u.%u.%u, gw=%u.%u.%u.%u,\n",
>+  ic_dev->name, NIPQUAD(ic_myaddr),
>+  NIPQUAD(ic_netmask), NIPQUAD(ic_gateway));
>+  pr_info("  host=%s, domain=%s, nis-domain=%s,\n",
>+  utsname()->nodename, ic_domain, utsname()->domainname);
>+  pr_info("  bootserver=%u.%u.%u.%u, "
>+  "rootserver=%u.%u.%u.%u, rootpath=%s\n",
>+  NIPQUAD(ic_servaddr),
>+  NIPQUAD(root_server_addr), root_server_path);
> #endif /* !SILENT */
> 
>   return 0;

It should really be done in userspace. And ripped from the kernel.



Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why does tcp_v[46]_conn_request not inc MIB stats

2007-09-10 Thread Sridhar Samudrala
On Mon, 2007-09-10 at 11:42 -0700, Rick Jones wrote:
> I've been digging around to see about inducing /proc/net/tcp to show 
> some "interesting" things for listen sockets (eg backlog depth, its max, 
> and dropped connection requests).

backlog depth(acceptq length) for a listening socket should be available
with the newer kernels. The following patch exports this value via the rx_queue
field in /proc/net/tcp.
 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=47da8ee681d04e68ca1b1812c10e28162150d453

>   While there I've noticed that both 
> tcp_v[46]_syn_recv_sock and tcp_v[46]conn_request both check that the 
> listen queue is full, but only tcp_v[46]_syn_recv_sock increments some 
> mib stats for dropped connection requests.
> 
> Is that deliberate, or is that a hole in the stats?

looks like it is a hole in the stats. I think we should increment
LISTENOVERFLOWS or LISTENDROPS in tcp_v[46]_conn_request too if the
SYN is dropped.

Thanks
Sridhar

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 02:36:26PM -0700, Christoph Lameter wrote:
> On Mon, 10 Sep 2007, Paul E. McKenney wrote:
> 
> > The one exception to this being the case where process-level code is
> > communicating to an interrupt handler running on that same CPU -- on
> > all CPUs that I am aware of, a given CPU always sees its own writes
> > in order.
> 
> Yes but that is due to the code path effectively continuing in the 
> interrupt handler. The cpu makes sure that op codes being executed always 
> see memory in a consistent way. The basic ordering problem with out of 
> order writes is therefore coming from other processors concurrently 
> executing code and holding variables in registers that are modified 
> elsewhere. The only solution that I know of are one or the other form of 
> barrier.

So we are agreed then -- volatile accesses may be of some assistance when
interacting with interrupt handlers running on the same CPU (presumably
when using per-CPU variables), but are generally useless when sharing
variables among CPUs.  Correct?

Thanx, Paul
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/2] SCTP: Add RCU synchronization around sctp_localaddr_list

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 03:46:29PM -0400, Vlad Yasevich wrote:
> sctp_localaddr_list is modified dynamically via NETDEV_UP
> and NETDEV_DOWN events, but there is not synchronization
> between writer (even handler) and readers.  As a result,
> the readers can access an entry that has been freed and
> crash the sytem.

Good start, but few questions interspersed below...

Thanx, Paul

> Signed-off-by: Vlad Yasevich <[EMAIL PROTECTED]>
> ---
>  include/net/sctp/sctp.h|1 +
>  include/net/sctp/structs.h |2 +
>  net/sctp/bind_addr.c   |2 +
>  net/sctp/ipv6.c|   33 
>  net/sctp/protocol.c|   50 ++-
>  net/sctp/socket.c  |   38 ++---
>  6 files changed, 88 insertions(+), 38 deletions(-)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index d529045..c9cc00c 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -123,6 +123,7 @@
>   * sctp/protocol.c
>   */
>  extern struct sock *sctp_get_ctl_sock(void);
> +extern void sctp_local_addr_free(struct rcu_head *head);
>  extern int sctp_copy_local_addr_list(struct sctp_bind_addr *,
>sctp_scope_t, gfp_t gfp,
>int flags);
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index c0d5848..2591c49 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -737,8 +737,10 @@ const union sctp_addr *sctp_source(const struct 
> sctp_chunk *chunk);
>  /* This is a structure for holding either an IPv6 or an IPv4 address.  */
>  struct sctp_sockaddr_entry {
>   struct list_head list;
> + struct rcu_head rcu;
>   union sctp_addr a;
>   __u8 use_as_src;
> + __u8 valid;
>  };
> 
>  typedef struct sctp_chunk *(sctp_packet_phandler_t)(struct sctp_association 
> *);
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index fdb287a..7fc369f 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -163,8 +163,10 @@ int sctp_add_bind_addr(struct sctp_bind_addr *bp, union 
> sctp_addr *new,
>   addr->a.v4.sin_port = htons(bp->port);
> 
>   addr->use_as_src = use_as_src;
> + addr->valid = 1;
> 
>   INIT_LIST_HEAD(&addr->list);
> + INIT_RCU_HEAD(&addr->rcu);
>   list_add_tail(&addr->list, &bp->address_list);
>   SCTP_DBG_OBJCNT_INC(addr);
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index f8aa23d..fc2e4e2 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -77,13 +77,18 @@
> 
>  #include 
> 
> -/* Event handler for inet6 address addition/deletion events.  */
> +/* Event handler for inet6 address addition/deletion events.
> + * This even is part of the atomic notifier call chain
> + * and thus happens atomically and can NOT sleep.  As a result
> + * we can't and really don't need to add any locks to guard the
> + * RCU.
> + */
>  static int sctp_inet6addr_event(struct notifier_block *this, unsigned long 
> ev,
>   void *ptr)
>  {
>   struct inet6_ifaddr *ifa = (struct inet6_ifaddr *)ptr;
> - struct sctp_sockaddr_entry *addr;
> - struct list_head *pos, *temp;
> + struct sctp_sockaddr_entry *addr = NULL;
> + struct sctp_sockaddr_entry *temp;
> 
>   switch (ev) {
>   case NETDEV_UP:
> @@ -94,19 +99,26 @@ static int sctp_inet6addr_event(struct notifier_block 
> *this, unsigned long ev,
>   memcpy(&addr->a.v6.sin6_addr, &ifa->addr,
>sizeof(struct in6_addr));
>   addr->a.v6.sin6_scope_id = ifa->idev->dev->ifindex;
> - list_add_tail(&addr->list, &sctp_local_addr_list);
> + addr->valid = 1;
> + rcu_read_lock();
> + list_add_tail_rcu(&addr->list, &sctp_local_addr_list);
> + rcu_read_unlock();

If we are under the protection of the update-side mutex, the rcu_read_lock()
and rcu_read_unlock() are (harmlessly) redundant.  If we are not under
the protection of some mutex, what prevents concurrent list_add_tail_rcu()
calls from messing up the sctp_sockaddr_entry list?

>   }
>   break;
>   case NETDEV_DOWN:
> - list_for_each_safe(pos, temp, &sctp_local_addr_list) {
> - addr = list_entry(pos, struct sctp_sockaddr_entry, 
> list);
> - if (ipv6_addr_equal(&addr->a.v6.sin6_addr, &ifa->addr)) 
> {
> - list_del(pos);
> - kfree(addr);
> + rcu_read_lock();
> + list_for_each_entry_safe(addr, temp,
> + &sctp_local_addr_list, list) {
> + if (ipv6_addr_equal(&addr->a.v6.sin6_addr,
> +  &ifa->addr)) {
>

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Mon, 10 Sep 2007, Paul E. McKenney wrote:

> The one exception to this being the case where process-level code is
> communicating to an interrupt handler running on that same CPU -- on
> all CPUs that I am aware of, a given CPU always sees its own writes
> in order.

Yes but that is due to the code path effectively continuing in the 
interrupt handler. The cpu makes sure that op codes being executed always 
see memory in a consistent way. The basic ordering problem with out of 
order writes is therefore coming from other processors concurrently 
executing code and holding variables in registers that are modified 
elsewhere. The only solution that I know of are one or the other form of 
barrier.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ne driver crashes when unloaded in 2.6.22.6

2007-09-10 Thread Dan Williams
On Sun, 2007-09-09 at 23:12 +0100, Chris Rankin wrote:
> --- Dan Williams <[EMAIL PROTECTED]> wrote:
> > Offhand question, does your ne2000 card support carrier detection?
> 
> Err... there is a /sys/class/net/eth0/carrier entry (I think - not in front 
> of that machine right
> now). IIRC it said "1" when I read it.

Does it read '0' when you unplug the cable?

Dan

> Cheers,
> Chris
> 
> 
> 
>   ___
> Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
> now.
> http://uk.answers.yahoo.com/ 

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 11:59:29AM -0700, Christoph Lameter wrote:
> On Fri, 17 Aug 2007, Segher Boessenkool wrote:
> 
> > "volatile" has nothing to do with reordering.  atomic_dec() writes
> > to memory, so it _does_ have "volatile semantics", implicitly, as
> > long as the compiler cannot optimise the atomic variable away
> > completely -- any store counts as a side effect.
> 
> Stores can be reordered. Only x86 has (mostly) implicit write ordering. So 
> no atomic_dec has no volatile semantics and may be reordered on a variety 
> of processors. Writes to memory may not follow code order on several 
> processors.

The one exception to this being the case where process-level code is
communicating to an interrupt handler running on that same CPU -- on
all CPUs that I am aware of, a given CPU always sees its own writes
in order.

Thanx, Paul
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Kyle Moffett

On Sep 10, 2007, at 12:46:33, Denys Vlasenko wrote:
My point is that people are confused as to what atomic_read()   
exactly means, and this is bad. Same for cpu_relax().  First one  
says "read", and second one doesn't say "barrier".


Q&A:

Q:  When is it OK to use atomic_read()?
A:  You are asking the question, so never.

Q:  But I need to check the value of the atomic at this point in time...
A:  Your code is buggy if it needs to do that on an atomic_t for  
anything other than debugging or optimization.  Use either  
atomic_*_return() or a lock and some normal integers.


Q:  "So why can't the atomic_read DTRT magically?"
A:  Because "the right thing" depends on the situation and is usually  
best done with something other than atomic_t.


If somebody can post some non-buggy code which is correctly using  
atomic_read() *and* depends on the compiler generating extra  
nonsensical loads due to "volatile" then the issue *might* be  
reconsidered.  This also includes samples of code which uses  
atomic_read() and needs memory barriers (so that we can fix the buggy  
code, not so we can change atomic_read()).  So far the only code  
samples anybody has posted are buggy regardless of whether or not the  
value and/or accessors are flagged "volatile" or not.  And hey, maybe  
the volatile ops *should* be implemented in inline ASM for future- 
proof-ness, but that's a separate issue.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATH 0/2] Add RCU locking to SCTP address management

2007-09-10 Thread Vlad Yasevich
Hi All

This is my first attempt to add RCU synchronization to pieces of SCTP
and I want to make sure I do this right.

The RCU docs a somewhat outdated, and the calling conventions differ
between subsystems, so I am using what I've been able to find.

A bit of a background...

The whole problem started with a panic that led me to NULL deref while
walking a global list of addresses that SCTP maitains.  That list
is modified curing the NETDEV_UP and NETDEV_DOWN notifier processing
and all readers are run in user context.  It looks like the list was
modified while a user app was walking it and we crashed.

Easy enough to fix by adding locking.  However, some notifiers
(ipv6 addrconf) run in atomic context and it says that they can't sleep.
(Q1.  Why are ipv6 notifiers atomic, while IPv4 notifires are blocking?)

So, I decided to add RCU locking around that list.  These events are
already synchronized, so no additional locking on the writer side are need,
so should be make for a nice RCU conversion.

While doing this conversion, I saw that the same structures are used to
maintain a list of bound addresses.  This seemed like another opportunity
to use RCU.  In this case, the readers are wide spread, but there are
only 2 wirters: BH processing of specific chunks, and bind() calls.
Again the writers are already synchronized on the socket lock, so they
will keep out of each others way.  Readers are widespread, but rcu takes
care of that nicely.

So, can folks please take a look and make sure I didn't mess up the
rcu conventions.

Thanks a lot
-vlad
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 1/2] SCTP: Add RCU synchronization around sctp_localaddr_list

2007-09-10 Thread Vlad Yasevich
sctp_localaddr_list is modified dynamically via NETDEV_UP
and NETDEV_DOWN events, but there is not synchronization
between writer (even handler) and readers.  As a result,
the readers can access an entry that has been freed and
crash the sytem.

Signed-off-by: Vlad Yasevich <[EMAIL PROTECTED]>
---
 include/net/sctp/sctp.h|1 +
 include/net/sctp/structs.h |2 +
 net/sctp/bind_addr.c   |2 +
 net/sctp/ipv6.c|   33 
 net/sctp/protocol.c|   50 ++-
 net/sctp/socket.c  |   38 ++---
 6 files changed, 88 insertions(+), 38 deletions(-)

diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index d529045..c9cc00c 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -123,6 +123,7 @@
  * sctp/protocol.c
  */
 extern struct sock *sctp_get_ctl_sock(void);
+extern void sctp_local_addr_free(struct rcu_head *head);
 extern int sctp_copy_local_addr_list(struct sctp_bind_addr *,
 sctp_scope_t, gfp_t gfp,
 int flags);
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index c0d5848..2591c49 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -737,8 +737,10 @@ const union sctp_addr *sctp_source(const struct sctp_chunk 
*chunk);
 /* This is a structure for holding either an IPv6 or an IPv4 address.  */
 struct sctp_sockaddr_entry {
struct list_head list;
+   struct rcu_head rcu;
union sctp_addr a;
__u8 use_as_src;
+   __u8 valid;
 };
 
 typedef struct sctp_chunk *(sctp_packet_phandler_t)(struct sctp_association *);
diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index fdb287a..7fc369f 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -163,8 +163,10 @@ int sctp_add_bind_addr(struct sctp_bind_addr *bp, union 
sctp_addr *new,
addr->a.v4.sin_port = htons(bp->port);
 
addr->use_as_src = use_as_src;
+   addr->valid = 1;
 
INIT_LIST_HEAD(&addr->list);
+   INIT_RCU_HEAD(&addr->rcu);
list_add_tail(&addr->list, &bp->address_list);
SCTP_DBG_OBJCNT_INC(addr);
 
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index f8aa23d..fc2e4e2 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -77,13 +77,18 @@
 
 #include 
 
-/* Event handler for inet6 address addition/deletion events.  */
+/* Event handler for inet6 address addition/deletion events.
+ * This even is part of the atomic notifier call chain
+ * and thus happens atomically and can NOT sleep.  As a result
+ * we can't and really don't need to add any locks to guard the
+ * RCU.
+ */
 static int sctp_inet6addr_event(struct notifier_block *this, unsigned long ev,
void *ptr)
 {
struct inet6_ifaddr *ifa = (struct inet6_ifaddr *)ptr;
-   struct sctp_sockaddr_entry *addr;
-   struct list_head *pos, *temp;
+   struct sctp_sockaddr_entry *addr = NULL;
+   struct sctp_sockaddr_entry *temp;
 
switch (ev) {
case NETDEV_UP:
@@ -94,19 +99,26 @@ static int sctp_inet6addr_event(struct notifier_block 
*this, unsigned long ev,
memcpy(&addr->a.v6.sin6_addr, &ifa->addr,
 sizeof(struct in6_addr));
addr->a.v6.sin6_scope_id = ifa->idev->dev->ifindex;
-   list_add_tail(&addr->list, &sctp_local_addr_list);
+   addr->valid = 1;
+   rcu_read_lock();
+   list_add_tail_rcu(&addr->list, &sctp_local_addr_list);
+   rcu_read_unlock();
}
break;
case NETDEV_DOWN:
-   list_for_each_safe(pos, temp, &sctp_local_addr_list) {
-   addr = list_entry(pos, struct sctp_sockaddr_entry, 
list);
-   if (ipv6_addr_equal(&addr->a.v6.sin6_addr, &ifa->addr)) 
{
-   list_del(pos);
-   kfree(addr);
+   rcu_read_lock();
+   list_for_each_entry_safe(addr, temp,
+   &sctp_local_addr_list, list) {
+   if (ipv6_addr_equal(&addr->a.v6.sin6_addr,
+&ifa->addr)) {
+   addr->valid = 0;
+   list_del_rcu(&addr->list);
break;
}
}
-
+   rcu_read_unlock();
+   if (addr && !addr->valid)
+   call_rcu(&addr->rcu, sctp_local_addr_free);
break;
}
 
@@ -368,6 +380,7 @@ static void sctp_v6_copy_addrlist(struct list_head 
*addrlist,
addr->a.v6.sin6_addr = ifp->addr;
addr->a.v6.sin6_scope_id = dev->ifindex;
 

[RFC PATCH 2/2] SCTP: Convert bind_addr_list locking to RCU

2007-09-10 Thread Vlad Yasevich
Since the sctp_sockaddr_entry is now RCU enabled as part of
the patch to synchronize sctp_localaddr_list, it makes sense to
change all handling of these entries to RCU.  This includes the
sctp_bind_addrs structure and it's list of bound addresses.

This list is currently protected by an external rw_lock and that
looks like an overkill.  There are only 2 writers to the list:
bind()/bindx() calls, and BH processing of ASCONF-ACK chunks.
These are already seriealized via the socket lock, so they will
not step on each other.  These are also relatively rare, so we
should be good with RCU.

The readers are varied and they are easily converted to RCU.

Signed-off-by: Vlad Yasevich <[EMAIL PROTECTED]>
---
 include/net/sctp/structs.h |3 -
 net/sctp/associola.c   |   14 +--
 net/sctp/bind_addr.c   |   59 ++--
 net/sctp/endpointola.c |   26 -
 net/sctp/ipv6.c|   12 ++---
 net/sctp/protocol.c|   25 +---
 net/sctp/sm_make_chunk.c   |   17 +++-
 net/sctp/socket.c  |   92 ++--
 8 files changed, 97 insertions(+), 151 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 2591c49..1d46f7d 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1222,9 +1222,6 @@ struct sctp_ep_common {
 * bind_addr.address_list is our set of local IP addresses.
 */
struct sctp_bind_addr bind_addr;
-
-   /* Protection during address list comparisons. */
-   rwlock_t   addr_lock;
 };
 
 
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 2ad1caf..9bad8ba 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -99,7 +99,6 @@ static struct sctp_association *sctp_association_init(struct 
sctp_association *a
 
/* Initialize the bind addr area.  */
sctp_bind_addr_init(&asoc->base.bind_addr, ep->base.bind_addr.port);
-   rwlock_init(&asoc->base.addr_lock);
 
asoc->state = SCTP_STATE_CLOSED;
 
@@ -937,8 +936,6 @@ struct sctp_transport *sctp_assoc_is_match(struct 
sctp_association *asoc,
 {
struct sctp_transport *transport;
 
-   sctp_read_lock(&asoc->base.addr_lock);
-
if ((htons(asoc->base.bind_addr.port) == laddr->v4.sin_port) &&
(htons(asoc->peer.port) == paddr->v4.sin_port)) {
transport = sctp_assoc_lookup_paddr(asoc, paddr);
@@ -952,7 +949,6 @@ struct sctp_transport *sctp_assoc_is_match(struct 
sctp_association *asoc,
transport = NULL;
 
 out:
-   sctp_read_unlock(&asoc->base.addr_lock);
return transport;
 }
 
@@ -1376,19 +1372,13 @@ int sctp_assoc_set_bind_addr_from_cookie(struct 
sctp_association *asoc,
 int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
const union sctp_addr *laddr)
 {
-   int found;
+   int found = 0;
 
-   sctp_read_lock(&asoc->base.addr_lock);
if ((asoc->base.bind_addr.port == ntohs(laddr->v4.sin_port)) &&
sctp_bind_addr_match(&asoc->base.bind_addr, laddr,
-sctp_sk(asoc->base.sk))) {
+sctp_sk(asoc->base.sk)))
found = 1;
-   goto out;
-   }
 
-   found = 0;
-out:
-   sctp_read_unlock(&asoc->base.addr_lock);
return found;
 }
 
diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index 7fc369f..9c7db1f 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -167,7 +167,10 @@ int sctp_add_bind_addr(struct sctp_bind_addr *bp, union 
sctp_addr *new,
 
INIT_LIST_HEAD(&addr->list);
INIT_RCU_HEAD(&addr->rcu);
-   list_add_tail(&addr->list, &bp->address_list);
+
+   rcu_read_lock();
+   list_add_tail_rcu(&addr->list, &bp->address_list);
+   rcu_read_unlock();
SCTP_DBG_OBJCNT_INC(addr);
 
return 0;
@@ -178,20 +181,23 @@ int sctp_add_bind_addr(struct sctp_bind_addr *bp, union 
sctp_addr *new,
  */
 int sctp_del_bind_addr(struct sctp_bind_addr *bp, union sctp_addr *del_addr)
 {
-   struct list_head *pos, *temp;
-   struct sctp_sockaddr_entry *addr;
+   struct sctp_sockaddr_entry *addr, *temp;
 
-   list_for_each_safe(pos, temp, &bp->address_list) {
-   addr = list_entry(pos, struct sctp_sockaddr_entry, list);
+   rcu_read_lock_bh();
+   list_for_each_entry_safe(addr, temp, &bp->address_list, list) {
if (sctp_cmp_addr_exact(&addr->a, del_addr)) {
/* Found the exact match. */
-   list_del(pos);
-   kfree(addr);
-   SCTP_DBG_OBJCNT_DEC(addr);
-
-   return 0;
+   addr->valid = 0;
+   list_del_rcu(&addr->list);
+   break;
}
}
+   rcu_read_unlock_bh();
+
+   if (addr && !addr->valid) {
+   call_rcu_bh(&addr->rcu, sctp

Re: [PATCH 16/16] net: netlink support for moving devices between network namespaces.

2007-09-10 Thread Eric W. Biederman
"Serge E. Hallyn" <[EMAIL PROTECTED]> writes:
>> 
>> +static struct net *get_net_ns_by_pid(pid_t pid)
>> +{
>> +struct task_struct *tsk;
>> +struct net *net;
>> +
>> +/* Lookup the network namespace */
>> +net = ERR_PTR(-ESRCH);
>> +rcu_read_lock();
>> +tsk = find_task_by_pid(pid);
>> +if (tsk) {
>> +task_lock(tsk);
>> +if (tsk->nsproxy)
>> +net = get_net(tsk->nsproxy->net_ns);
>> +task_unlock(tsk);
>
> Thinking...  Ok, I'm not sure this is 100% safe in the target tree, but
> the long-term correct way probably isn't yet implemented in the net-
> tree.  Eventually you will want to:
>
>   net_ns = NULL;
>   rcu_read_lock();
>   tsk = find_task_by_pid();  /* or _pidns equiv? */
>   nsproxy = task_nsproxy(tsk);
>   if (nsproxy)
>   net_ns = get_net(nsproxy->net_ns);
>   rcu_read_unlock;
>
> What you have here is probably unsafe if tsk is the last task pointing
> to it's nsproxy and it does an unshare, bc unshare isn't protected by
> task_lock, and you're not rcu_dereferencing tsk->nsproxy (which
> task_nsproxy does).  At one point we floated a patch to reuse the same
> nsproxy in that case which would prevent you having to worry about it,
> but that isn't being done in -mm now so i doubt it's in -net.


That change isn't merged upstream yet, so it isn't in David's
net-2.6.24 tree.  Currently task->nsproxy is protected but
task_lock(current). So the code is fine.

I am aware that removing the task_lock(current) for the setting
of current->nsproxy is currently in the works, and I have planned
to revisit this later when all of these pieces come together.

For now the code is fine.

If need be we can drop this patch to remove the potential merge
conflict.  But I figured it was useful for this part of the user space
interface to be available for review.

Eric
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/16] net: netlink support for moving devices between network namespaces.

2007-09-10 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> 
> The simplest thing to implement is moving network devices between
> namespaces.  However with the same attribute IFLA_NET_NS_PID we can
> easily implement creating devices in the destination network
> namespace as well.  However that is a little bit trickier so this
> patch sticks to what is simple and easy.
> 
> A pid is used to identify a process that happens to be a member
> of the network namespace we want to move the network device to.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
> ---
>  include/linux/if_link.h |1 +
>  net/core/rtnetlink.c|   35 +++
>  2 files changed, 36 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/if_link.h b/include/linux/if_link.h
> index 422084d..84c3492 100644
> --- a/include/linux/if_link.h
> +++ b/include/linux/if_link.h
> @@ -78,6 +78,7 @@ enum
>   IFLA_LINKMODE,
>   IFLA_LINKINFO,
>  #define IFLA_LINKINFO IFLA_LINKINFO
> + IFLA_NET_NS_PID,
>   __IFLA_MAX
>  };
> 
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 44f91bb..1b9c32d 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -727,6 +728,7 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
>   [IFLA_WEIGHT]   = { .type = NLA_U32 },
>   [IFLA_OPERSTATE]= { .type = NLA_U8 },
>   [IFLA_LINKMODE] = { .type = NLA_U8 },
> + [IFLA_NET_NS_PID]   = { .type = NLA_U32 },
>  };
> 
>  static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
> @@ -734,12 +736,45 @@ static const struct nla_policy 
> ifla_info_policy[IFLA_INFO_MAX+1] = {
>   [IFLA_INFO_DATA]= { .type = NLA_NESTED },
>  };
> 
> +static struct net *get_net_ns_by_pid(pid_t pid)
> +{
> + struct task_struct *tsk;
> + struct net *net;
> +
> + /* Lookup the network namespace */
> + net = ERR_PTR(-ESRCH);
> + rcu_read_lock();
> + tsk = find_task_by_pid(pid);
> + if (tsk) {
> + task_lock(tsk);
> + if (tsk->nsproxy)
> + net = get_net(tsk->nsproxy->net_ns);
> + task_unlock(tsk);

Thinking...  Ok, I'm not sure this is 100% safe in the target tree, but
the long-term correct way probably isn't yet implemented in the net-
tree.  Eventually you will want to:

net_ns = NULL;
rcu_read_lock();
tsk = find_task_by_pid();  /* or _pidns equiv? */
nsproxy = task_nsproxy(tsk);
if (nsproxy)
net_ns = get_net(nsproxy->net_ns);
rcu_read_unlock;

What you have here is probably unsafe if tsk is the last task pointing
to it's nsproxy and it does an unshare, bc unshare isn't protected by
task_lock, and you're not rcu_dereferencing tsk->nsproxy (which
task_nsproxy does).  At one point we floated a patch to reuse the same
nsproxy in that case which would prevent you having to worry about it,
but that isn't being done in -mm now so i doubt it's in -net.

> + }
> + rcu_read_unlock();
> + return net;
> +}
> +
>  static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
> struct nlattr **tb, char *ifname, int modified)
>  {
>   int send_addr_notify = 0;
>   int err;
> 
> + if (tb[IFLA_NET_NS_PID]) {
> + struct net *net;
> + net = get_net_ns_by_pid(nla_get_u32(tb[IFLA_NET_NS_PID]));
> + if (IS_ERR(net)) {
> + err = PTR_ERR(net);
> + goto errout;
> + }
> + err = dev_change_net_namespace(dev, net, ifname);
> + put_net(net);
> + if (err)
> + goto errout;
> + modified = 1;
> + }
> +
>   if (tb[IFLA_MAP]) {
>   struct rtnl_link_ifmap *u_map;
>   struct ifmap k_map;
> -- 
> 1.5.3.rc6.17.g1911
> 
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/containers
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Mon, 10 Sep 2007, Linus Torvalds wrote:

> The fact is, "volatile" *only* makes things worse. It generates worse 
> code, and never fixes any real bugs. This is a *fact*.

Yes, lets just drop the volatiles now! We need a patch that gets rid of 
them Volunteers?


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Fri, 17 Aug 2007, Segher Boessenkool wrote:

> "volatile" has nothing to do with reordering.  atomic_dec() writes
> to memory, so it _does_ have "volatile semantics", implicitly, as
> long as the compiler cannot optimise the atomic variable away
> completely -- any store counts as a side effect.

Stores can be reordered. Only x86 has (mostly) implicit write ordering. So 
no atomic_dec has no volatile semantics and may be reordered on a variety 
of processors. Writes to memory may not follow code order on several 
processors.


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH resend] Fix a lock problem in generic phy code

2007-09-10 Thread Hans-Jürgen Koch
Am Montag 10 September 2007 schrieb Herbert Xu:
> Hans-J??rgen Koch <[EMAIL PROTECTED]> wrote:
> > The following patch fixes it. Tested on an AT91SAM9263-EK board, kernel
> > 2.6.23-rc4 and -rc3-mm1.
>
> Could you please audit all instances of physdev->lock and add
> _bh where necessary?  I can see that at least phys_stop also
> needs the _bh.

I think the patch does all that's necessary. At least, there're no error
messages in the logs anymore. I didn't check if there's an error on
unload, though.

>
> We should also consider whether it makes sense to move the
> timer into a work queue.

Somebody who's more involved in that PHY stuff should probably do that.
I'm on holidays, sitting in a hotel room, and will probably not have the
time to have a deeper look into that.

Thanks,
Hans



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


why does tcp_v[46]_conn_request not inc MIB stats

2007-09-10 Thread Rick Jones
I've been digging around to see about inducing /proc/net/tcp to show 
some "interesting" things for listen sockets (eg backlog depth, its max, 
and dropped connection requests).  While there I've noticed that both 
tcp_v[46]_syn_recv_sock and tcp_v[46]conn_request both check that the 
listen queue is full, but only tcp_v[46]_syn_recv_sock increments some 
mib stats for dropped connection requests.


Is that deliberate, or is that a hole in the stats?

rick jones
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] sb1250-mac.c: De-typedef, de-volatile, de-etc...

2007-09-10 Thread Ralf Baechle
On Mon, Sep 10, 2007 at 01:20:38PM +0100, Maciej W. Rozycki wrote:

>  Remove typedefs, volatiles and convert kmalloc()/memset() pairs to
> kcalloc().  Also reformat the surrounding clutter.
> 
> Signed-off-by: Maciej W. Rozycki <[EMAIL PROTECTED]>
> ---
>  Per your request, Andrew, a while ago.  It builds, runs, passes 
> checkpatch.pl and sparse.  No semantic changes.

One step closer to sanity for this driver.  So it's got my ACK.

  Ralf
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] rfkill: Add support for ultrawideband

2007-09-10 Thread Ivo van Doorn
This patch will add support for UWB keys to rfkill,
support for this has been requested by Inaky.

Signed-off-by: Ivo van Doorn <[EMAIL PROTECTED]>
Acked-by: Dmitry Torokhov <[EMAIL PROTECTED]>
---
 include/linux/input.h |1 +
 include/linux/rfkill.h|2 ++
 net/rfkill/rfkill-input.c |9 +
 net/rfkill/rfkill.c   |3 +++
 4 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/input.h b/include/linux/input.h
index cf2b561..8e5828d 100644
--- a/include/linux/input.h
+++ b/include/linux/input.h
@@ -360,6 +360,7 @@ struct input_absinfo {
 
 #define KEY_BLUETOOTH  237
 #define KEY_WLAN   238
+#define KEY_UWB239
 
 #define KEY_UNKNOWN240
 
diff --git a/include/linux/rfkill.h b/include/linux/rfkill.h
index c4546e1..f9a50da 100644
--- a/include/linux/rfkill.h
+++ b/include/linux/rfkill.h
@@ -31,10 +31,12 @@
  * enum rfkill_type - type of rfkill switch.
  * RFKILL_TYPE_WLAN: switch is no a Wireless network devices.
  * RFKILL_TYPE_BlUETOOTH: switch is on a bluetooth device.
+ * RFKILL_TYPE_UWB: switch is on a Ultra wideband device.
  */
 enum rfkill_type {
RFKILL_TYPE_WLAN ,
RFKILL_TYPE_BLUETOOTH,
+   RFKILL_TYPE_UWB,
RFKILL_TYPE_MAX,
 };
 
diff --git a/net/rfkill/rfkill-input.c b/net/rfkill/rfkill-input.c
index 9f746be..8e4516a 100644
--- a/net/rfkill/rfkill-input.c
+++ b/net/rfkill/rfkill-input.c
@@ -81,6 +81,7 @@ static void rfkill_schedule_toggle(struct rfkill_task *task)
 
 static DEFINE_RFKILL_TASK(rfkill_wlan, RFKILL_TYPE_WLAN);
 static DEFINE_RFKILL_TASK(rfkill_bt, RFKILL_TYPE_BLUETOOTH);
+static DEFINE_RFKILL_TASK(rfkill_uwb, RFKILL_TYPE_UWB);
 
 static void rfkill_event(struct input_handle *handle, unsigned int type,
unsigned int code, int down)
@@ -93,6 +94,9 @@ static void rfkill_event(struct input_handle *handle, 
unsigned int type,
case KEY_BLUETOOTH:
rfkill_schedule_toggle(&rfkill_bt);
break;
+   case KEY_UWB:
+   rfkill_schedule_toggle(&rfkill_uwb);
+   break;
default:
break;
}
@@ -148,6 +152,11 @@ static const struct input_device_id rfkill_ids[] = {
.evbit = { BIT(EV_KEY) },
.keybit = { [LONG(KEY_BLUETOOTH)] = BIT(KEY_BLUETOOTH) },
},
+   {
+   .flags = INPUT_DEVICE_ID_MATCH_EVBIT | 
INPUT_DEVICE_ID_MATCH_KEYBIT,
+   .evbit = { BIT(EV_KEY) },
+   .keybit = { [LONG(KEY_UWB)] = BIT(KEY_UWB) },
+   },
{ }
 };
 
diff --git a/net/rfkill/rfkill.c b/net/rfkill/rfkill.c
index 50e0102..03ed7fd 100644
--- a/net/rfkill/rfkill.c
+++ b/net/rfkill/rfkill.c
@@ -106,6 +106,9 @@ static ssize_t rfkill_type_show(struct device *dev,
case RFKILL_TYPE_BLUETOOTH:
type = "bluetooth";
break;
+   case RFKILL_TYPE_UWB:
+   type = "ultrawideband";
+   break;
default:
BUG();
}
-- 
1.5.3

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] rfkill: Add rfkill documentation

2007-09-10 Thread Ivo van Doorn
Add a documentation file which contains
a short description about rfkill with some
notes about drivers and the userspace interface.

Signed-off-by: Ivo van Doorn <[EMAIL PROTECTED]>
Acked-by: Dmitry Torokhov <[EMAIL PROTECTED]>
---
 Documentation/rfkill.txt |   88 ++
 1 files changed, 88 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/rfkill.txt

diff --git a/Documentation/rfkill.txt b/Documentation/rfkill.txt
new file mode 100644
index 000..93c76fc
--- /dev/null
+++ b/Documentation/rfkill.txt
@@ -0,0 +1,88 @@
+rfkill - RF switch subsystem support
+
+
+1 Implementation details
+2 Driver support
+3 Userspace support
+
+===
+1: Implementation details
+
+The rfkill switch subsystem offers support for keys often found on laptops
+to enable wireless devices like WiFi and Bluetooth.
+
+This is done by providing the user 3 possibilities:
+ - The rfkill system handles all events, userspace is not aware of events.
+ - The rfkill system handles all events, userspace is informed about the event.
+ - The rfkill system does not handle events, userspace handles all events.
+
+The buttons to enable and disable the wireless radios are important in
+situations where the user is for example using his laptop on a location where
+wireless radios _must_ be disabled (e.g airplanes).
+Because of this requirement, userspace support for the keys should not be
+made mandatory. Because userspace might want to perform some additional smarter
+tasks when the key is pressed, rfkill still provides userspace the possibility
+to take over the task to handle the key events.
+
+The system inside the kernel has been split into 2 seperate sections:
+   1 - RFKILL
+   2 - RFKILL_INPUT
+
+The first option enables rfkill support and will make sure userspace will
+be notified of any events through the input device. It also creates several
+sysfs entries which can be used by userspace. See section "Userspace support".
+
+The second option provides a rfkill input handler. This handler will
+listen to all rfkill key events and will toggle the radio accordingly,
+with this option enabled userspace could either do nothing or simply
+perform monitoring tasks.
+
+
+2: Driver support
+
+Drivers who wish to build in rfkill subsystem support should
+make sure their driver depends of the Kconfig option RFKILL, it should
+_not_ depend on RFKILL_INPUT.
+
+Unless key events trigger a interrupt to which the driver listens, polling
+will be required to determine the key state changes. For this the input
+layer providers the input-polldev handler.
+
+A driver should implement a few steps to correctly make use of the
+rfkill subsystem. First for non-polling drivers:
+
+   - rfkill_allocate()
+   - input_allocate_device()
+   - rfkill_register()
+   - input_register_device()
+
+For polling drivers:
+
+   - rfkill_allocate()
+   - input_allocate_polled_device()
+   - rfkill_register()
+   - input_register_polled_device()
+
+When a key event has been detected, the correct event should be
+send over the input device which has been registered by the driver.
+
+
+3: Userspace support
+
+For each key a input device will be created which will send out the correct
+key event when the rfkill key has been pressed.
+
+The following sysfs entries will be created:
+
+   name: Name assigned by driver to this key (interface or driver name).
+   type: Name of the key type ("wlan", "bluetooth", etc).
+   state: Current state of the key. 1: On, 0: Off.
+   claim: 1: Userspace handles events, 0: Kernel handles events
+
+Both the "state" and "claim" entries are also writable. For the "state" entry
+this means that when 1 or 0 is written all radios will be toggled accordingly.
+For the "claim" entry writing 1 to it will mean that the kernel will no longer
+handle key events even though RFKILL_INPUT input was enabled. When "claim" has
+been set to 0, userspace should make sure it will listen for the input events
+or check the sysfs "state" entry regularly to correctly perform the required
+tasks when the rkfill key is pressed.
-- 
1.5.3

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] rfkill: Remove IRDA

2007-09-10 Thread Ivo van Doorn
As Dmitry pointed out earlier, rfkill-input.c
doesn't support irda because there are no users
and we shouldn't add unrequired KEY_ defines.

However, RFKILL_TYPE_IRDA was defined in the
rfkill.h header file and would confuse people
about whether it is implemented or not.

This patch removes IRDA support completely,
so it can be added whenever a driver wants the
feature.

Signed-off-by: Ivo van Doorn <[EMAIL PROTECTED]>
Acked-by: Dmitry Torokhov <[EMAIL PROTECTED]>
---
 include/linux/rfkill.h |8 +++-
 net/rfkill/Kconfig |2 +-
 net/rfkill/rfkill.c|5 +
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/include/linux/rfkill.h b/include/linux/rfkill.h
index a8a6ea8..c4546e1 100644
--- a/include/linux/rfkill.h
+++ b/include/linux/rfkill.h
@@ -31,13 +31,11 @@
  * enum rfkill_type - type of rfkill switch.
  * RFKILL_TYPE_WLAN: switch is no a Wireless network devices.
  * RFKILL_TYPE_BlUETOOTH: switch is on a bluetooth device.
- * RFKILL_TYPE_IRDA: switch is on an infrared devices.
  */
 enum rfkill_type {
-   RFKILL_TYPE_WLAN = 0,
-   RFKILL_TYPE_BLUETOOTH = 1,
-   RFKILL_TYPE_IRDA = 2,
-   RFKILL_TYPE_MAX = 3,
+   RFKILL_TYPE_WLAN ,
+   RFKILL_TYPE_BLUETOOTH,
+   RFKILL_TYPE_MAX,
 };
 
 enum rfkill_state {
diff --git a/net/rfkill/Kconfig b/net/rfkill/Kconfig
index 8b31759..d28a6d9 100644
--- a/net/rfkill/Kconfig
+++ b/net/rfkill/Kconfig
@@ -5,7 +5,7 @@ menuconfig RFKILL
tristate "RF switch subsystem support"
help
  Say Y here if you want to have control over RF switches
- found on many WiFi, Bluetooth and IRDA cards.
+ found on many WiFi and Bluetooth cards.
 
  To compile this driver as a module, choose M here: the
  module will be called rfkill.
diff --git a/net/rfkill/rfkill.c b/net/rfkill/rfkill.c
index db3395b..50e0102 100644
--- a/net/rfkill/rfkill.c
+++ b/net/rfkill/rfkill.c
@@ -106,9 +106,6 @@ static ssize_t rfkill_type_show(struct device *dev,
case RFKILL_TYPE_BLUETOOTH:
type = "bluetooth";
break;
-   case RFKILL_TYPE_IRDA:
-   type = "irda";
-   break;
default:
BUG();
}
@@ -281,7 +278,7 @@ static void rfkill_remove_switch(struct rfkill *rfkill)
 /**
  * rfkill_allocate - allocate memory for rfkill structure.
  * @parent: device that has rf switch on it
- * @type: type of the switch (wlan, bluetooth, irda)
+ * @type: type of the switch (RFKILL_TYPE_*)
  *
  * This function should be called by the network driver when it needs
  * rfkill structure. Once the structure is allocated the driver shoud
-- 
1.5.3

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: auto recycling of TIME_WAIT connections

2007-09-10 Thread Rick Jones

Pádraig Brady wrote:

Rick Jones wrote:

This was an issue over a decade ago with SPECweb96 benchmarking.  The
initial solution was to make the explicit bind() calls and not rely on
the anonymous/ephemeral port space.  After that, one starts adding
additional IP's into the mix (at least where possible).  And if that
fails, one has to go back to the beginning and ask oneself exactly why a
client is trying to churn through so many connections per second in the
first place.



right. This is for benchmarking mainly.
Sane applications would use persistent connections,
or a different form of IPC.


All the more reason to go the "add more client IP's" path then.  It 
gives you more connections per second, and gives you a much broader 
umber of "client" IP's hitting the server which will be more realistic.
That is one thing I like very much about polygraph (based on what I've 
read) - it's use of _lots_ of client IPs to better simulate reality.  I 
think other web-oriented benchmarks should start to include that as well 
for there are stacks which do indeed make "decisions" based on whether 
or not a destination is perceived to be "local" or not.


rick jones
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 16:09, Linus Torvalds wrote:
> On Mon, 10 Sep 2007, Denys Vlasenko wrote:
> > static inline int
> > qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
> > {
> > int  return_status = QLA_SUCCESS;
> > unsigned long loop_timeout ;
> > scsi_qla_host_t *pha = to_qla_parent(ha);
> > 
> > /* wait for 5 min at the max for loop to be ready */
> > loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);
> > 
> > while ((!atomic_read(&pha->loop_down_timer) &&
> > atomic_read(&pha->loop_state) == LOOP_DOWN) ||
> > atomic_read(&pha->loop_state) != LOOP_READY) {
> > if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
> ...
> > Is above correct or buggy? Correct, because msleep is a barrier.
> > Is it obvious? No.
> 
> It's *buggy*. But it has nothing to do with any msleep() in the loop, or 
> anything else.
> 
> And more importantly, it would be equally buggy even *with* a "volatile" 
> atomic_read().

I am not saying that this code is okay, this isn't the point.
(The code is in fact awful for several more reasons).

My point is that people are confused as to what atomic_read()
exactly means, and this is bad. Same for cpu_relax().
First one says "read", and second one doesn't say "barrier".

This is real code from current kernel which demonstrates this:

"I don't know that cpu_relax() is a barrier already":

drivers/kvm/kvm_main.c
        while (atomic_read(&completed) != needed) {
                cpu_relax();
                barrier();
        }

"I think that atomic_read() is a read from memory and therefore
I don't need a barrier":

arch/x86_64/kernel/crash.c
        msecs = 1000; /* Wait at most a second for the other cpus to stop */
        while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
                mdelay(1);
                msecs--;
        }

Since neither camp seems to give up, I am proposing renaming
them to something less confusing, and make everybody happy.

cpu_relax_barrier()
atomic_value(&x)
atomic_fetch(&x)

I'm not native English speaker, do these sound better?
--
vda
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Arjan van de Ven
On Mon, 10 Sep 2007 15:38:23 +0100
Denys Vlasenko <[EMAIL PROTECTED]> wrote:

> On Monday 10 September 2007 15:51, Arjan van de Ven wrote:
> > On Mon, 10 Sep 2007 11:56:29 +0100
> > Denys Vlasenko <[EMAIL PROTECTED]> wrote:
> > 
> > > 
> > > Well, if you insist on having it again:
> > > 
> > > Waiting for atomic value to be zero:
> > > 
> > > while (atomic_read(&x))
> > > continue;
> > > 
> > 
> > and this I would say is buggy code all the way.
> >
> > Not from a pure C level semantics, but from a "busy waiting is
> > buggy" semantics level and a "I'm inventing my own locking"
> > semantics level.
> 
> After inspecting arch/*, I cannot agree with you.

the arch/ people obviously are allowed to do their own locking stuff...
BECAUSE THEY HAVE TO IMPLEMENT THAT!


the arch maintainers know EXACTLY how their hw behaves (well, we hope)
so they tend to be the exception to many rules in the kernel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/16] net: Basic network namespace infrastructure.

2007-09-10 Thread Eric W. Biederman
Pavel Emelyanov <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>
> [snip]
>
>> --- /dev/null
>> +++ b/include/net/net_namespace.h
>> @@ -0,0 +1,68 @@
>> +/*
>> + * Operations on the network namespace
>> + */
>> +#ifndef __NET_NET_NAMESPACE_H
>> +#define __NET_NET_NAMESPACE_H
>> +
>> +#include 
>> +#include 
>> +#include 
>> +
>> +struct net {
>
> Isn't this name is too generic? Why not net_namespace?

My fingers went on strike.  struct netns probably wouldn't be bad.
The very long and spelled out version seemed painful.  I don't really
care except that not changing it is easier for me.  I'm happy to do
whatever is considered most maintainable.

>> +/* Walk through the list backwards calling the exit functions
>> + * for the pernet modules whose init functions did not fail.
>> + */
>> +for (ptr = ptr->prev; ptr != &pernet_list; ptr = ptr->prev) {
>
> Good reason for adding list_for_each_continue_reverse :)

Sounds reasonable.

>> +static int register_pernet_operations(struct list_head *list,
>> +  struct pernet_operations *ops)
>> +{
>> +struct net *net, *undo_net;
>> +int error;
>> +
>> +error = 0;
>> +list_add_tail(&ops->list, list);
>> +for_each_net(net) {
>> +if (ops->init) {
>
> Maybe it's better to do it vice-versa?
> if (ops->init)
>for_each_net(net)
>ops->init(net);
> ...

My gut feel says it is more readable with the test on the inside.
Although something like
if (ops->init)
   goto out;

might be more readable.

>> +int register_pernet_device(struct pernet_operations *ops)
>> +{
>> +int error;
>> +mutex_lock(&net_mutex);
>> +error = register_pernet_operations(&pernet_list, ops);
>> +if (!error && (first_device == &pernet_list))
>
> Very minor: why do you give the name "device" to some pernet_operations?

Subsystems need to be initialized before and cleaned up after network
devices.  We don't have much in the way that needs this distinction,
but we have just enough that it is useful to make this distinction.

Looking at my complete patchset all I have in this category is the
loopback device, and it is important on the teardown side of things
that the loopback device be unregistered before I clean up the
protocols or else I get weird leaks.

Reflecting on it I'm not quite certain about the setup side of things.
I'm on the fence if I need to completely dynamically allocate the
loopback device or if I need to embed it struct net.

There also may be a call for some other special network devices
to have one off instances in each network namespace.With the
new netlink creation API it isn't certain we need that idiom anymore
but before that point I was certain we would have other network
devices besides the loopback that would care.

Eric
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 17/16] net: Disable netfilter sockopts when not in the initial network namespace

2007-09-10 Thread Eric W. Biederman
Pavel Emelyanov <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>> Until we support multiple network namespaces with netfilter only allow
>> netfilter configuration in the initial network namespace.
>
> PATCH 17/16? :)

Exactly!

If my target was the core of the networking stack I figured I better
include the change that keeps netfilter commands isolated to the
initial network namespace, and in my review of completeness I had
missed that in my first pass through my patches.

Eric

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/16] net: Support multiple network namespaces with netlink

2007-09-10 Thread Eric W. Biederman
Pavel Emelyanov <[EMAIL PROTECTED]> writes:
>
> Rr. This is the 5th or even the 6th patch that changes tens of files
> but (!) most of these changes are just propagating some core thing into
> protocols, drivers, etc. E.g. you add an argument to some function and
> then make all the rest use it, but the chunk adding the argument itself
> is buried in these changes.
>
> Why not make a reviewers' lifes easier and make (with hands) the core 
> hunks go first and the "propagation" ones at the end? For RFC purpose 
> I would even break the git-bisect safeness and splitted these patches 
> into 2 parts: those with the core and those with the propagation.

Agreed, this is an issue for easy review. My apologies.

I guess it was a failure in my imagination on how to prepare these
patches for review, in a way that was both reviewer and preparer friendly.

There is at least one /proc idiom change that needs to be made so I
will keep this in mind for the resend.

Eric
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Linus Torvalds

On Mon, 10 Sep 2007, Denys Vlasenko wrote:
> 
> static inline int
> qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
> {
> int  return_status = QLA_SUCCESS;
> unsigned long loop_timeout ;
> scsi_qla_host_t *pha = to_qla_parent(ha);
> 
> /* wait for 5 min at the max for loop to be ready */
> loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);
> 
> while ((!atomic_read(&pha->loop_down_timer) &&
> atomic_read(&pha->loop_state) == LOOP_DOWN) ||
> atomic_read(&pha->loop_state) != LOOP_READY) {
> if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
...
> Is above correct or buggy? Correct, because msleep is a barrier.
> Is it obvious? No.

It's *buggy*. But it has nothing to do with any msleep() in the loop, or 
anything else.

And more importantly, it would be equally buggy even *with* a "volatile" 
atomic_read().

Why is this so hard for people to understand? You're all acting like 
morons.

The reason it is buggy has absolutely nothing to do with whether the read 
is done or not, it has to do with the fact that the CPU may re-order the 
reads *regardless* of whether the read is done in some specific order by 
the compiler ot not! In effect, there is zero ordering between all those 
three reads, and if you don't have memory barriers (or a lock or other 
serialization), that code is buggy.

So stop this idiotic discussion thread already. The above kind of code 
needs memory barriers to be non-buggy. The whole "volatile or not" 
discussion is totally idiotic, and pointless, and anybody who doesn't 
understand that by now needs to just shut up and think about it more, 
rather than make this discussion drag out even further.

The fact is, "volatile" *only* makes things worse. It generates worse 
code, and never fixes any real bugs. This is a *fact*.

Linus
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


sky2.c: length mismatch errors with vlan frames

2007-09-10 Thread Pierre-Yves Ritschard
Hi list,

I have been running recent linux kernel on nexcom NSA 1086's equipped
with sysconnect NICs.

Like some people previously have on this list I am running into
problems with these NICs and seeing frequent errors in my dmesg:

sky2 eth4: rx error, status 0x402300 length 60
printk: 17 messages suppressed.
sky2 eth4: rx error, status 0x402300 length 60
printk: 32 messages suppressed.
sky2 eth4: rx error, status 0x602300 length 92
printk: 25 messages suppressed.
sky2 eth4: rx error, status 0x6e2300 length 106
printk: 16 messages suppressed.
sky2 eth4: rx error, status 0x402300 length 60
printk: 10 messages suppressed.
sky2 eth4: rx error, status 0x402300 length 60
printk: 17 messages suppressed.
sky2 eth4: rx error, status 0x402300 length 60


I have investigated a bit and status doesn't match any of the errors in
GMR_FS_ANY_ERR.
The block generating the error is now len_mismatch, due to the fact
that on packets having the GMR_FS_VLAN bit set, the length argument to
sky2_receive is 4 bytes shorter that the 16 bit value in status (status
``right shift'' 16).

Since both these values are read from the card I don't know how to
solve this other than by incrementing the length arg by 4 when
GMR_FS_VLAN is set in status. So here's a diff that does this, although
I'm not sure its an elegant solution:

--- sky2.c.orig 2007-09-10 15:34:15.0 +0200
+++ sky2.c  2007-09-10 16:20:28.0 +0200
@@ -2059,13 +2059,16 @@ static struct sk_buff *sky2_receive(stru
sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending;
prefetch(sky2->rx_ring + sky2->rx_next);
 
+   if (status & GMR_FS_VLAN)
+   length += 4;
+
if (status & GMR_FS_ANY_ERR)
goto error;
 
if (!(status & GMR_FS_RX_OK))
goto resubmit;
 
-   if (status >> 16 != length)
+   if ((status >> 16) != length)
goto len_mismatch;
 
if (length < copybreak)
@@ -2081,6 +2084,8 @@ len_mismatch:
/* Truncation of overlength packets
   causes PHY length to not match MAC length */
++sky2->net_stats.rx_length_errors;
+   printk(KERN_INFO PFX "%s: rx length mismatch: length %d != %d\n",
+  dev->name, length, status >> 16);
 
 error:
++sky2->net_stats.rx_errors;

Any thoughts on how to solve this ?
Further testing shows that sometimes there are packets without the
GMR_FS_VLAN bit set which have a 4 bytes length difference, has shown
by this log excerpt:

sky2 eth4: rx length mismatch: length 243 != 247
sky2 eth4: rx error, status 0xf70300 length 243

For debugging purposes, I here is the little program with
excerpts from sky2.c I use to see what bits are set in the status field:

#include 
#include 
#include 
#include 

enum {
GMR_FS_LEN  = 0x<<16, /* Bit 31..16:Rx Frame Length */
GMR_FS_VLAN = 1<<13, /* VLAN Packet */
GMR_FS_JABBER   = 1<<12, /* Jabber Packet */
GMR_FS_UN_SIZE  = 1<<11, /* Undersize Packet */
GMR_FS_MC   = 1<<10, /* Multicast Packet */
GMR_FS_BC   = 1<<9,  /* Broadcast Packet */
GMR_FS_RX_OK= 1<<8,  /* Receive OK (Good Packet) */
GMR_FS_GOOD_FC  = 1<<7,  /* Good Flow-Control Packet */
GMR_FS_BAD_FC   = 1<<6,  /* Bad  Flow-Control Packet */
GMR_FS_MII_ERR  = 1<<5,  /* MII Error */
GMR_FS_LONG_ERR = 1<<4,  /* Too Long Packet */
GMR_FS_FRAGMENT = 1<<3,  /* Fragment */

GMR_FS_CRC_ERR  = 1<<1,  /* CRC Error */
GMR_FS_RX_FF_OV = 1<<0,  /* Rx FIFO Overflow */

GMR_FS_ANY_ERR  = GMR_FS_RX_FF_OV | GMR_FS_CRC_ERR |
  GMR_FS_FRAGMENT | GMR_FS_LONG_ERR |
  GMR_FS_MII_ERR | GMR_FS_BAD_FC |
  GMR_FS_UN_SIZE | GMR_FS_JABBER,
};

struct bitdesc {
const char  *name;
u_int32_tfield;
};

struct bitdesc bits[] = {
{ "GMR_FS_VLAN", 1 << 13 },
{ "GMR_FS_JABBER", 1 << 12 },
{ "GMR_FS_UN_SIZE", 1 << 11 },
{ "GMR_FS_MC", 1 << 10 },
{ "GMR_FS_BC", 1 << 9 },
{ "GMR_FS_RX_OK", 1 << 8 },
{ "GMR_FS_GOOD_FC", 1 << 7 },
{ "GMR_FS_BAD_FC", 1 << 6 },
{ "GMR_FS_MII_ERR", 1 << 5 },
{ "GMR_FS_LONG_ERR", 1 << 4 },
{ "GMR_FS_FRAGMENT", 1 << 3 },
{ "GMR_FS_CRC_ERR", 1 << 1 },
{ "GMR_FS_RX_FF_OV", 1 << 0 }
};

int
main(int argc, const char *argv[])
{
int status;
int i;

if (argc < 2)
return (1);

status = strtol(argv[1], NULL, 16);
printf("status: 0x%08x\n", status);
if (status & GMR_FS_ANY_ERR)
printf("packet has an error\n");
printf("length: %u\n", status >> 16);

for (i = 0; i < (sizeof(bits) / sizeof(bits[0])); i++) {
if (status & bits[i].field) {
printf("has bit %s\n", bits[i].name);
}
}
return (0);
}
-
To un

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 15:51, Arjan van de Ven wrote:
> On Mon, 10 Sep 2007 11:56:29 +0100
> Denys Vlasenko <[EMAIL PROTECTED]> wrote:
> 
> > 
> > Well, if you insist on having it again:
> > 
> > Waiting for atomic value to be zero:
> > 
> > while (atomic_read(&x))
> > continue;
> > 
> 
> and this I would say is buggy code all the way.
>
> Not from a pure C level semantics, but from a "busy waiting is buggy"
> semantics level and a "I'm inventing my own locking" semantics level.

After inspecting arch/*, I cannot agree with you.
Otherwise almost all major architectures use
"conceptually buggy busy-waiting":

arch/alpha
arch/i386
arch/ia64
arch/m32r
arch/mips
arch/parisc
arch/powerpc
arch/sh
arch/sparc64
arch/um
arch/x86_64

All of the above contain busy-waiting on atomic_read.

Including these loops without barriers:

arch/mips/kernel/smtc.c
while (atomic_read(&idle_hook_initialized) < 1000)
;
arch/mips/sgi-ip27/ip27-nmi.c
while (atomic_read(&nmied_cpus) != num_online_cpus());

[Well maybe num_online_cpus() is a barrier, I didn't check]

arch/sh/kernel/smp.c
if (wait)
while (atomic_read(&smp_fn_call.finished) != (nr_cpus - 1));

Bugs?
--
vda
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] [PATCH V4 0/10] net/bonding: ADD IPoIB support for the bonding driver

2007-09-10 Thread Moni Shoua
Hi all,
This patch series is a bit neglected.
Since our goal is to have bonding support for IPoIB in kernel 2.6.24 it is 
very important for us to get comments soon.

We would appreciate if you take some time to look at this and help us push this 
code upstream.

thanks

 MoniS  


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/16] net: Basic network namespace infrastructure.

2007-09-10 Thread Pavel Emelyanov
Eric W. Biederman wrote:

[snip]

> --- /dev/null
> +++ b/include/net/net_namespace.h
> @@ -0,0 +1,68 @@
> +/*
> + * Operations on the network namespace
> + */
> +#ifndef __NET_NET_NAMESPACE_H
> +#define __NET_NET_NAMESPACE_H
> +
> +#include 
> +#include 
> +#include 
> +
> +struct net {

Isn't this name is too generic? Why not net_namespace?

> + atomic_tcount;  /* To decided when the network
> +  *  namespace should be freed.
> +  */
> + atomic_tuse_count;  /* To track references we
> +  * destroy on demand
> +  */
> + struct list_headlist;   /* list of network namespaces */
> + struct work_struct  work;   /* work struct for freeing */
> +};

[snip]

> --- /dev/null
> +++ b/net/core/net_namespace.c

[snip]

> +static int setup_net(struct net *net)
> +{
> + /* Must be called with net_mutex held */
> + struct pernet_operations *ops;
> + struct list_head *ptr;
> + int error;
> +
> + memset(net, 0, sizeof(struct net));
> + atomic_set(&net->count, 1);
> + atomic_set(&net->use_count, 0);
> +
> + error = 0;
> + list_for_each(ptr, &pernet_list) {
> + ops = list_entry(ptr, struct pernet_operations, list);
> + if (ops->init) {
> + error = ops->init(net);
> + if (error < 0)
> + goto out_undo;
> + }
> + }
> +out:
> + return error;
> +out_undo:
> + /* Walk through the list backwards calling the exit functions
> +  * for the pernet modules whose init functions did not fail.
> +  */
> + for (ptr = ptr->prev; ptr != &pernet_list; ptr = ptr->prev) {

Good reason for adding list_for_each_continue_reverse :)

> + ops = list_entry(ptr, struct pernet_operations, list);
> + if (ops->exit)
> + ops->exit(net);
> + }
> + goto out;
> +}
> +
> +static int __init net_ns_init(void)
> +{
> + int err;
> +
> + printk(KERN_INFO "net_namespace: %zd bytes\n", sizeof(struct net));
> + net_cachep = kmem_cache_create("net_namespace", sizeof(struct net),
> + SMP_CACHE_BYTES,
> + SLAB_PANIC, NULL);
> + mutex_lock(&net_mutex);
> + err = setup_net(&init_net);
> +
> + net_lock();
> + list_add_tail(&init_net.list, &net_namespace_list);
> + net_unlock();
> +
> + mutex_unlock(&net_mutex);
> + if (err)
> + panic("Could not setup the initial network namespace");
> +
> + return 0;
> +}
> +
> +pure_initcall(net_ns_init);
> +
> +static int register_pernet_operations(struct list_head *list,
> +   struct pernet_operations *ops)
> +{
> + struct net *net, *undo_net;
> + int error;
> +
> + error = 0;
> + list_add_tail(&ops->list, list);
> + for_each_net(net) {
> + if (ops->init) {

Maybe it's better to do it vice-versa?
if (ops->init)
   for_each_net(net)
   ops->init(net);
...

> + error = ops->init(net);
> + if (error)
> + goto out_undo;
> + }
> + }
> +out:
> + return error;
> +
> +out_undo:
> + /* If I have an error cleanup all namespaces I initialized */
> + list_del(&ops->list);
> + for_each_net(undo_net) {
> + if (undo_net == net)
> + goto undone;
> + if (ops->exit)
> + ops->exit(undo_net);
> + }
> +undone:
> + goto out;
> +}
> +
> +static void unregister_pernet_operations(struct pernet_operations *ops)
> +{
> + struct net *net;
> +
> + list_del(&ops->list);
> + for_each_net(net)
> + if (ops->exit)

The same here.

> + ops->exit(net);
> +}
> +
> +/**
> + *  register_pernet_subsys - register a network namespace subsystem
> + *   @ops:  pernet operations structure for the subsystem
> + *
> + *   Register a subsystem which has init and exit functions
> + *   that are called when network namespaces are created and
> + *   destroyed respectively.
> + *
> + *   When registered all network namespace init functions are
> + *   called for every existing network namespace.  Allowing kernel
> + *   modules to have a race free view of the set of network namespaces.
> + *
> + *   When a new network namespace is created all of the init
> + *   methods are called in the order in which they were registered.
> + *
> + *   When a network namespace is destroyed all of the exit methods
> + *   are called in the reverse of the order with which they were
> + *   registered.
> + */
> +int register_pernet_subsys(struct pernet_operations *ops)
> +{
> + int error;
> + mutex_lock(&net_mutex);

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 14:38, Denys Vlasenko wrote:
> You are basically trying to educate me how to use atomic properly.
> You don't need to do it, as I am (currently) not a driver author.
> 
> I am saying that people who are already using atomic_read()
> (and who unfortunately did not read your explanation above)
> will still sometimes use atomic_read() as a way to read atomic value
> *from memory*, and will create nasty heisenbugs for you to debug.

static inline int
qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
{
int  return_status = QLA_SUCCESS;
unsigned long loop_timeout ;
scsi_qla_host_t *pha = to_qla_parent(ha);

/* wait for 5 min at the max for loop to be ready */
loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);

while ((!atomic_read(&pha->loop_down_timer) &&
atomic_read(&pha->loop_state) == LOOP_DOWN) ||
atomic_read(&pha->loop_state) != LOOP_READY) {
if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
return_status = QLA_FUNCTION_FAILED;
break;
}
msleep(1000);
if (time_after_eq(jiffies, loop_timeout)) {
return_status = QLA_FUNCTION_FAILED;
break;
}
}
return (return_status);
}

Is above correct or buggy? Correct, because msleep is a barrier.
Is it obvious? No.

static void
qla2x00_rst_aen(scsi_qla_host_t *ha)
{
if (ha->flags.online && !ha->flags.reset_active &&
!atomic_read(&ha->loop_down_timer) &&
!(test_bit(ABORT_ISP_ACTIVE, &ha->dpc_flags))) {
do {
clear_bit(RESET_MARKER_NEEDED, &ha->dpc_flags);

/*
 * Issue marker command only when we are going to start
 * the I/O.
 */
ha->marker_needed = 1;
} while (!atomic_read(&ha->loop_down_timer) &&
(test_bit(RESET_MARKER_NEEDED, &ha->dpc_flags)));
}
}

Is above correct? I honestly don't know. Correct, because set_bit is
a barrier on _all _memory_? Will it break if set_bit will be changed
to be a barrier only on its operand? Probably yes.

drivers/kvm/kvm_main.c

while (atomic_read(&completed) != needed) {
cpu_relax();
barrier();
}

Obviously author did not know that cpu_relax is already a barrier.
See why I think driver authors will be confused?

arch/x86_64/kernel/crash.c

static void nmi_shootdown_cpus(void)
{
...
msecs = 1000; /* Wait at most a second for the other cpus to stop */
while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
mdelay(1);
msecs--;
}
...
}

Is mdelay(1) a barrier? Yes, because it is a function on x86_64.
Absolutely the same code will be buggy on an arch where
mdelay(1) == udelay(1000), and udelay is implemented
as inline busy-wait.

arch/sparc64/kernel/smp.c

/* Wait for response */
while (atomic_read(&data.finished) != cpus)
cpu_relax();
...later in the same file...
while (atomic_read(&smp_capture_registry) != ncpus)
rmb();

I'm confused. Do we need cpu_relax() or rmb()? Does cpu_relax() imply rmb()?
(No it doesn't). Which of those two while loops needs correcting?
--
vda
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lksctp-developers] [-mm patch] net/sctp/socket.c: make 3 variables static

2007-09-10 Thread Neil Horman
On Sun, Sep 09, 2007 at 10:25:54PM +0200, Adrian Bunk wrote:
> On Fri, Aug 31, 2007 at 09:58:22PM -0700, Andrew Morton wrote:
> >...
> > Changes since 2.6.23-rc3-mm1:
> >...
> >  git-net.patch
> >...
> >  git trees
> >...
> 
> This patch makes the following needlessly globalvariables static:
> - sctp_memory_pressure
> - sctp_memory_allocated
> - sctp_sockets_allocated
> 
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
> 
Looks fine to me
Acked-by: Neil Horman <[EMAIL PROTECTED]>

Neil

-- 
/***
 *Neil Horman
 [EMAIL PROTECTED]
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***/
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Arjan van de Ven
On Mon, 10 Sep 2007 11:56:29 +0100
Denys Vlasenko <[EMAIL PROTECTED]> wrote:

> 
> Well, if you insist on having it again:
> 
> Waiting for atomic value to be zero:
> 
> while (atomic_read(&x))
> continue;
> 

and this I would say is buggy code all the way.

Not from a pure C level semantics, but from a "busy waiting is buggy"
semantics level and a "I'm inventing my own locking" semantics level.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 17/16] net: Disable netfilter sockopts when not in the initial network namespace

2007-09-10 Thread Pavel Emelyanov
Eric W. Biederman wrote:
> Until we support multiple network namespaces with netfilter only allow
> netfilter configuration in the initial network namespace.

PATCH 17/16? :)

Sorry,
Pavel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/16] net: Support multiple network namespaces with netlink

2007-09-10 Thread Pavel Emelyanov
Eric W. Biederman wrote:
> Each netlink socket will live in exactly one network namespace,
> this includes the controlling kernel sockets.
> 
> This patch updates all of the existing netlink protocols
> to only support the initial network namespace.  Request
> by clients in other namespaces will get -ECONREFUSED.
> As they would if the kernel did not have the support for
> that netlink protocol compiled in.
> 
> As each netlink protocol is updated to be multiple network
> namespace safe it can register multiple kernel sockets
> to acquire a presence in the rest of the network namespaces.
> 
> The implementation in af_netlink is a simple filter implementation
> at hash table insertion and hash table look up time.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
> ---
>  drivers/connector/connector.c   |2 +-
>  drivers/scsi/scsi_netlink.c |2 +-
>  drivers/scsi/scsi_transport_iscsi.c |2 +-
>  fs/ecryptfs/netlink.c   |2 +-
>  include/linux/netlink.h |6 ++-
>  kernel/audit.c  |4 +-
>  lib/kobject_uevent.c|5 +-
>  net/bridge/netfilter/ebt_ulog.c |5 +-
>  net/core/rtnetlink.c|4 +-
>  net/decnet/netfilter/dn_rtmsg.c |3 +-
>  net/ipv4/fib_frontend.c |4 +-
>  net/ipv4/inet_diag.c|4 +-
>  net/ipv4/netfilter/ip_queue.c   |6 +-
>  net/ipv4/netfilter/ipt_ULOG.c   |3 +-
>  net/ipv6/netfilter/ip6_queue.c  |6 +-
>  net/netfilter/nfnetlink.c   |2 +-
>  net/netfilter/nfnetlink_log.c   |3 +-
>  net/netfilter/nfnetlink_queue.c |3 +-
>  net/netlink/af_netlink.c|  106 
> ++-
>  net/netlink/genetlink.c |4 +-
>  net/xfrm/xfrm_user.c|2 +-
>  security/selinux/netlink.c  |5 +-
>  22 files changed, 122 insertions(+), 61 deletions(-)

Rr. This is the 5th or even the 6th patch that changes tens of files
but (!) most of these changes are just propagating some core thing into
protocols, drivers, etc. E.g. you add an argument to some function and
then make all the rest use it, but the chunk adding the argument itself
is buried in these changes.

Why not make a reviewers' lifes easier and make (with hands) the core 
hunks go first and the "propagation" ones at the end? For RFC purpose 
I would even break the git-bisect safeness and splitted these patches 
into 2 parts: those with the core and those with the propagation.

Thanks,
Pavel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lksctp-developers] [2.6 patch] make sctp_addto_param() static

2007-09-10 Thread Neil Horman
On Sun, Sep 09, 2007 at 10:25:50PM +0200, Adrian Bunk wrote:
> sctp_addto_param() can become static.
> 
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
> 
ACK, seems reasonable to me.
Neil


-- 
/***
 *Neil Horman
 [EMAIL PROTECTED]
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***/
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2.6 patch] make sctp_addto_param() static

2007-09-10 Thread Vlad Yasevich
Adrian Bunk wrote:
> sctp_addto_param() can become static.
> 
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

Ack

-vlad
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 13:22, Kyle Moffett wrote:
> On Sep 10, 2007, at 06:56:29, Denys Vlasenko wrote:
> > On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:
> >> On Sun, 9 Sep 2007 19:02:54 +0100
> >> Denys Vlasenko <[EMAIL PROTECTED]> wrote:
> >>
> >>> Why is all this fixation on "volatile"? I don't think people want  
> >>> "volatile" keyword per se, they want atomic_read(&x) to _always_  
> >>> compile into an memory-accessing instruction, not register access.
> >>
> >> and ... why is that?  is there any valid, non-buggy code sequence  
> >> that makes that a reasonable requirement?
> >
> > Well, if you insist on having it again:
> >
> > Waiting for atomic value to be zero:
> >
> > while (atomic_read(&x))
> > continue;
> >
> > gcc may happily convert it into:
> >
> > reg = atomic_read(&x);
> > while (reg)
> > continue;
> 
> Bzzt.  Even if you fixed gcc to actually convert it to a busy loop on  
> a memory variable, you STILL HAVE A BUG as it may *NOT* be gcc that  
> does the conversion, it may be that the CPU does the caching of the  
> memory value.  GCC has no mechanism to do cache-flushes or memory- 
> barriers except through our custom inline assembly.

CPU can cache the value all right, but it cannot use that cached value
*forever*, it has to react to invalidate cycles on the shared bus
and re-fetch new data.

IOW: atomic_read(&x) which compiles down to memory accessor
will work properly.

> the CPU.  Thirdly, on a large system it may take some arbitrarily  
> large amount of time for cache-propagation to update the value of the  
> variable in your local CPU cache.

Yes, but "arbitrarily large amount of time" is actually measured
in nanoseconds here. Let's say 1000ns max for hundreds of CPUs?

> Also, you   
> probably want a cpu_relax() in there somewhere to avoid overheating  
> the CPU.

Yes, but 
1. CPU shouldn't overheat (in a sense that it gets damaged),
   it will only use more power than needed.
2. cpu_relax() just throttles down my CPU, so it's performance
   optimization only. Wait, it isn't, it's a barrier too.
   Wow, "cpu_relax" is a barrier? How am I supposed to know
   that without reading lkml flamewars and/or header files?

Let's try reading headers. asm-x86_64/processor.h:

#define cpu_relax()   rep_nop()

So, is it a barrier? No clue yet.

/* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
static inline void rep_nop(void)
{
__asm__ __volatile__("rep;nop": : :"memory");
}

Comment explicitly says that it is "a good thing" (doesn't say
that it is mandatory) and says NOTHING about barriers!

Barrier-ness is not mentioned and is hidden in "memory" clobber.

Do you think it's obvious enough for average driver writer?
I think not, especially that it's unlikely for him to even start
suspecting that it is a memory barrier based on the "cpu_relax"
name.

> You simply CANNOT use an atomic_t as your sole synchronizing
> primitive, it doesn't work!  You virtually ALWAYS want to use an  
> atomic_t in the following types of situations:
> 
> (A) As an object refcount.  The value is never read except as part of  
> an atomic_dec_return().  Why aren't you using "struct kref"?
> 
> (B) As an atomic value counter (number of processes, for example).   
> Just "reading" the value is racy anyways, if you want to enforce a  
> limit or something then use atomic_inc_return(), check the result,  
> and use atomic_dec() if it's too big.  If you just want to return the  
> statistics then you are going to be instantaneous-point-in-time anyways.
> 
> (C) As an optimization value (statistics-like, but exact accuracy  
> isn't important).
> 
> Atomics are NOT A REPLACEMENT for the proper kernel subsystem, like  
> completions, mutexes, semaphores, spinlocks, krefs, etc.  It's not  
> useful for synchronization, only for keeping track of simple integer  
> RMW values.  Note that atomic_read() and atomic_set() aren't very  
> useful RMW primitives (read-nomodify-nowrite and read-set-zero- 
> write).  Code which assumes anything else is probably buggy in other  
> ways too.

You are basically trying to educate me how to use atomic properly.
You don't need to do it, as I am (currently) not a driver author.

I am saying that people who are already using atomic_read()
(and who unfortunately did not read your explanation above)
will still sometimes use atomic_read() as a way to read atomic value
*from memory*, and will create nasty heisenbugs for you to debug.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23-rc5: possible irq lock inversion dependency detected

2007-09-10 Thread Herbert Xu
On Sun, Sep 02, 2007 at 01:11:29PM +, Christian Kujau wrote:
> 
> after upgrading to 2.6.23-rc5 (and applying davem's fix [0]), lockdep 
> was quite noisy when I tried to shape my external (wireless) interface:
> 
> [ 6400.534545] FahCore_78.exe/3552 just changed the state of lock:
> [ 6400.534713]  (&dev->ingress_lock){-+..}, at: [] 
> netif_receive_skb+0x2d5/0x3c0
> [ 6400.534941] but this lock took another, soft-read-irq-unsafe lock in the 
> past:
> [ 6400.535145]  (police_lock){-.--}

This is a genuine dead-lock.  The police lock can be taken
for reading with softirqs on.  If a second CPU tries to take
the police lock for writing, while holding the ingress lock,
then a softirq on the first CPU can dead-lock when it tries
to get the ingress lock.

The minimal fix would be to make sure that we disable BH on
the first CPU.  Jamal, could you take a look at this please?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Kyle Moffett

On Sep 10, 2007, at 06:56:29, Denys Vlasenko wrote:

On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:

On Sun, 9 Sep 2007 19:02:54 +0100
Denys Vlasenko <[EMAIL PROTECTED]> wrote:

Why is all this fixation on "volatile"? I don't think people want  
"volatile" keyword per se, they want atomic_read(&x) to _always_  
compile into an memory-accessing instruction, not register access.


and ... why is that?  is there any valid, non-buggy code sequence  
that makes that a reasonable requirement?


Well, if you insist on having it again:

Waiting for atomic value to be zero:

while (atomic_read(&x))
continue;

gcc may happily convert it into:

reg = atomic_read(&x);
while (reg)
continue;


Bzzt.  Even if you fixed gcc to actually convert it to a busy loop on  
a memory variable, you STILL HAVE A BUG as it may *NOT* be gcc that  
does the conversion, it may be that the CPU does the caching of the  
memory value.  GCC has no mechanism to do cache-flushes or memory- 
barriers except through our custom inline assembly.  Also, you  
probably want a cpu_relax() in there somewhere to avoid overheating  
the CPU.  Thirdly, on a large system it may take some arbitrarily  
large amount of time for cache-propagation to update the value of the  
variable in your local CPU cache.  Finally, if atomics are based on  
based on spinlock+interrupt-disable then you will sit in a tight busy- 
loop of spin_lock_irqsave()->spin_unlock_irqrestore().  Depending on  
your system's internal model this may practically lock up your core  
because the spin_lock() will take the cacheline for exclusive access  
and doing that in a loop can prevent any other CPU from doing any  
operation on it!  Since your IRQs are disabled you even have a very  
small window that an IRQ will come along and free it up long enough  
for the update to take place.


The earlier code segment of:

while(atomic_read(&x) > 0)
atomic_dec(&x);
is *completely* buggy because you could very easily have 4 CPUs doing  
this on an atomic variable with a value of 1 and end up with it at  
negative 3 by the time you are done.  Moreover all the alternatives  
are also buggy, with the sole exception of this rather obvious- 
seeming one:

atomic_set(&x, 0);


You simply CANNOT use an atomic_t as your sole synchronizing  
primitive, it doesn't work!  You virtually ALWAYS want to use an  
atomic_t in the following types of situations:


(A) As an object refcount.  The value is never read except as part of  
an atomic_dec_return().  Why aren't you using "struct kref"?


(B) As an atomic value counter (number of processes, for example).   
Just "reading" the value is racy anyways, if you want to enforce a  
limit or something then use atomic_inc_return(), check the result,  
and use atomic_dec() if it's too big.  If you just want to return the  
statistics then you are going to be instantaneous-point-in-time anyways.


(C) As an optimization value (statistics-like, but exact accuracy  
isn't important).


Atomics are NOT A REPLACEMENT for the proper kernel subsystem, like  
completions, mutexes, semaphores, spinlocks, krefs, etc.  It's not  
useful for synchronization, only for keeping track of simple integer  
RMW values.  Note that atomic_read() and atomic_set() aren't very  
useful RMW primitives (read-nomodify-nowrite and read-set-zero- 
write).  Code which assumes anything else is probably buggy in other  
ways too.


So while I see no real reason for the "volatile" on the atomics, I  
also see no real reason why it's terribly harmful.  Regardless of the  
"volatile" on the operation the CPU is perfectly happy to cache it  
anyways so it doesn't buy you any actual "always-access-memory"  
guarantees.  If you are just interested in it as an optimization you  
could probably just read the properly-aligned integer counter  
directly, an atomic read on most CPUs.


If you really need it to hit main memory *every* *single* *time*  
(Why?  Are you using it instead of the proper kernel subsystem?)   
then you probably need a custom inline assembly helper anyways.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] remove asm/bitops.h includes

2007-09-10 Thread Ralf Baechle
On Sat, Sep 08, 2007 at 09:00:08PM +0100, Jiri Slaby wrote:

> 
> remove asm/bitops.h includes
> 
> including asm/bitops directly may cause compile errors. don't include it
> and include linux/bitops instead. next patch will deny including asm header
> directly.
> 
> Cc: Adrian Bunk <[EMAIL PROTECTED]>
> Signed-off-by: Jiri Slaby <[EMAIL PROTECTED]>

For the MIPS and hamradio bits:

Acked-by: Ralf Baechle <[EMAIL PROTECTED]>

  Ralf
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates

2007-09-10 Thread jamal
On Mon, 2007-10-09 at 10:20 +0100, James Chapman wrote:
> jamal wrote:
> 
> > If the problem i am trying to solve is "reduce cpu use at lower rate",
> > then this is not the right answer because your cpu use has gone up.
> 
> The problem I'm trying to solve is "reduce the max interrupt rate from 
> NAPI drivers while minimizing latency".

As long as what you are saying above translates to "there is one
interupt per packet per napi poll" then we are saying the same thing.

>  In modern systems, the interrupt 
> rate can be so high that the CPU spends too much time processing 
> interrupts, resulting in the system's behavior seen by the user being 
> degraded.

modern systems also can handle interupts a lot better. 
If you can amortize two packets per interupt per napi poll then you
have done better than the breakeven point; however, I think it is fair
to also disprove that in modern hardware the breakeven point is met with
amortizing two packets.

> Having the poll() called when idle will always increase CPU usage. But 
> the feedback you and others are giving encourages me to find a better 
> compromise. 

i dont mean in any way to discourage you - just making you work
better ;-> It is very refreshing to see that you understand the scope is
performance not vomiting endless versions of code - and for this i feel
obligated to help.

> I'll go away and do some tests. I'll post results here for discussion later.

way to go.

cheers,
jamal

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] sb1250-mac.c: De-typedef, de-volatile, de-etc...

2007-09-10 Thread Maciej W. Rozycki
 Remove typedefs, volatiles and convert kmalloc()/memset() pairs to
kcalloc().  Also reformat the surrounding clutter.

Signed-off-by: Maciej W. Rozycki <[EMAIL PROTECTED]>
---
 Per your request, Andrew, a while ago.  It builds, runs, passes 
checkpatch.pl and sparse.  No semantic changes.

 Please apply,

  Maciej

patch-mips-2.6.23-rc5-20070904-sb1250-mac-typedef-7
diff -up --recursive --new-file 
linux-mips-2.6.23-rc5-20070904.macro/drivers/net/sb1250-mac.c 
linux-mips-2.6.23-rc5-20070904/drivers/net/sb1250-mac.c
--- linux-mips-2.6.23-rc5-20070904.macro/drivers/net/sb1250-mac.c   
2007-07-10 04:57:46.0 +
+++ linux-mips-2.6.23-rc5-20070904/drivers/net/sb1250-mac.c 2007-09-10 
01:01:09.0 +
@@ -140,17 +140,17 @@ MODULE_PARM_DESC(int_timeout_rx, "RX tim
  * */
 
 
-typedef enum { sbmac_speed_auto, sbmac_speed_10,
-  sbmac_speed_100, sbmac_speed_1000 } sbmac_speed_t;
+enum sbmac_speed { sbmac_speed_auto, sbmac_speed_10,
+  sbmac_speed_100, sbmac_speed_1000 };
 
-typedef enum { sbmac_duplex_auto, sbmac_duplex_half,
-  sbmac_duplex_full } sbmac_duplex_t;
+enum sbmac_duplex { sbmac_duplex_auto, sbmac_duplex_half,
+   sbmac_duplex_full };
 
-typedef enum { sbmac_fc_auto, sbmac_fc_disabled, sbmac_fc_frame,
-  sbmac_fc_collision, sbmac_fc_carrier } sbmac_fc_t;
+enum sbmac_fc { sbmac_fc_auto, sbmac_fc_disabled, sbmac_fc_frame,
+   sbmac_fc_collision, sbmac_fc_carrier } sbmac_fc_t;
 
-typedef enum { sbmac_state_uninit, sbmac_state_off, sbmac_state_on,
-  sbmac_state_broken } sbmac_state_t;
+enum sbmac_state { sbmac_state_uninit, sbmac_state_off, sbmac_state_on,
+  sbmac_state_broken };
 
 
 /**
@@ -176,55 +176,61 @@ typedef enum { sbmac_state_uninit, sbmac
  *  DMA Descriptor structure
  * */
 
-typedef struct sbdmadscr_s {
+struct sbdmadscr {
uint64_t  dscr_a;
uint64_t  dscr_b;
-} sbdmadscr_t;
-
-typedef unsigned long paddr_t;
+};
 
 /**
  *  DMA Controller structure
  * */
 
-typedef struct sbmacdma_s {
+struct sbmacdma {
 
/*
 * This stuff is used to identify the channel and the registers
 * associated with it.
 */
-
-   struct sbmac_softc *sbdma_eth;  /* back pointer to associated MAC */
-   int  sbdma_channel; /* channel number */
-   int  sbdma_txdir;   /* direction (1=transmit) */
-   int  sbdma_maxdescr;/* total # of descriptors in ring */
+   struct sbmac_softc  *sbdma_eth; /* back pointer to associated
+  MAC */
+   int sbdma_channel;  /* channel number */
+   int sbdma_txdir;/* direction (1=transmit) */
+   int sbdma_maxdescr; /* total # of descriptors
+  in ring */
 #ifdef CONFIG_SBMAC_COALESCE
-   int  sbdma_int_pktcnt;  /* # descriptors rx/tx before 
interrupt*/
-   int  sbdma_int_timeout; /* # usec rx/tx interrupt */
-#endif
-
-   volatile void __iomem *sbdma_config0;   /* DMA config register 0 */
-   volatile void __iomem *sbdma_config1;   /* DMA config register 1 */
-   volatile void __iomem *sbdma_dscrbase;  /* Descriptor base address */
-   volatile void __iomem *sbdma_dscrcnt;   /* Descriptor count register */
-   volatile void __iomem *sbdma_curdscr;   /* current descriptor address */
-   volatile void __iomem *sbdma_oodpktlost;/* pkt drop (rx only) */
-
+   int sbdma_int_pktcnt;
+   /* # descriptors rx/tx
+  before interrupt */
+   int sbdma_int_timeout;
+   /* # usec rx/tx interrupt */
+#endif
+   void __iomem*sbdma_config0; /* DMA config register 0 */
+   void __iomem*sbdma_config1; /* DMA config register 1 */
+   void __iomem*sbdma_dscrbase;
+   /* descriptor base address */
+   void __iomem*sbdma_dscrcnt; /* descriptor count register */
+   void __iomem*sbdma_curdscr; /* current descriptor
+  address */
+   void __iomem*sbdma_oodpktlost;
+   /* pkt drop (rx only) */
 
/*
 * This stuff is for maintenance of the ring
 */
-
-   sbdmadscr_

Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates

2007-09-10 Thread jamal
On Sat, 2007-08-09 at 09:42 -0700, Mandeep Singh Baines wrote:

> Reading the "interrupt pending" register would require an MMIO read.
> MMIO reads are very expensive. In some systems the latency of an MMIO
> read can be 1000x that of an L1 cache access.

Indeed.

> However, work_done() doesn't have to be inefficient. For newer
> devices you can implement work_done() without an MMIO read by polling
> the next ring entry status in memory or some other mechanism. Since
> PCI is coherent, acceses to this memory location could be cached
> after the first miss. For architectures where PCI is not coherent you'd 
> have to go to memory for every poll. So for these architectures has_work()
> will be moderately expensive (memory access) even when has_work() does
> not require an MMIO read. This might affect home routers: not sure if MIPS or
> ARM have coherent PCI.

I think the effect would be clearly experimentally observable in smaller
devices e.g the Geode you seem to be experimenting on.
One other suggestion i made in the paper is to something along the lines
of cached_irq_mask for the i8259

cheers,
jamal

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ipconfig.c: De-clutter IP configuration report

2007-09-10 Thread Maciej W. Rozycki
 Reformat the printk() calls removing leading new-line characters, making 
output being done line-by-line rather than partially and defining the log 
level used.

Signed-off-by: Maciej W. Rozycki <[EMAIL PROTECTED]>
---
 The new code builds fine; no semantic changes.

 Please apply,

  Maciej

patch-mips-2.6.23-rc5-20070904-ipconfig-printk-2
diff -up --recursive --new-file 
linux-mips-2.6.23-rc5-20070904.macro/net/ipv4/ipconfig.c 
linux-mips-2.6.23-rc5-20070904/net/ipv4/ipconfig.c
--- linux-mips-2.6.23-rc5-20070904.macro/net/ipv4/ipconfig.c2007-09-04 
04:56:22.0 +
+++ linux-mips-2.6.23-rc5-20070904/net/ipv4/ipconfig.c  2007-09-10 
11:53:19.0 +
@@ -1364,17 +1364,17 @@ static int __init ip_auto_config(void)
/*
 * Clue in the operator.
 */
-   printk("IP-Config: Complete:");
-   printk("\n  device=%s", ic_dev->name);
-   printk(", addr=%u.%u.%u.%u", NIPQUAD(ic_myaddr));
-   printk(", mask=%u.%u.%u.%u", NIPQUAD(ic_netmask));
-   printk(", gw=%u.%u.%u.%u", NIPQUAD(ic_gateway));
-   printk(",\n host=%s, domain=%s, nis-domain=%s",
-  utsname()->nodename, ic_domain, utsname()->domainname);
-   printk(",\n bootserver=%u.%u.%u.%u", NIPQUAD(ic_servaddr));
-   printk(", rootserver=%u.%u.%u.%u", NIPQUAD(root_server_addr));
-   printk(", rootpath=%s", root_server_path);
-   printk("\n");
+   pr_info("IP-Config: Complete:\n");
+   pr_info("  device=%s, addr=%u.%u.%u.%u, "
+   "mask=%u.%u.%u.%u, gw=%u.%u.%u.%u,\n",
+   ic_dev->name, NIPQUAD(ic_myaddr),
+   NIPQUAD(ic_netmask), NIPQUAD(ic_gateway));
+   pr_info("  host=%s, domain=%s, nis-domain=%s,\n",
+   utsname()->nodename, ic_domain, utsname()->domainname);
+   pr_info("  bootserver=%u.%u.%u.%u, "
+   "rootserver=%u.%u.%u.%u, rootpath=%s\n",
+   NIPQUAD(ic_servaddr),
+   NIPQUAD(root_server_addr), root_server_path);
 #endif /* !SILENT */
 
return 0;
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23-rc5: possible irq lock inversion dependency detected

2007-09-10 Thread Peter Zijlstra
On Sun, 2007-09-02 at 15:11 +0200, Christian Kujau wrote:
> Hi,
> 
> after upgrading to 2.6.23-rc5 (and applying davem's fix [0]), lockdep 
> was quite noisy when I tried to shape my external (wireless) interface:
> 
> [ 6400.534545] FahCore_78.exe/3552 just changed the state of lock:
> [ 6400.534713]  (&dev->ingress_lock){-+..}, at: [] 
> netif_receive_skb+0x2d5/0x3c0
> [ 6400.534941] but this lock took another, soft-read-irq-unsafe lock in the 
> past:
> [ 6400.535145]  (police_lock){-.--}
> 
> This happened when I executed: 
> http://nerdbynature.de/bits/2.6.23-rc5/qos.sh.txt
> (using iproute2-ss070313). The is still running, I just noticed a short 
> hickup, probably when it was busy writing the warning to the disk.
> 
> More details and .config: http://nerdbynature.de/bits/2.6.23-rc5/

seems unavailable at this time, please submit the whole lockdep report
if possible.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] cfg80211: fix initialisation if built-in

2007-09-10 Thread Johannes Berg
When cfg80211 is built into the kernel it needs to init earlier
so that device registrations are run after it has initialised.

Signed-off-by: Johannes Berg <[EMAIL PROTECTED]>

---
 net/wireless/core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- wireless-dev.orig/net/wireless/core.c   2007-09-10 13:00:18.567257487 
+0200
+++ wireless-dev/net/wireless/core.c2007-09-10 13:00:59.837219347 +0200
@@ -363,7 +363,7 @@ out_fail_notifier:
 out_fail_sysfs:
return err;
 }
-module_init(cfg80211_init);
+subsys_initcall(cfg80211_init);
 
 static void cfg80211_exit(void)
 {


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Herbert Xu
On Mon, Sep 10, 2007 at 11:56:29AM +0100, Denys Vlasenko wrote:
> 
> Expecting every driver writer to remember that atomic_read is not in fact
> a "read from memory" is naive. That won't happen. Face it, majority of
> driver authors are a bit less talented than Ingo Molnar or Arjan van de Ven ;)
> The name of the macro is saying that it's a read.
> We are confusing users here.

For driver authors who're too busy to learn the intricacies
of atomic operations, we have the plain old spin lock which
then lets you use normal data structures such as u32 safely.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2/2] 2.6.23-rc5: known regressions with patches v2

2007-09-10 Thread Johannes Berg
On Mon, 2007-09-10 at 13:09 +0200, Johannes Berg wrote:
> On Sat, 2007-09-08 at 13:11 +0200, Michal Piotrowski wrote:
> 
> > Subject : Unable to access memory card reader anymore

> Last known good/caused-by doesn't apply, the bug has been in there ever
> since the mac80211 code was merged into the tree.

Eh, no, I quoted the wrong one by accident,

Subject : ifconfig eth1 - scheduling while atomic: 
ifconfig/0x0002/4170

is the one I was thinking of.

johannes


signature.asc
Description: This is a digitally signed message part


Re: [2/2] 2.6.23-rc5: known regressions with patches v2

2007-09-10 Thread Johannes Berg
On Sat, 2007-09-08 at 13:11 +0200, Michal Piotrowski wrote:

> Subject : Unable to access memory card reader anymore
> References  : http://bugzilla.kernel.org/show_bug.cgi?id=8885
> Last known good : ?
> Submitter   : Christian Casteyde <[EMAIL PROTECTED]>
> Caused-By   : ?
> Handled-By  : Alan Stern <[EMAIL PROTECTED]>
> Patch   : http://bugzilla.kernel.org/attachment.cgi?id=12438
> Status  : patch available

Last known good/caused-by doesn't apply, the bug has been in there ever
since the mac80211 code was merged into the tree.

johannes


signature.asc
Description: This is a digitally signed message part


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:
> On Sun, 9 Sep 2007 19:02:54 +0100
> Denys Vlasenko <[EMAIL PROTECTED]> wrote:
> 
> > Why is all this fixation on "volatile"? I don't think
> > people want "volatile" keyword per se, they want atomic_read(&x) to
> > _always_ compile into an memory-accessing instruction, not register
> > access.
> 
> and ... why is that?
> is there any valid, non-buggy code sequence that makes that a
> reasonable requirement?

Well, if you insist on having it again:

Waiting for atomic value to be zero:

while (atomic_read(&x))
continue;

gcc may happily convert it into:

reg = atomic_read(&x);
while (reg)
continue;

Expecting every driver writer to remember that atomic_read is not in fact
a "read from memory" is naive. That won't happen. Face it, majority of
driver authors are a bit less talented than Ingo Molnar or Arjan van de Ven ;)
The name of the macro is saying that it's a read.
We are confusing users here.

It's doubly confusing that cpy_relax(), which says _nothing_ about barriers
in its name, is actually a barrier you need to insert here.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: auto recycling of TIME_WAIT connections

2007-09-10 Thread Pádraig Brady
Rick Jones wrote:
>> The first issue, requires a large timeout, and
>> the TIME_WAIT timeout is currently 60 seconds on linux.
>> That timeout effectively limits the connection rate between
>> local TCP clients and a server to 32k/60s or around 500
>> connections/second.
> 
> Actually, it would be more like 60k/60s if the application were making
> explicit calls to bind() as arguably it should if it is going to be
> churning through so many connections.


> This was an issue over a decade ago with SPECweb96 benchmarking.  The
> initial solution was to make the explicit bind() calls and not rely on
> the anonymous/ephemeral port space.  After that, one starts adding
> additional IP's into the mix (at least where possible).  And if that
> fails, one has to go back to the beginning and ask oneself exactly why a
> client is trying to churn through so many connections per second in the
> first place.

right. This is for benchmarking mainly.
Sane applications would use persistent connections,
or a different form of IPC.

> 
> If we were slavishly conformant to the RFC's :) that 60 seconds would be
> 240 seconds...
> 
>> But that issue can't really happen when the client
>> and server are on the same machine can it, and
>> even if it could, the timeouts involved would be shorter.
>>
>> Now linux does have an (undocumented)
>> /proc/sys/net/ipv4/tcp_tw_recycle flag
>> to enable recycling of TIME_WAIT connections. This is global however
>> and could cause
>> problems in general for external connections.
> 
> Rampant speculation begins...
> 
> If the client can be convinced to just call shutdown(SHUT_RDWR) rather
> than close(), and be the first to do so, ahead of the server, I think it
> will retain a link to the TCP endpoint in TIME_WAIT.  It could then, in
> TCP theory, call connect() again, and go through a path that allows
> transition from TIME_WAIT to ESTABLISHED if all the right things wrt
> Initial Sequence Number selection happen.  Whether randomization of the
> ISN allows that today is questionable.

Sounds good, unfortunately connect() returns EISCONN
unless you do a close().

> 
>> So how about auto enabling recycling for local connections?
> 
> I think the standard response is that one can never _really_ know what
> is local and what not, particularly in the presence of netfilter and the
> rewriting of headers behind one's back.

Hmm, I was afraid someone would say that :)

thanks for the feedback,
Pádraig.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


un

2007-09-10 Thread Robert Jordens
unsubscribe netdev
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates

2007-09-10 Thread James Chapman

Mandeep Singh Baines wrote:
Why would using a timer to hold off the napi_complete() rather than 
jiffy count limit the polls per packet to 2?



I was thinking a timer could be used in the way suggested in Jamal's
paper. The driver would do nothing (park) until the timer expires. So
there would be no calls to poll for the duration of the timer. Hence,
this approach would add extra latency not present in a jiffy polling
approach.


Ah, ok. I wasn't planning to test timer-driven polling. :)

Why wouldn't it be efficient? It would usually be done by reading an 
"interrupt pending" register.



Reading the "interrupt pending" register would require an MMIO read.
MMIO reads are very expensive. In some systems the latency of an MMIO
read can be 1000x that of an L1 cache access.


Agreed. Testing for any work being available should be as efficient as 
possible and would be driver specific.


--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates

2007-09-10 Thread James Chapman

Andi Kleen wrote:

James Chapman <[EMAIL PROTECTED]> writes:



On some platforms the precise timers (like ktime_get()) can be slow,
but often they are fast.  It might make sense to use a shorter
constant time wait on those with fast timers at least. Right now this
cannot be known by portable code, but there was a proposal some time
ago to export some global estimate to tell how fast
ktime_get().et.al. are. That could be reviewed.


Interesting. Is ktime_get() fast enough on P4 systems? I'll be using 
those to test with.


--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates

2007-09-10 Thread James Chapman

Jason Lunz wrote:

I'd be particularly interested to see what happens to your latency when
other apps are hogging the cpu. I assume from your description that your
cpu is mostly free to schedule the niced softirqd for the device polling
duration, but this won't always be the case. If other tasks are running
at high priority, it could be nearly a full jiffy before softirqd gets
to check the poll list again and the latency introduced could be much
higher than you've yet measured.


Indeed. The effect of cpu load on all of this is important to consider. 
The challenge will be how to test it fairly on different test runs.


One thing to bear in mind is that interrupts are processed at highest 
priority, above any scheduled work. Reducing interrupt rate gives the 
scheduler more chance to run what it thinks is the next highest priority 
work. This might be at the expense of network processing. Is it better 
to give other runnable tasks a fair chunk of the cpu pie? I think so.


I'll try to incorporate application cpu load into my tests. Thanks for 
your feedback.


--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates

2007-09-10 Thread James Chapman

jamal wrote:


If the problem i am trying to solve is "reduce cpu use at lower rate",
then this is not the right answer because your cpu use has gone up.


The problem I'm trying to solve is "reduce the max interrupt rate from 
NAPI drivers while minimizing latency". In modern systems, the interrupt 
rate can be so high that the CPU spends too much time processing 
interrupts, resulting in the system's behavior seen by the user being 
degraded.


Having the poll() called when idle will always increase CPU usage. But 
the feedback you and others are giving encourages me to find a better 
compromise. At the end of the day, it's going to be a trade-off of cpu 
and latency. The trade-off will be different for each hardware system 
and application environment so parameters need to be tunable.


I'll go away and do some tests. I'll post results here for discussion later.

Thanks for your feedback.

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH resend] Fix a lock problem in generic phy code

2007-09-10 Thread Herbert Xu
Hans-J??rgen Koch <[EMAIL PROTECTED]> wrote:
>
> The following patch fixes it. Tested on an AT91SAM9263-EK board, kernel 
> 2.6.23-rc4 and -rc3-mm1.

Could you please audit all instances of physdev->lock and add
_bh where necessary?  I can see that at least phys_stop also
needs the _bh.

We should also consider whether it makes sense to move the
timer into a work queue.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IPv6] BUG: NULL pointer dereference in(?) ip6_flush_pending_frames

2007-09-10 Thread Bernhard Schmidt

YOSHIFUJI Hideaki / 吉藤英明:

Hi,


BUG: unable to handle kernel NULL pointer dereference at virtual address
008c

:

EIP is at ip6_flush_pending_frames+0x97/0x121


I think I've found a bug.

[...]

Anyway, please try this.


FTR, I tried 2.6.22.6 without the patch and it failed as well. The 
patched kernel is running since yesterday evening (about 8h now) and 
seems to be stable so far. Too early to tell for sure, but I guess we 
have a fix.


Thanks Yoshifuji!

Regards,
Bernhard
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[INET_DIAG]: Fix oops in netlink_rcv_skb

2007-09-10 Thread Patrick McHardy
Fix the oidentd oops reported by Athanasius <[EMAIL PROTECTED]> in
http://bugzilla.kernel.org/show_bug.cgi?id=8961

The oops is a 2.6.22 regression and triggerable by normal users.
The patch applies cleanly to current -git and stable-2.6.22.

[INET_DIAG]: Fix oops in netlink_rcv_skb

netlink_run_queue() doesn't handle multiple processes processing the
queue concurrently. Serialize queue processing in inet_diag to fix
a oops in netlink_rcv_skb caused by netlink_run_queue passing a
NULL for the skb.

BUG: unable to handle kernel NULL pointer dereference at virtual address 
0054
[349587.500454]  printing eip:
[349587.500457] c03318ae
[349587.500459] *pde = 
[349587.500464] Oops:  [#1]
[349587.500466] PREEMPT SMP
[349587.500474] Modules linked in: w83627hf hwmon_vid i2c_isa
[349587.500483] CPU:0
[349587.500485] EIP:0060:[]Not tainted VLI
[349587.500487] EFLAGS: 00010246   (2.6.22.3 #1)
[349587.500499] EIP is at netlink_rcv_skb+0xa/0x7e
[349587.500506] eax:    ebx:    ecx: c148d2a0   edx: c0398819
[349587.500510] esi:    edi: c0398819   ebp: c7a21c8c   esp: c7a21c80
[349587.500517] ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
[349587.500521] Process oidentd (pid: 17943, ti=c7a2 task=cee231c0 
task.ti=c7a2)
[349587.500527] Stack:  c7a21cac f7c8ba78 c7a21ca4 c0331962 c0398819 
f7c8ba00 004c
[349587.500542]f736f000 c7a21cb4 c03988e3 0001 f7c8ba00 c7a21cc4 
c03312a5 004c
[349587.500558]f7c8ba00 c7a21cd4 c0330681 f7c8ba00 e4695280 c7a21d00 
c03307c6 7fff
[349587.500578] Call Trace:
[349587.500581]  [] show_trace_log_lvl+0x1c/0x33
[349587.500591]  [] show_stack_log_lvl+0x8d/0xaa
[349587.500595]  [] show_registers+0x1cb/0x321
[349587.500604]  [] die+0x112/0x1e1
[349587.500607]  [] do_page_fault+0x229/0x565
[349587.500618]  [] error_code+0x72/0x78
[349587.500625]  [] netlink_run_queue+0x40/0x76
[349587.500632]  [] inet_diag_rcv+0x1f/0x2c
[349587.500639]  [] netlink_data_ready+0x57/0x59
[349587.500643]  [] netlink_sendskb+0x24/0x45
[349587.500651]  [] netlink_unicast+0x100/0x116
[349587.500656]  [] netlink_sendmsg+0x1c2/0x280
[349587.500664]  [] sock_sendmsg+0xba/0xd5
[349587.500671]  [] sys_sendmsg+0x17b/0x1e8
[349587.500676]  [] sys_socketcall+0x230/0x24d
[349587.500684]  [] syscall_call+0x7/0xb
[349587.500691]  ===
[349587.500693] Code: f0 ff 4e 18 0f 94 c0 84 c0 0f 84 66 ff ff ff 89 f0 e8 86 
e2 fc ff e9 5a ff ff ff f0 ff 40 10 eb be 55 89 e5 57 89 d7 56 89 c6 53 <8b> 50 
54 83 fa 10 72 55 8b 9e 9c 00 00 00 31 c9 8b 03 83 f8 0f

Reported by Athanasius <[EMAIL PROTECTED]>

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index dbeacd8..def007e 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -836,12 +836,16 @@ static int inet_diag_rcv_msg(struct sk_buff *skb, struct 
nlmsghdr *nlh)
return inet_diag_get_exact(skb, nlh);
 }
 
+static DEFINE_MUTEX(inet_diag_mutex);
+
 static void inet_diag_rcv(struct sock *sk, int len)
 {
unsigned int qlen = 0;
 
do {
+   mutex_lock(&inet_diag_mutex);
netlink_run_queue(sk, &qlen, &inet_diag_rcv_msg);
+   mutex_unlock(&inet_diag_mutex);
} while (qlen);
 }
 


Re: sh: add support for ax88796 and 93cx6 to highlander boards

2007-09-10 Thread Paul Mundt
On Mon, Sep 10, 2007 at 03:36:26PM +0900, Magnus Damm wrote:
> --- 0004/arch/sh/boards/renesas/r7780rp/setup.c
> +++ work/arch/sh/boards/renesas/r7780rp/setup.c   2007-09-06 
> 15:35:49.0 +0900
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static struct resource r8a66597_usb_host_resources[] = {
>   [0] = {
> @@ -136,11 +137,50 @@ static struct platform_device heartbeat_
>   .resource   = heartbeat_resources,
>  };
>  
> +static struct ax_plat_data ax88796_platdata = {
> + .flags  = AXFLG_HAS_93CX6,
> + .wordlength = 2,
> + .dcr_val= 0x1,
> + .rcr_val= 0x40,
> +};
> +
> +static struct resource ax88796_resources[] = {
> + {
> +#ifdef CONFIG_SH_R7780RP
> + .start  = 0xa5800400,
> + .end= 0xa5800400 + (0x20 * 0x2) - 1,
> +#else
> + .start  = 0xa4100400,
> + .end= 0xa4100400 + (0x20 * 0x2) - 1,
> +#endif
> + .flags  = IORESOURCE_MEM,
> + },
> + {
> + .start  = IRQ_AX88796,
> + .end= IRQ_AX88796,
> + .flags  = IORESOURCE_IRQ,
> + },
> +};
> +
> +static struct platform_device ax88796_device = {
> + .name   = "ax88796",
> + .id = 0,
> +
> + .dev= {
> + .platform_data = &ax88796_platdata,
> + },
> +
> + .num_resources  = ARRAY_SIZE(ax88796_resources),
> + .resource   = ax88796_resources,
> +};
> +
> +
>  static struct platform_device *r7780rp_devices[] __initdata = {
>   &r8a66597_usb_host_device,
>   &m66592_usb_peripheral_device,
>   &cf_ide_device,
>   &heartbeat_device,
> + &ax88796_device,
>  };
>  
>  static int __init r7780rp_devices_setup(void)

These bits I'll merge separately once the other patches have been
applied.

> --- 0001/drivers/net/Kconfig
> +++ work/drivers/net/Kconfig  2007-09-06 15:35:41.0 +0900
> @@ -218,13 +218,20 @@ source "drivers/net/arm/Kconfig"
>  
>  config AX88796
>   tristate "ASIX AX88796 NE2000 clone support"
> - depends on ARM || MIPS
> + depends on ARM || MIPS || SUPERH
>   select CRC32
>   select MII
>   help
> AX88796 driver, using platform bus to provide
> chip detection and resources
>  
> +config AX88796_93CX6
> + bool "ASIX AX88796 external 93CX6 eeprom support"
> + depends on AX88796
> + select EEPROM_93CX6
> + help
> +   Select this if your platform comes with an external 93CX6 eeprom.
> +
>  config MACE
>   tristate "MACE (Power Mac ethernet) support"
>   depends on PPC_PMAC && PPC32

There are two different changes here, these should probably be split up
and applied independently of each other, given that there's no real
dependency between them.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html