date:20071220

Re: [PATCH 2.6.25 0/9]: SCTP: Update ADD-IP implementation to conform to spec

2007-12-20 Thread David Miller

From: Vlad Yasevich [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 15:53:47 -0500

 Not sure if you got the PATCH 7/9 resend, but it looks like netdev ate that
 too.
 
 I made this patch set available here:
  master.kernel.org:/pub/scm/linux/kernel/git/vxy/lksctp-dev.git addip

I got the patch, there is probably some keyword in there
that is making it get consumed by the majordomo regexp
filters we have in place.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 00/29] Swap over NFS -v15

2007-12-20 Thread Peter Zijlstra


On Wed, 2007-12-19 at 17:22 -0500, Bill Davidsen wrote:
 Peter Zijlstra wrote:
  Hi,
  
  Another posting of the full swap over NFS series. 
  
  Andrew/Linus, could we start thinking of sticking this in -mm?
  
 
 Two questions:
 1 - what is the memory use impact on systems which don't do swap over 
 NFS, such as embedded systems, and

It should have little to no impact if not used.

 2 - what is the advantage of this code over the two existing network 
 swap approaches, 

 swapping to NFS mounted file and 

This is not actually possible with a recent kernel, current swapfile
support requires a blockdevice.

 swap to NBD device?

 I've used the NFS file when a program was running out of memory and that 
 seemed to work, people in UNYUUG have reported that the nbd swap works, 
 so what's better here?

Swap over NBD works sometimes, but it's rather easy to deadlock, and it's
impossible to recover from a broken connection.



Re: [PATCH] One more XFRM audit fix

2007-12-20 Thread David Miller

From: Paul Moore [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 14:29:31 -0500

 The following patch is backed against David's net-2.6 tree and is pretty
 trivial.  I know we're late in the 2.6.24 cycle but I think this is worth
 merging, if you guys don't feel that way let me know and I'll resubmit it
 for 2.6.25.

Where is that patch?  Or do you mean the fix you emailed
separately today (which I will apply, thanks)?

 As a side note, I'm unable to actually test the patch because I can't get
 the kernel to compile (M=net/xfrm works just fine).  The problem I keep
 seeing is below:

 make[3]: *** No rule to make target \
  `/blah/kernels/net-2.6_xfrm-auid-secid-fix/include/linux/ticable.h', \
   needed by \
  `/blah/kernels/net-2.6_xfrm-auid-secid-fix/usr/include/linux/ticable.h'. \
   Stop.

Remove ticable.h from include/linux/Kbuild

This is already cured in Linus's tree.

Re: [PATCH] XFRM: Audit function arguments misordered

2007-12-20 Thread David Miller

From: Paul Moore [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 14:29:38 -0500

 In several places the arguments to the xfrm_audit_start() function are in the
 wrong order resulting in incorrect user information being reported.  This
 patch corrects this by placing the arguments in the correct order.

 Signed-off-by: Paul Moore [EMAIL PROTECTED]

Applied, thanks for fixing this bug.

Re: [PATCH][IPV4] ip_gre: set mac_header correctly in receive path

2007-12-20 Thread David Miller

From: Timo_Teräs [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 20:10:41 +0200

 From: Timo Teras [EMAIL PROTECTED]

 mac_header update in ipgre_recv() was incorrectly changed to
 skb_reset_mac_header() when it was introduced.

 Signed-off-by: Timo Teras [EMAIL PROTECTED]

Patch applied, thanks.

 ---
 This replaces my earlier patch titled "ip_gre: use skb->{mac,
 network}_header consistently". Apparently I hadn't done my homework on how
 to use *_header correctly. And I should have done a bit more testing to
 figure out that the previous patch does not work.

 But the main problem was the receive path in the first place, and this
 patch fixes it.

 The bug was introduced in commit 459a98ed881802dee55897441bc7f77af614368e.
 There might be other similar incorrect replaces.

That commit has two other identical bad conversions, I'll
fix them up.

Re: [PATCH net-2.6.25 1/3] Uninline the __inet_hash function

2007-12-20 Thread David Miller

From: Eric Dumazet [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 18:15:20 +0100

 Pavel Emelyanov a écrit :
  That's not true, if I get you right. The __inet_hash() is called
  with 0, from all the places except for the inet_hash() one.

 OK, but on cases with 0, sk->sk_state is != TCP_LISTEN, unless I am mistaken.

This is true.

Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly

2007-12-20 Thread YOSHIFUJI Hideaki / 吉藤英明

In article [EMAIL PROTECTED] (at Thu, 20 Dec 2007 12:31:27 +0900), Satoru 
SATOH [EMAIL PROTECTED] says:

 diff --git a/ip/iproute.c b/ip/iproute.c
 index f4200ae..fa722c6 100644
 --- a/ip/iproute.c
 +++ b/ip/iproute.c
 @@ -510,16 +510,16 @@ int print_route(const struct sockaddr_nl *who,
 struct nlmsghdr *n, void *arg)
 fprintf(fp, " %u",
 *(unsigned*)RTA_DATA(mxrta[i]));
 else {
 unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
 +   unsigned hz1 = hz / 1000;
 
 -   val *= 1000;
 if (i == RTAX_RTT)

I think this is incorrect; hz might not be 1000; e.g. 250 etc.

--yoshfuji
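
To illustrate the concern: with integer division, hz / 1000 is 0 for any
hz below 1000 (e.g. 250), so dividing by it cannot work, while the original
val *= 1000 wraps for large values. A minimal userspace sketch of one way
to avoid both problems, by widening to 64 bits before multiplying. This is
not the iproute code; the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a tick count to milliseconds without overflowing 32 bits.
 * `ticks` and `hz` are illustrative names; `hz` is the number of
 * ticks per second and must be non-zero.
 */
static uint32_t ticks_to_ms(uint32_t ticks, uint32_t hz)
{
	assert(hz != 0);
	/* Widen before multiplying so ticks * 1000 cannot wrap. */
	return (uint32_t)(((uint64_t)ticks * 1000u) / hz);
}

/* Overflow-prone form for comparison: the 32-bit product can wrap. */
static uint32_t ticks_to_ms_naive(uint32_t ticks, uint32_t hz)
{
	assert(hz != 0);
	return (ticks * 1000u) / hz;
}
```

With hz = 250, ticks_to_ms(750, 250) yields 3000, i.e. the 3 second
rto_min from the report; no divisor derived from hz / 1000 is needed.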

Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly

2007-12-20 Thread Jarek Poplawski

On 20-12-2007 04:31, Satoru SATOH wrote:
 ip route show does not print correct value when larger rto_min is
 set (e.g. 3sec).
 
 This problem is because of overflow in print_route() and
 the patch below is a workaround fix for that.
 
...
 --- a/ip/iproute.c
 +++ b/ip/iproute.c
 @@ -510,16 +510,16 @@ int print_route(const struct sockaddr_nl *who,
 struct nlmsghdr *n, void *arg)
 fprintf(fp, " %u",
 *(unsigned*)RTA_DATA(mxrta[i]));
 else {
 unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
 +   unsigned hz1 = hz / 1000;
...
 +   if (val >= hz1)
 +   fprintf(fp, " %ums", val/hz1);
...

Probably I'm missing something or my iproute sources are too old, but:
does this work with hz < 1000?

Regards,
Jarek P.

A short question about net git tree and patches

2007-12-20 Thread David Shwatrz

Hello,
I have a short question regarding the net git tree and patches:
I want to write and send patches against the most recent and
bleeding edge kernel networking code.
I see in:
http://kernel.org/pub/scm/linux/kernel/git/davem/?C=M;O=A
that there are 3 git trees which can be candidates for git-clone and
making patches against;
these are:
netdev-2.6.git, net-2.6.25.git and net-2.6.git.

It seems to me that net-2.6.git is the most suitable one to work against;
am I right ?
what is the difference, in short, between the three repositories?

Regards,
DS

Re: DM9000_IRQ_FLAGS

2007-12-20 Thread Ben Dooks

On Tue, Dec 11, 2007 at 08:18:23PM +0100, Daniel Mack wrote:
 Hi,
 
 on Toradex' Colibri, a PXA270 based board with a DM9000 ethernet
 controller, this driver won't work due to unsuitable DM9000_IRQ_FLAGS.
 If I understood the code behind request_irq() correctly, it's not
 recommended to register an IRQ without any of the IRQT_* flags set.
 
 Are there any concerns about applying the patch below?

Yes, that will possibly break all systems using level-triggered
interrupts.

Probably the best solution is to pass the data via the platform
information being fed to the device.

-- 
Ben ([EMAIL PROTECTED], http://www.fluff.org/)

  'a smiley only costs 4 bytes'

Re: DM9000_IRQ_FLAGS

2007-12-20 Thread Ben Dooks

On Wed, Dec 12, 2007 at 02:41:53PM +0100, Daniel Mack wrote:
 Hi Remy,
 
 On Tue, Dec 11, 2007 at 09:31:03PM +0100, Remy Bohmer wrote:
  This controller is also used on many other boards, e.g. the Atmel
  AT91sam9261-ek board. On that board an interrupt is generated on both
  the rising _and_ falling edge.
 
 However, request_irq() is called with IRQF_SHARED only, so neither
 IRQT_RISING nor IRQT_FALLING is set and the value defaults to
 IRQT_NOEDGE. How can you get IRQs?
 
  I can test tomorrow if this patch leaves this board intact, but
  should the board-specific code not add this flag if it is required?
  By modifying this driver you will interfere with the behavior of other
  boards, and I do not know if there are any level-triggered types in use.
 
 Actually, the best way to go is to let the platform resource flags 
 decide about that with something like
 
   resource->flags = IORESOURCE_IRQ | IRQT_RISING;
 
 but the dm9000 does not care about them at all. Changing that would also
 imply modifications to all board support code.

I did have a go at trying to get people to pass the information this
way, but it seems to have been ignored last time I sent it. I can dig out the
code that converts resource->flags to IRQT_ flags.
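
As a rough idea of what such a conversion could look like, here is a
userspace sketch. It is not the code referred to above: every constant
and name in it is hypothetical and only stands in for the real
IORESOURCE_* and IRQT_* values. The driver takes the trigger type from
the platform resource and falls back to a default when the board
support code did not specify one.

```c
#include <assert.h>

/* Hypothetical flag layout, for illustration only. */
#define DEMO_RES_IRQ		0x0400u	/* resource describes an IRQ */
#define DEMO_TRIG_RISING	0x0001u
#define DEMO_TRIG_FALLING	0x0002u
#define DEMO_TRIG_LEVEL_HIGH	0x0004u
#define DEMO_TRIG_MASK		0x000fu

/*
 * Pick the trigger type from the resource flags; use `fallback`
 * when the board code left the trigger bits empty.
 */
static unsigned int demo_irq_trigger(unsigned int res_flags,
				     unsigned int fallback)
{
	unsigned int trig;

	assert(res_flags & DEMO_RES_IRQ);
	trig = res_flags & DEMO_TRIG_MASK;
	return trig ? trig : fallback;
}
```

Boards that say nothing keep the driver default, so existing
level-triggered users would not change behaviour.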

-- 
Ben ([EMAIL PROTECTED], http://www.fluff.org/)

  'a smiley only costs 4 bytes'

[PATCH net-2.6.25 (resend) 1/3] Uninline the __inet_hash function

2007-12-20 Thread Pavel Emelyanov

This one is used in quite a few places in the networking code and
seems too big to be inline.

After the patch net/ipv4/built-in.o loses ~650 bytes:
add/remove: 2/0 grow/shrink: 0/5 up/down: 461/-1114 (-653)
function                      old     new   delta
__inet_hash_nolisten            -     282    +282
__inet_hash                     -     179    +179
tcp_sacktag_write_queue      2255    2254      -1
__inet_lookup_listener        284     274     -10
tcp_v4_syn_recv_sock          755     493    -262
tcp_v4_hash                   389      35    -354
inet_hash_connect            1086     599    -487

This version addresses the issue pointed out by Eric: while
it was inline, this function was optimized by gcc
with respect to the 'listen_possible' argument.

(Patches 2 and 3 in this series are still applied after this)
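
The idea can be sketched in userspace as follows. While the function is
inline, a caller passing a constant 0 for 'listen_possible' lets the
compiler drop the listening branch; once it is out of line that is lost,
so the no-listen case gets its own entry point. All names here are
hypothetical stand-ins, not the kernel structures.

```c
#include <assert.h>

enum { DEMO_LISTEN = 10 };	/* hypothetical state value */

struct demo_sock {
	int state;
	int in_listen_table;		/* hashed into the listening table */
	int in_established_table;	/* hashed into the established table */
};

/* Out-of-line fast path: never looks at the socket state. */
void demo_hash_nolisten(struct demo_sock *sk)
{
	assert(sk);
	sk->in_established_table = 1;
}

/* Out-of-line general path: defers to the fast path unless listening. */
void demo_hash(struct demo_sock *sk)
{
	assert(sk);
	if (sk->state != DEMO_LISTEN) {
		demo_hash_nolisten(sk);
		return;
	}
	sk->in_listen_table = 1;
}
```

Callers that used to pass 0 call the no-listen variant directly and
skip the state test altogether.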

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index fef4442..65ddb25 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -264,37 +264,14 @@ static inline void inet_listen_unlock(struct inet_hashinfo *hashinfo)
 	wake_up(&hashinfo->lhash_wait);
 }
 
-static inline void __inet_hash(struct inet_hashinfo *hashinfo,
-			       struct sock *sk, const int listen_possible)
-{
-	struct hlist_head *list;
-	rwlock_t *lock;
-
-	BUG_TRAP(sk_unhashed(sk));
-	if (listen_possible && sk->sk_state == TCP_LISTEN) {
-		list = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)];
-		lock = &hashinfo->lhash_lock;
-		inet_listen_wlock(hashinfo);
-	} else {
-		struct inet_ehash_bucket *head;
-		sk->sk_hash = inet_sk_ehashfn(sk);
-		head = inet_ehash_bucket(hashinfo, sk->sk_hash);
-		list = &head->chain;
-		lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
-		write_lock(lock);
-	}
-	__sk_add_node(sk, list);
-	sock_prot_inc_use(sk->sk_prot);
-	write_unlock(lock);
-	if (listen_possible && sk->sk_state == TCP_LISTEN)
-		wake_up(&hashinfo->lhash_wait);
-}
+extern void __inet_hash(struct inet_hashinfo *hashinfo, struct sock *sk);
+extern void __inet_hash_nolisten(struct inet_hashinfo *hinfo, struct sock *sk);
 
 static inline void inet_hash(struct inet_hashinfo *hashinfo, struct sock *sk)
 {
 	if (sk->sk_state != TCP_CLOSE) {
 		local_bh_disable();
-		__inet_hash(hashinfo, sk, 1);
+		__inet_hash(hashinfo, sk);
 		local_bh_enable();
 	}
 }
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 02fc91c..f450df2 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -408,7 +408,7 @@ struct sock *dccp_v4_request_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	dccp_sync_mss(newsk, dst_mtu(dst));
 
-	__inet_hash(&dccp_hashinfo, newsk, 0);
+	__inet_hash_nolisten(&dccp_hashinfo, newsk);
 	__inet_inherit_port(&dccp_hashinfo, sk, newsk);
 
 	return newsk;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index b07e2d3..2e5814a 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -305,6 +305,48 @@ static inline u32 inet_sk_port_offset(const struct sock *sk)
 					  inet->dport);
 }
 
+void __inet_hash_nolisten(struct inet_hashinfo *hashinfo, struct sock *sk)
+{
+	struct hlist_head *list;
+	rwlock_t *lock;
+	struct inet_ehash_bucket *head;
+
+	BUG_TRAP(sk_unhashed(sk));
+
+	sk->sk_hash = inet_sk_ehashfn(sk);
+	head = inet_ehash_bucket(hashinfo, sk->sk_hash);
+	list = &head->chain;
+	lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
+
+	write_lock(lock);
+	__sk_add_node(sk, list);
+	sock_prot_inc_use(sk->sk_prot);
+	write_unlock(lock);
+}
+EXPORT_SYMBOL_GPL(__inet_hash_nolisten);
+
+void __inet_hash(struct inet_hashinfo *hashinfo, struct sock *sk)
+{
+	struct hlist_head *list;
+	rwlock_t *lock;
+
+	if (sk->sk_state != TCP_LISTEN) {
+		__inet_hash_nolisten(hashinfo, sk);
+		return;
+	}
+
+	BUG_TRAP(sk_unhashed(sk));
+	list = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)];
+	lock = &hashinfo->lhash_lock;
+
+	inet_listen_wlock(hashinfo);
+	__sk_add_node(sk, list);
+	sock_prot_inc_use(sk->sk_prot);
+	write_unlock(lock);
+	wake_up(&hashinfo->lhash_wait);
+}
+EXPORT_SYMBOL_GPL(__inet_hash);
+
 /*
  * Bind a port for a connect operation and hash it.
  */
@@ -372,7 +414,7 @@ ok:
 		inet_bind_hash(sk, tb, port);
 		if (sk_unhashed(sk)) {
 			inet_sk(sk)->sport = htons(port);
-			__inet_hash(hinfo, sk, 0);
+

[PATCH net-2.6.25][NEIGH] Make neigh_add_timer symmetrical to neigh_del_timer

2007-12-20 Thread Pavel Emelyanov

The neigh_del_timer() looks sane - it removes the timer and
(conditionally) puts the neighbor. I expected neigh_add_timer()
to be symmetrical to the del one - i.e. to hold the neighbor
and arm the timer - but it turned out that it does not.

I think that making them symmetrical makes the code
more readable.
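
A small userspace sketch of the symmetry in question: the add side takes
the reference that the pending timer owns, and the del side drops it only
if the timer was actually pending. The struct and helper names are
hypothetical, and a boolean stands in for the kernel timer.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_neigh {
	int refcnt;		/* reference count */
	bool timer_armed;	/* stands in for a pending kernel timer */
};

static void demo_hold(struct demo_neigh *n)
{
	n->refcnt++;
}

static void demo_put(struct demo_neigh *n)
{
	assert(n->refcnt > 0);
	n->refcnt--;
}

/* Take the reference the timer will own, then arm the timer. */
static void demo_add_timer(struct demo_neigh *n)
{
	demo_hold(n);
	n->timer_armed = true;
}

/* Disarm the timer and, only if it was pending, drop its reference. */
static bool demo_del_timer(struct demo_neigh *n)
{
	if (!n->timer_armed)
		return false;
	n->timer_armed = false;
	demo_put(n);
	return true;
}
```

With the hold inside the add helper, callers no longer have to pair
every timer arm with a separate hold call.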

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---


diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 4b6dd1e..9a283fc 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -165,6 +165,16 @@ static int neigh_forced_gc(struct neigh_table *tbl)
 	return shrunk;
 }
 
+static void neigh_add_timer(struct neighbour *n, unsigned long when)
+{
+	neigh_hold(n);
+	if (unlikely(mod_timer(&n->timer, when))) {
+		printk("NEIGH: BUG, double timer add, state is %x\n",
+		       n->nud_state);
+		dump_stack();
+	}
+}
+
 static int neigh_del_timer(struct neighbour *n)
 {
 	if ((n->nud_state & NUD_IN_TIMER) &&
@@ -716,15 +726,6 @@ static __inline__ int neigh_max_probes(struct neighbour *n)
 		p->ucast_probes + p->app_probes + p->mcast_probes);
 }
 
-static inline void neigh_add_timer(struct neighbour *n, unsigned long when)
-{
-	if (unlikely(mod_timer(&n->timer, when))) {
-		printk("NEIGH: BUG, double timer add, state is %x\n",
-		       n->nud_state);
-		dump_stack();
-	}
-}
-
 /* Called when a timer expires for a neighbour entry. */
 
 static void neigh_timer_handler(unsigned long arg)
@@ -856,7 +857,6 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 			atomic_set(&neigh->probes, neigh->parms->ucast_probes);
 			neigh->nud_state = NUD_INCOMPLETE;
 			neigh->updated = jiffies;
-			neigh_hold(neigh);
 			neigh_add_timer(neigh, now + 1);
 		} else {
 			neigh->nud_state = NUD_FAILED;
@@ -869,7 +869,6 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 		}
 	} else if (neigh->nud_state & NUD_STALE) {
 		NEIGH_PRINTK2("neigh %p is delayed.\n", neigh);
-		neigh_hold(neigh);
 		neigh->nud_state = NUD_DELAY;
 		neigh->updated = jiffies;
 		neigh_add_timer(neigh,
@@ -1013,13 +1012,11 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
 
 	if (new != old) {
 		neigh_del_timer(neigh);
-		if (new & NUD_IN_TIMER) {
-			neigh_hold(neigh);
+		if (new & NUD_IN_TIMER)
 			neigh_add_timer(neigh, (jiffies +
 						((new & NUD_REACHABLE) ?
 						 neigh->parms->reachable_time :
 						 0)));
-		}
 		neigh->nud_state = new;
 	}
 
 

Re: [RFC] net: napi fix

2007-12-20 Thread Robert Olsson


David Miller writes:

   Is the netif_running() check even required?
  
  No, it is not.
  
  When a device is brought down, one of the first things
  that happens is that we wait for all pending NAPI polls
  to complete, then block any new polls from starting.

 Hello!

 Yes but the reason was not to wait for all pending polls to
 complete so a server/router could be rebooted even under high-
 load and DOS. We've experienced some nasty problems with this.

 Cheers.
--ro

Re: A short question about net git tree and patches

2007-12-20 Thread Sam Ravnborg

On Thu, Dec 20, 2007 at 11:20:26AM +0200, David Shwatrz wrote:
 Hello,
 I have a short question regarding the net git tree and patches:
 I want to write and send patches against the most recent and
 bleeding edge kernel networking code.
 I see in:
 http://kernel.org/pub/scm/linux/kernel/git/davem/?C=M;O=A
 that there are 3 git trees which can be candidates for git-clone and
 making patches against;
 these are:
 netdev-2.6.git, net-2.6.25.git and net-2.6.git.
 
 It seems to me that net-2.6.git is the most suitable one to work against;
 am I right ?
 what is the difference, in short, between the three repositories?

IIRC the usage is:
netdev-2.6.git   = old stuff, 4 weeks since last update. Not in use
net-2.6.25.git   = patches for current kernel release (only fixes)
net-2.6.git  = patches for next kernel release and planned to be
applied in next merge window

So net-2.6.git is the correct choice for bleeding edge.

Sam

Re: DM9000_IRQ_FLAGS

2007-12-20 Thread Remy Bohmer

Hello Ben,

 
  Actually, the best way to go is to let the platform resources flags
  decide about that with something like
 
    resource->flags = IORESOURCE_IRQ | IRQT_RISING;
 
  but the dm9000 does not care about them at all. Changing that would also
  imply modifications to all board support code.

 I did have a go at trying to get people to pass the information this
 way, but it seems to have been ignored last time I sent it. I can dig out the
 code that converts resource->flags to IRQT_ flags.

I thought this issue was already solved by using set_irq_type() in the
BSP, just like all the other boards do...

Kind Regards,

Remy

Re: A short question about net git tree and patches

2007-12-20 Thread Ilpo Järvinen

On Thu, 20 Dec 2007, Sam Ravnborg wrote:

 On Thu, Dec 20, 2007 at 11:20:26AM +0200, David Shwatrz wrote:
  Hello,
  I have a short question regarding the net git tree and patches:
  I want to write and send patches against the most recent and
  bleeding edge kernel networking code.
  I see in:
  http://kernel.org/pub/scm/linux/kernel/git/davem/?C=M;O=A
  that there are 3 git trees which can be candidates for git-clone and
  making patches against;
  these are:
  netdev-2.6.git, net-2.6.25.git and net-2.6.git.
  
  It seems to me that net-2.6.git is the most suitable one to work against;
  am I right ?
  what is the difference, in short, between the three repositories?
 
 IIRC the usage is:
 netdev-2.6.git   = old stuff, 4 weeks since last update. Not in use
 net-2.6.25.git   = patches for current kernel release (only fixes)

Nope, we don't even have 2.6.24 yet. :-)

 net-2.6.git  = patches for next kernel release and planned to be
 applied in next merge window

 So net-2.6.git is the correct choice for bleeding edge.

net-2.6 is for fixes only, net-2.6.25 will become net-2.6 once 2.6.24
gets released, eventually net-2.6.26 gets opened (not necessarily at the 
same time as the merge window is closed but a bit later) and the cycle 
repeats with similar transitions when 2.6.25 gets released.

The netdev trees are for network drivers and are usually managed by
Jeff Garzik, but there was recently an arrangement between Dave and 
Jeff due to vacations, so that netdev was also managed by Dave.

-- 
 i.

Re: Badness at net/core/dev.c:2199

2007-12-20 Thread Meelis Roos

 I already sent out a correct patch last week. It should pre-increment.

Any hope getting it upstream?

-- 
Meelis Roos ([EMAIL PROTECTED])  http://www.cs.ut.ee/~mroos/

Re: [RFC] net: napi fix

2007-12-20 Thread David Miller

From: Robert Olsson [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 10:52:17 +0100

  Yes but the reason was not to wait for all pending polls to
  complete so a server/router could be rebooted even under high-
  load and DOS. We've experienced some nasty problems with this.

I know, see the rest of the thread where I agree that
we need to deal with this somehow.

The device is marked down first, and somehow we need to
tip off of that to break out of the NAPI loop.  How to do
that is what hasn't been resolved yet.

Re: A short question about net git tree and patches

2007-12-20 Thread David Miller

From: Sam Ravnborg [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 10:55:10 +0100

 net-2.6.25.git   = patches for current kernel release (only fixes)
 net-2.6.git  = patches for next kernel release and planned to be
 applied in next merge window
 
 So net-2.6.git is the correct choice for bleeding edge.

You reversed them, net-2.6.25.git is for bleeding edge
stuff, net-2.6.git is for bug fixes only.

Re: [PATCH 1/4] [UDP]: fix send buffer check

2007-12-20 Thread David Miller

From: Hideo AOKI [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 21:38:03 -0500

 This patch introduces sndbuf size check before memory allocation for
 send buffer.

 signed-off-by: Satoshi Oshima [EMAIL PROTECTED]
 signed-off-by: Hideo Aoki [EMAIL PROTECTED]
 ...
 diff -pruN net-2.6/net/ipv4/ip_output.c 
 net-2.6-udp-take11a1-p1/net/ipv4/ip_output.c
 --- net-2.6/net/ipv4/ip_output.c  2007-12-11 10:54:55.0 -0500
 +++ net-2.6-udp-take11a1-p1/net/ipv4/ip_output.c  2007-12-17 
 14:42:31.0 -0500
 @@ -1004,6 +1004,11 @@ alloc_new_skb:
   frag = &skb_shinfo(skb)->frags[i];
   }
   } else if (i < MAX_SKB_FRAGS) {
 + if (atomic_read(&sk->sk_wmem_alloc) + PAGE_SIZE
 + > 2 * sk->sk_sndbuf) {
 + err = -ENOBUFS;
 + goto error;
 + }
   if (copy > PAGE_SIZE)
   copy = PAGE_SIZE;
   page = alloc_pages(sk->sk_allocation, 0);

If we are going to do this, we need to add the same check to
skb_append_datato_frags() which is invoked via ip_ufo_append_data().

We also have to be very careful in this area.  One problem we had a
long time ago was that we would socket account when fragmenting an
outgoing frame.  This was bogus because even if the socket had enough
space for one full sized frame, the packet send would fail because it
could not fit the space for both the original frame and the
fragmented copy of it.

This situation was cured by simply not enforcing accounting for the
fragmented copy.  It is valid because after we fragment, we keep
the fragmented copy but free the original.

This doesn't apply directly to this specific patch, but it is
something to keep in mind when doing these changes.
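
For reference, the check being discussed boils down to the following,
shown here as a standalone userspace sketch. The helper name and the
fixed page size are assumptions made for the example; the comparison is
read as "allocation plus one page exceeds twice the send buffer limit".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size for the example */

/*
 * Return 0 if one more page may be charged to the send buffer, or
 * -ENOBUFS if that would push the allocation past twice the limit.
 */
static int demo_sndbuf_check(size_t wmem_alloc, size_t sndbuf)
{
	assert(sndbuf <= SIZE_MAX / 2);	/* keep 2 * sndbuf from wrapping */
	if (wmem_alloc + DEMO_PAGE_SIZE > 2 * sndbuf)
		return -ENOBUFS;
	return 0;
}
```

Any other path that allocates pages for the same socket would need the
same test to keep the accounting consistent.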

Re: TSO trimming question

2007-12-20 Thread Ilpo Järvinen

On Wed, 19 Dec 2007, David Miller wrote:

 From: Ilpo_Järvinen [EMAIL PROTECTED]
 Date: Wed, 19 Dec 2007 23:46:33 +0200 (EET)

  I'm not fully sure what's purpose of this code in tcp_write_xmit:

  if (skb->len < limit) {
          unsigned int trim = skb->len % mss_now;

          if (trim)
                  limit = skb->len - trim;
  }

  Is it used to make sure we send only multiples of mss_now here and leave 
  the left-over into another skb?

Yeah, I now understand that this part is correct. While trying to figure 
this out I somehow got the impression that it ends up being dead code, 
but that was a wrong thought on my side. However, it caught my 
attention, and after some thinking I'd say there's more to handle here 
(covered by the second question).

Also note that the patch I sent earlier is not right either; it needs some 
refining to do the right thing.

  Or does it try to make sure that
  tso_fragment result honors multiple of mss_now boundaries when snd_wnd
  is the limiting factor? For the latter, IMHO this would be necessary:

  if (skb->len > limit)
          limit -= limit % mss_now;

 The purpose of the test is to make sure we process tail sub-mss chunks
 correctly wrt. Nagle, which most closely matches the first purpose
 you've listed.

 So I think the calculation really does belong where it is.

 Because of the way that the sendmsg() super-skb formation logic
 works, we always will tack on more data and grow the tail
 SKB before creating a new one.  So any sub-mss chunk at the
 end of a TSO frame really is at the end of the write queue
 and really should get nagle processing.

Yes, I now agree this is fully correct for this task.
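
To make the behaviour concrete, here is the trimming logic from the
quoted snippet as a standalone userspace function (the function name is
hypothetical; the parameters mirror the variables in the snippet):

```c
#include <assert.h>

/*
 * If the whole skb fits under `limit`, send only whole multiples of
 * mss_now and leave the sub-MSS tail for Nagle processing. When the
 * limit is smaller than the skb, it is returned unchanged.
 */
static unsigned int demo_tso_limit(unsigned int skb_len,
				   unsigned int limit,
				   unsigned int mss_now)
{
	assert(mss_now != 0);
	if (skb_len < limit) {
		unsigned int trim = skb_len % mss_now;

		if (trim)
			limit = skb_len - trim;
	}
	return limit;
}
```

With mss_now = 1460, a 3500 byte skb under a 10000 byte limit is cut to
2920 bytes (two full segments), while a 5000 byte skb under a 4000 byte
limit keeps the limit of 4000, which is not a multiple of the MSS.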

 Actually, there is an exception, which is when we run out of
 skb_frag_list slots.  In that case we'll potentially have breaks at
 odd boundaries in the middle of the queue.  But this can only happen
 in exceptional cases (user does tons of 1-byte sendfile()'s over
 random non-consecutive locations of a file) or outright bugs
 (MAX_SKB_FRAGS is defined incorrectly, for example) and thus this
 situation is not worth coding for.

That's not the only case. IMHO, if there's an odd boundary due to the 
snd_una + snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't 
consider it odd but break the skb at an arbitrary point, resulting in
two small segments on the network. What's worse, the later skb 
resulting from the first split matches the skb->len < limit check as well, 
causing an unnecessary small skb to be created for Nagle purposes too. 
Solving it fully requires some thought in case mss_now != mss_cache, 
even if non-odd boundaries are honored in the middle of the skb.

Though whether we get there depends on what tcp_tso_should_defer() 
decided. Hmm, there seems to be an unrelated bug in it as well :-/. A 
patch is below. Please consider that enabling TSO deferring may 
have some unpleasant effects on TCP dynamics; given that, I don't consider 
this mandatory for stable, to avoid breaking things, and besides, things 
have been working quite well without it too... Only compile tested.

-- 
 i.

--
[PATCH] [TCP]: Fix TSO deferring

I'd say that most of what tcp_tso_should_defer had in between
there was dead code because of this.
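
A userspace sketch of the two predicates, to show why the old test
short-circuits the deferral. It assumes the deferral stamp is stored as
the tick count shifted left by one with the low bit set, which is what
the shifts in the condition suggest; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Old form: can only fire when no deferral has been recorded. */
static bool demo_send_now_buggy(unsigned long now, unsigned long tso_deferred)
{
	return !tso_deferred &&
	       ((now << 1) >> 1) - (tso_deferred >> 1) > 1;
}

/* Fixed form: fires once a recorded deferral is more than one tick old. */
static bool demo_send_now_fixed(unsigned long now, unsigned long tso_deferred)
{
	return tso_deferred &&
	       ((now << 1) >> 1) - (tso_deferred >> 1) > 1;
}
```

With no deferral recorded, the old form compares the current tick count
itself against 1 and so sends immediately, which means a deferral never
gets started.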

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_output.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8dafda9..693b9f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1217,7 +1217,8 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 		goto send_now;
 
 	/* Defer for less than two clock ticks. */
-	if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1)
+	if (tp->tso_deferred &&
+	    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
 		goto send_now;
 
 	in_flight = tcp_packets_in_flight(tp);
-- 
1.5.0.6

Re: [PATCH 2/4] [CORE]: datagram: basic memory accounting functions

2007-12-20 Thread David Miller

From: Hideo AOKI [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 21:38:17 -0500

Why do we need separate stream and datagram accounting functions?

Is it just to facilitate things like the following test?

 +static inline int sk_wmem_schedule(struct sock *sk, int size)
 +{
 + if (sk->sk_type == SOCK_DGRAM)
 + return sk_datagram_wmem_schedule(sk, size);
 + else
 + return 1;
 +}

If so, this can be greatly improved.

All of these other functions are identical copies of the stream
counterparts, they should all be consolidated.

I still see a lot of special casing, instead of large pieces of common
code.

There should be one core set of functions that handle the memory
accounting, regardless of socket type.  Maybe there is one spot where
something like sk->prot->doing_memory_accounting is tested, but that's
it.

I am still very dissatisfied with these changes.  They are
full of special cases, because they mix generic facilities
(the socket memory accounting) with an unrelated issue
(we only support memory accounting for datagram sockets which
are actually UDP).

Also, the memory accounting is done at different parts in
the socket code paths for stream vs. datagram.  This is why
everything is inconsistent and a mess.

What's funny is that I absolutely do not care if these changes are
perfect and pass every possible regression test.  Rather, I'm more
concerned that this thing is designed correctly and will allow us to
have one core set of memory accounting functions regardless of socket
type.  As it is coded now, we have two sets of code paths to
fix, two ways of doing the socket accounting, and therefore twice
as much code to maintain and debug.

The whole thing needs to be consistent and without special cases.

The "protocol supports memory accounting" test can be performed, as
you did in this patch, by simply checking if
sk->sk_prot->memory_allocated is non-NULL.
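
A rough userspace sketch of that shape: one accounting helper for every
socket type, with the only special case being the "does this protocol
account memory" test. The structs, the limit field and the function name
are hypothetical and much simpler than the real socket code.

```c
#include <assert.h>
#include <stddef.h>

struct demo_acct_proto {
	long *memory_allocated;	/* NULL: protocol does no accounting */
	long limit;		/* hypothetical cap, in pages */
};

struct demo_acct_sock {
	const struct demo_acct_proto *prot;
};

/* One helper for all socket types; returns 1 if the charge is allowed. */
static int demo_mem_schedule(struct demo_acct_sock *sk, long pages)
{
	const struct demo_acct_proto *prot = sk->prot;

	assert(pages >= 0);
	if (!prot->memory_allocated)
		return 1;		/* nothing to account */
	if (*prot->memory_allocated + pages > prot->limit)
		return 0;		/* would exceed the cap */
	*prot->memory_allocated += pages;
	return 1;
}
```

Stream and datagram callers would go through the same function, so
there is a single code path to fix and debug.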

Re: dn_neigh_table vs pneigh_lookup/pneigh_delete

2007-12-20 Thread Steven Whitehouse

Hi,

On Wed, Dec 19, 2007 at 05:11:34PM +0300, Pavel Emelyanov wrote:
 Hi
 
 The pneigh_lookup/delete silently assume that the 
 key_len of the table is at least 4 bytes. Look:
 
  u32 hash_val = *(u32 *)(pkey + key_len - 4);
 
 The hash_val for the proxy neighbor entry is four last bytes
 from the pkey.
 
 But the dn_neigh_table's key_len is sizeof(__le16), that is 2,
 so setting (via netlink) the proxy neighbor entry for decnet 
 will cause this entry to reside in an arbitrary hash chain.
 
 Is this too bad for decnet?
 
 Thanks,
 Pavel

The pneigh code is never used in DECnet; we only use the normal
part of the neigh code, where the hash function was changed so that
it can be defined for each protocol (and thus doesn't suffer from
this problem).

Steve.
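
To make the key_len concern in the quoted question concrete, here is a
small user-space sketch. hash_read_offset() only reports where the
4-byte read in the quoted line would start relative to the key;
hash_val_safe() is a hypothetical bounds-safe variant and is not
something proposed in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offset, relative to the start of the key, at which the 4-byte read
 * of "pkey + key_len - 4" begins. A negative result means the read
 * starts before the key. */
static int hash_read_offset(int key_len)
{
	return key_len - 4;
}

/* Hypothetical bounds-safe variant: copy at most the last four bytes
 * of the key into a zeroed word, never touching memory outside it. */
static uint32_t hash_val_safe(const uint8_t *pkey, int key_len)
{
	uint32_t v = 0;
	int n = key_len < 4 ? key_len : 4;

	assert(key_len > 0);
	memcpy(&v, pkey + key_len - n, (size_t)n);
	return v;
}
```

For a 2-byte key, hash_read_offset(2) is -2, so the original expression
would fold two bytes of whatever precedes the key into the hash.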

Re: [PATCH 4/4] [UDP]: memory accounting in IPv4

2007-12-20 Thread David Miller

From: Hideo AOKI [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 21:38:47 -0500

 This patch adds UDP memory usage accounting in IPv4.

 Send buffer accounting is performed by IP layer, because skbuff is
 allocated in the layer.

 Receive buffer is charged, when the buffer successfully received.
 Destructor of the buffer does uncharging and reclaiming, when the
 buffer is freed. To set destructor at proper place, we use
 __udp_queue_rcv_skb() instead of sock_queue_rcv_skb(). To maintain
 consistency of memory accounting, socket lock is used to free receive
 buffer in udp_recvmsg().

 New packet will be added to backlog when the socket is used by user.

 Cc: Satoshi Oshima [EMAIL PROTECTED]
 signed-off-by: Takahiro Yasui [EMAIL PROTECTED]
 signed-off-by: Masami Hiramatsu [EMAIL PROTECTED]
 signed-off-by: Hideo Aoki [EMAIL PROTECTED]

We can't accept these changes, even once the other issues
are fixed, until IPV6 is supported as well.

It's pointless to support proper UDP memory accounting only
in IPV4 and not in IPV6 as well.

Re: TSO trimming question

2007-12-20 Thread David Miller

From: Ilpo_Järvinen [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)

 [PATCH] [TCP]: Fix TSO deferring
 
 I'd say that most of what tcp_tso_should_defer had in between
 there was dead code because of this.
 
 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]

Yikes!

John, we've been living a lie for more than a year. :-/

On the bright side this explains a lot of small TSO frames I've been
seeing in traces over the past year but never got a chance to
investigate.

 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 8dafda9..693b9f6 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -1217,7 +1217,8 @@ static int tcp_tso_should_defer(struct sock *sk, struct 
 sk_buff *skb)
   goto send_now;
  
   /* Defer for less than two clock ticks. */
 - if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1)
 + if (tp->tso_deferred &&
 +     ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
   goto send_now;
  
   in_flight = tcp_packets_in_flight(tp);
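
The effect of the one-character change can be seen in a small
user-space sketch. The function names are hypothetical, jiffies is
passed in as a plain 32-bit value, and the stamp encoding (0 means "not
deferring", otherwise jiffies shifted left by one with the low bit set)
is an assumption about how tso_deferred is written elsewhere in the
function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stamp encoding: the low bit keeps the stamp nonzero even
 * when jiffies itself is 0. */
static uint32_t defer_stamp(uint32_t jiffies)
{
	uint32_t stamp = (jiffies << 1) | 1;

	assert(stamp != 0);
	return stamp;
}

/* Old test: can only fire when no deferral is in progress, and then
 * fires for almost any jiffies value. */
static bool send_now_old(uint32_t jiffies, uint32_t tso_deferred)
{
	return !tso_deferred &&
	       ((jiffies << 1) >> 1) - (tso_deferred >> 1) > 1;
}

/* Fixed test: fires only once a deferral has lasted more than one tick. */
static bool send_now_fixed(uint32_t jiffies, uint32_t tso_deferred)
{
	return tso_deferred &&
	       ((jiffies << 1) >> 1) - (tso_deferred >> 1) > 1;
}
```

With tso_deferred == 0 the old test compares jiffies against zero, so
it returns true and the rest of the deferral logic is never reached,
which matches the "dead code" observation in the patch description.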

Re: TSO trimming question

2007-12-20 Thread David Miller

From: Ilpo_Järvinen [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)

 That's not the only case, IMHO if there's odd boundary due to
 snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't
 consider it as odd but break the skb at arbitrary point resulting
 two small segments to the network, and what's worse, when the later skb
 resulting from the first split is matching skb->len < limit check as well
 causing an unnecessary small skb to be created for nagle purpose too,
 solving it fully requires some thought in case the mss_now != mss_cache
 even if non-odd boundaries are honored in the middle of skb.

In the most ideal sense, tcp_window_allows() should probably
be changed to only return MSS multiples.

Unfortunately this would add an expensive modulo operation;
however, I think it would eliminate this problem case.
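
A rough sketch of what "only return MSS multiples" could look like.
window_allows_sketch() and its parameters are hypothetical and this is
not the kernel's tcp_window_allows(); it assumes the congestion-window
budget is already a whole number of segments, so the modulo is only
paid when the receiver window is the limiting factor.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that may be sent, rounded down to whole segments.
 * window:   bytes left in the receiver window
 * cwnd_len: bytes allowed by the congestion window, assumed to be a
 *           multiple of mss already
 * mss:      current segment size, must be nonzero */
static uint32_t window_allows_sketch(uint32_t window, uint32_t cwnd_len,
				     uint32_t mss)
{
	assert(mss > 0);

	if (cwnd_len <= window)
		return cwnd_len;

	/* Receiver-window limited: the only path that divides. */
	return window - window % mss;
}
```

A result of 0 means less than one full segment fits; what the caller
should do then is left open here.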

Re: [PATCH 08/11] drivers/net/sunvnet.c: Use print_mac

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 14:34:09 -0800

 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied to net-2.6.25

Re: [PATCH 09/11] drivers/net/tg3.c: Use print_mac

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 14:34:10 -0800

 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied to net-2.6.25

Re: [PATCH 05/11] drivers/net/niu.c: Use print_mac

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 14:34:06 -0800

 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied to net-2.6.25

Re: [PATCH 3/3] [UDP6]: Counter increment on BH mode

2007-12-20 Thread David Miller

From: Herbert Xu [EMAIL PROTECTED]
Date: Sat, 15 Dec 2007 21:58:52 +0800

 [SNMP]: Fix SNMP counters with PREEMPT

 The SNMP macros use raw_smp_processor_id() in process context
 which is illegal because the process may be preempted and then
 migrated to another CPU.

 This patch makes it use get_cpu/put_cpu to disable preemption.

 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied to net-2.6.25, thanks.

Re: [SNMP]: Fix SNMP counters with PREEMPT

2007-12-20 Thread David Miller

From: Herbert Xu [EMAIL PROTECTED]
Date: Sun, 16 Dec 2007 10:30:25 +0800

 On Sat, Dec 15, 2007 at 06:03:19PM +0100, Eric Dumazet wrote:

  How come you change SNMP_INC_STATS_USER() but not SNMP_INC_STATS() ?

 Heh, my brain must have blocked me from seeing it because it's
 too hard :)

 Let's fix it the stupid way first and I'll do a local_t conversion
 later.

 [SNMP]: Fix SNMP counters with PREEMPT

 The SNMP macros use raw_smp_processor_id() in process context
 which is illegal because the process may be preempted and then
 migrated to another CPU.

 This patch makes it use get_cpu/put_cpu to disable preemption.

 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

I just noticed this and replaced the other SNMP fix patch
with this one.
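
The get_cpu/put_cpu bracket from the patch description can be mimicked
in user space to show the shape of the fix. Everything below is a mock:
the *_sketch names, the preemption counter and the per-CPU array are
hypothetical stand-ins, not the kernel's SNMP macros.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4

static int preempt_count_sketch;	/* > 0 means "cannot migrate" */
static int current_cpu_sketch;		/* CPU the caller is running on */
static unsigned long mib_sketch[NR_CPUS_SKETCH];

/* Mock of get_cpu(): pin the caller, then report its CPU. */
static int get_cpu_sketch(void)
{
	preempt_count_sketch++;
	return current_cpu_sketch;
}

/* Mock of put_cpu(): allow migration again. */
static void put_cpu_sketch(void)
{
	preempt_count_sketch--;
}

/* Process-context increment: the CPU id is read and used entirely
 * inside the pinned section, so it cannot go stale in between. */
static void snmp_inc_stats_user_sketch(void)
{
	int cpu = get_cpu_sketch();

	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	mib_sketch[cpu]++;
	put_cpu_sketch();
}
```

The problem being fixed is the gap between reading the CPU id and
touching that CPU's counter; the bracket closes that gap.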

Re: [PATCH] One more XFRM audit fix

2007-12-20 Thread Paul Moore

On Thursday 20 December 2007 3:00:09 am David Miller wrote:
 From: Paul Moore [EMAIL PROTECTED]
 Date: Wed, 19 Dec 2007 14:29:31 -0500

  The following patch is backed against David's net-2.6 tree and is pretty
  trivial.  I know we're late in the 2.6.24 cycle but I think this is worth
  merging, if you guys don't feel that way let me know and I'll resubmit it
  for 2.6.25.

 Where is that patch?  Or do you mean the fix you emailed
 separately today (which I will apply, thanks)?

Yes, it was the patch you applied, "XFRM: Audit function arguments
misordered".  I was using stacked-git to post the patch and it apparently
doesn't annotate the cover email's subject line with "0/1" when you only send
one patch.

Sorry about that.

  As a side note, I'm unable to actually test the patch because I can't get
  the kernel to compile (M=net/xfrm works just fine).  The problem I keep
  seeing is below:

  make[3]: *** No rule to make target \
   `/blah/kernels/net-2.6_xfrm-auid-secid-fix/include/linux/ticable.h', \
needed by \
   `/blah/kernels/net-2.6_xfrm-auid-secid-fix/usr/include/linux/ticable.h'.
  \ Stop.

 Remove ticable.h from include/linux/Kbuild

 This is already cured in Linus's tree.

Noted, thanks.

-- 
paul moore
linux security @ hp

Re: TSO trimming question

2007-12-20 Thread Ilpo Järvinen

On Thu, 20 Dec 2007, David Miller wrote:

 From: Ilpo_Järvinen [EMAIL PROTECTED]
 Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)

  That's not the only case, IMHO if there's odd boundary due to
  snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't
  consider it as odd but break the skb at arbitrary point resulting
  two small segments to the network, and what's worse, when the later skb
  resulting from the first split is matching skb->len < limit check as well
  causing an unnecessary small skb to be created for nagle purpose too,
  solving it fully requires some thought in case the mss_now != mss_cache
  even if non-odd boundaries are honored in the middle of skb.

 In the most ideal sense, tcp_window_allows() should probably
 be changed to only return MSS multiples.

That's what Herbert suggested already, I'll send a patch later
on... :-)

 Unfortunately this would add an expensive modulo operation,
 however I think it would eliminate this problem case.

Yes. Should we still call tcp_minshall_update() if split in the middle of 
wq results in smaller than MSS tail (occurs only if mss_now != mss_cache)?

-- 
 i.

neigh: timer & !nud_in_timer

2007-12-20 Thread John Sigler


Hello,

I noticed the following message in my kernel log.
kernel: neigh: timer & !nud_in_timer
(Might be due to a race condition.)

I'm running a UP Linux version 2.6.22.1-rt9
( http://rt.wiki.kernel.org/index.php )

The following /proc entries might be relevant.

/proc/sys/net/ipv4/conf/all/arp_accept
0
/proc/sys/net/ipv4/conf/all/arp_announce
2
/proc/sys/net/ipv4/conf/all/arp_filter
0
/proc/sys/net/ipv4/conf/all/arp_ignore
1

I also lowered the priority of softirq-timer/0 to 10 which means
it can be interrupted by other IRQ handlers.

Regards.

Re: A short question about net git tree and patches

2007-12-20 Thread Sam Ravnborg

On Thu, Dec 20, 2007 at 03:22:58AM -0800, David Miller wrote:
 From: Sam Ravnborg [EMAIL PROTECTED]
 Date: Thu, 20 Dec 2007 10:55:10 +0100
 
  net-2.6.25.git   = patches for current kernel release (only fixes)
  net-2.6.git  = patches for next kernel release and planned to be
  applied in next merge window
  
  So net-2.6.git is the correct choice for bleeding edge.
 
 You reversed them, net-2.6.25.git is for bleeding edge
 stuff, net-2.6.git is for bug fixes only.

Sorry - thanks for clarifying it.

Sam - who should refrain from thinking too much in current crappy 
condition

[PATCH] Re: Nested VLAN causes recursive locking error

2007-12-20 Thread Jarek Poplawski

On 19-12-2007 00:03, Chuck Ebbert wrote:
 From:
 https://bugzilla.redhat.com/show_bug.cgi?id=426164
 
 
 kernel version is 2.6.24-0.107.rc5.git3.fc9
 
 From boot log on serial console:
 (full log attached)
 
 Added VLAN with VID == 2 to IF -:eth0.1568:-
 
 =
 [ INFO: possible recursive locking detected ]
 2.6.24-0.107.rc5.git3.fc9 #1
 -
 ifconfig/15011 is trying to acquire lock:
  (vlan_netdev_xmit_lock_key){-+..}, at: [c05d9450] dev_mc_sync+0x1c/0x102
 
 but task is already holding lock:
  (vlan_netdev_xmit_lock_key){-+..}, at: [c05d51bd] 
 dev_set_rx_mode+0x14/0x3c
 
 other info that might help us debug this:
 2 locks held by ifconfig/15011:
  #0:  (rtnl_mutex){--..}, at: [c05de4f7] rtnl_lock+0xf/0x11
  #1:  (vlan_netdev_xmit_lock_key){-+..}, at: [c05d51bd] 
 dev_set_rx_mode+0x14/0x3c
...


Subject: [PATCH] nested VLAN: fix lockdep's recursive locking warning

Allow vlans nesting other vlans without lockdep's warnings (max. 8 levels).

Reported-by: Benny Amorsen
Tested-by: Benny Amorsen(?) NEEDS TESTING!

Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]

---

diff -Nurp linux-2.6.24-rc5-/net/8021q/vlan.c linux-2.6.24-rc5+/net/8021q/vlan.c
--- linux-2.6.24-rc5-/net/8021q/vlan.c  2007-12-17 13:29:19.0 +0100
+++ linux-2.6.24-rc5+/net/8021q/vlan.c  2007-12-20 14:21:02.0 +0100
@@ -307,12 +307,15 @@ int unregister_vlan_device(struct net_de
return ret;
 }
 
+#ifdef CONFIG_LOCKDEP
 /*
  * vlan network devices have devices nesting below it, and are a special
  * super class of normal network devices; split their locks off into a
  * separate class since they always nest.
  */
 static struct lock_class_key vlan_netdev_xmit_lock_key;
+static int subclass; /* vlan nesting vlan */
+#endif
 
 static const struct header_ops vlan_header_ops = {
.create  = vlan_dev_hard_header,
@@ -349,7 +352,14 @@ static int vlan_dev_init(struct net_devi
dev->hard_start_xmit = vlan_dev_hard_start_xmit;
}
 
-   lockdep_set_class(&dev->_xmit_lock, &vlan_netdev_xmit_lock_key);
+#ifdef CONFIG_LOCKDEP
+   if ((real_dev->priv_flags & IFF_802_1Q_VLAN) &&
+   subclass < MAX_LOCKDEP_SUBCLASSES - 1)
+   subclass++;
+
+   lockdep_set_class_and_subclass(&dev->_xmit_lock,
+   &vlan_netdev_xmit_lock_key, subclass);
+#endif
+#endif
return 0;
 }
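
The subclass bump in the hunk above reduces to a clamped increment. A
user-space sketch of just that step follows; the names are hypothetical
and the limit of 8 is taken from the "max. 8 levels" note in the patch
description rather than from a kernel header.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SUBCLASSES_SKETCH 8	/* assumed, per "max. 8 levels" */

/* Returns the subclass to use for a newly initialised vlan device.
 * The value only grows when the underlying device is itself a vlan,
 * and it saturates at the last valid subclass index. */
static int next_vlan_subclass(int subclass, bool real_dev_is_vlan)
{
	assert(subclass >= 0 && subclass < MAX_SUBCLASSES_SKETCH);

	if (real_dev_is_vlan && subclass < MAX_SUBCLASSES_SKETCH - 1)
		subclass++;
	return subclass;
}
```

Note that the patch keeps a single static counter, so the subclass
reflects how many nested vlans have been initialised so far rather
than the nesting depth of one particular device.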
 

[PATCH/.24] [NET] fs_enet: check for phydev existence in the ethtool handlers

2007-12-20 Thread Anton Vorontsov

Otherwise oops will happen if ethernet device has not been opened:

Unable to handle kernel paging request for data at address 0x014c
Faulting instruction address: 0xc016f7f0
Oops: Kernel access of bad area, sig: 11 [#1]
MPC85xx
NIP: c016f7f0 LR: c01722a0 CTR: 
REGS: c79ddc70 TRAP: 0300   Not tainted  (2.6.24-rc3-g820a386b)
MSR: 00029000 EE,ME  CR: 20004428  XER: 2000
DEAR: 014c, ESR: 
TASK = c789f5e0[999] 'snmpd' THREAD: c79dc000
GPR00: c01aceb8 c79ddd20 c789f5e0  c79ddd3c  c79ddd64 
GPR08:  c7845b60 c79dde3c c01ace80 20004422 200249fc 02a0 100da728
GPR16: 100c    20022078 0009 200220e0 bfc85558
GPR24: c79ddd3c   c02e0e70 c022fc64  c7845800 bfc85498
NIP [c016f7f0] phy_ethtool_gset+0x0/0x4c
LR [c01722a0] fs_get_settings+0x18/0x28
Call Trace:
[c79ddd20] [c79dde38] 0xc79dde38 (unreliable)
[c79ddd30] [c01aceb8] dev_ethtool+0x294/0x11ec
[c79dde30] [c01aaa44] dev_ioctl+0x454/0x6a8
[c79ddeb0] [c019b9d4] sock_ioctl+0x84/0x230
[c79dded0] [c007ded8] do_ioctl+0x34/0x8c
[c79ddee0] [c007dfbc] vfs_ioctl+0x8c/0x41c
[c79ddf10] [c007e38c] sys_ioctl+0x40/0x74
[c79ddf40] [c000d4c0] ret_from_syscall+0x0/0x3c
Instruction dump:
8163 800b0030 2f80 419e0010 7c0803a6 4e800021 7c691b78 80010014
7d234b78 38210010 7c0803a6 4e800020 8003014c 7c6b1b78 3860 90040004

Signed-off-by: Anton Vorontsov [EMAIL PROTECTED]
---
 drivers/net/fs_enet/fs_enet-main.c |   11 +--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c 
b/drivers/net/fs_enet/fs_enet-main.c
index f2a4d39..23fddc3 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -897,14 +897,21 @@ static void fs_get_regs(struct net_device *dev, struct 
ethtool_regs *regs,
 static int fs_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 {
struct fs_enet_private *fep = netdev_priv(dev);
+
+   if (!fep->phydev)
+   return -ENODEV;
+
return phy_ethtool_gset(fep->phydev, cmd);
 }
 
 static int fs_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 {
struct fs_enet_private *fep = netdev_priv(dev);
-   phy_ethtool_sset(fep->phydev, cmd);
-   return 0;
+
+   if (!fep->phydev)
+   return -ENODEV;
+
+   return phy_ethtool_sset(fep->phydev, cmd);
 }
 
 static int fs_nway_reset(struct net_device *dev)
-- 
1.5.2.2
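
The guard added by this patch, reduced to a user-space sketch. The
*_sketch types and functions are hypothetical stand-ins for the driver
structures; the point is only that a NULL phydev is reported as -ENODEV
instead of being dereferenced, and that the helper's return value is
passed through.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct phy_device_sketch {
	int speed;
};

struct fs_enet_private_sketch {
	struct phy_device_sketch *phydev;	/* NULL until opened */
};

/* Stand-in for the PHY helper: dereferences phy unconditionally. */
static int phy_gset_sketch(const struct phy_device_sketch *phy,
			   int *speed_out)
{
	*speed_out = phy->speed;
	return 0;
}

/* Handler with the guard: bail out before touching a missing PHY. */
static int fs_get_settings_sketch(const struct fs_enet_private_sketch *fep,
				  int *speed_out)
{
	assert(fep != NULL);

	if (!fep->phydev)
		return -ENODEV;

	return phy_gset_sketch(fep->phydev, speed_out);
}
```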

Re: TSO trimming question

2007-12-20 Thread Herbert Xu

On Thu, Dec 20, 2007 at 04:00:37AM -0800, David Miller wrote:

 In the most ideal sense, tcp_window_allows() should probably
 be changed to only return MSS multiples.
 
 Unfortunately this would add an expensive modulo operation,
 however I think it would eliminate this problem case.

Well you only have to divide in the unlikely case of us being
limited by the receiver window.  In that case speed is probably
not of the essence anyway.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: After many hours all outbound connections get stuck in SYN_SENT

2007-12-20 Thread Glen Turner

[speculation by network engineer -- not kernel hacker -- follows]

 The router could be sooo crappy that it drops all packets from
 TCP streams that have SACK enabled and the client has opened
 200+ SACK connections previously... something like that?

As far as any third party is concerned the existing TCP connections
continue to have negotiated SACK Permitted. Only new connections
will not negotiate this.  So router crappiness promptly disappearing
doesn't seem too likely (a way I could see this happening is if the
Linux box sends an Ack for each connection and this clears out Sack
data structures on the third party).

But I'd be very surprised if the router is acting as anything more
than a network-layer device. It might perhaps have some soft connection
state being used for generating accounting records.  Being Cisco
it's probably a switch-router, so it might carry some per-port hard
state for validating source IP addresses and ARPs on each port.

The firewall is much more likely to be carrying per-flow Sack
state. The Cisco PIX had a bug with SACK handling (CSCse14419,
fixed in 7.0(7), 7.1(2.34), 7.2(2.2), 8.0(0.141) but perhaps it
has regressed). A simple trace either side of the firewall will
show the inconsistency between the TCP sequence number (which
gets randomised) and the Sack sequence number (which didn't).
You could disable the TCP Sequence Number Randomisation feature
and see if the fault reoccurs.

You should probably also investigate the Linux kernel,
especially the size and locks of the components of the Sack data
structures and what happens to those data structures after Sack is
disabled (presumably the Sack data structure is in some unhappy
circumstance, and disabling Sack allows the data to be discarded,
magically unclogging the box).

In the absence of the reporter wanting to dump the kernel's
core, how about a patch to print the Sack datastructure when
the command to disable Sack is received by the kernel?
Maybe just print the last 16b of the IP address?

Best wishes, Glen


Re: [PATCH/.24] [NET] fs_enet: check for phydev existence in the ethtool handlers

2007-12-20 Thread Vitaly Bordug

On Thu, 20 Dec 2007 16:59:23 +0300
Anton Vorontsov wrote:

 Otherwise oops will happen if ethernet device has not been opened:
 
 Unable to handle kernel paging request for data at address 0x014c
 Faulting instruction address: 0xc016f7f0
 Oops: Kernel access of bad area, sig: 11 [#1]
 MPC85xx
 NIP: c016f7f0 LR: c01722a0 CTR: 
 REGS: c79ddc70 TRAP: 0300   Not tainted  (2.6.24-rc3-g820a386b)
 MSR: 00029000 EE,ME  CR: 20004428  XER: 2000
 DEAR: 014c, ESR: 
 TASK = c789f5e0[999] 'snmpd' THREAD: c79dc000
 GPR00: c01aceb8 c79ddd20 c789f5e0  c79ddd3c  c79ddd64
  GPR08:  c7845b60 c79dde3c c01ace80 20004422 200249fc
 02a0 100da728 GPR16: 100c    20022078
 0009 200220e0 bfc85558 GPR24: c79ddd3c   c02e0e70
 c022fc64  c7845800 bfc85498 NIP [c016f7f0]
 phy_ethtool_gset+0x0/0x4c LR [c01722a0] fs_get_settings+0x18/0x28
 Call Trace:
 [c79ddd20] [c79dde38] 0xc79dde38 (unreliable)
 [c79ddd30] [c01aceb8] dev_ethtool+0x294/0x11ec
 [c79dde30] [c01aaa44] dev_ioctl+0x454/0x6a8
 [c79ddeb0] [c019b9d4] sock_ioctl+0x84/0x230
 [c79dded0] [c007ded8] do_ioctl+0x34/0x8c
 [c79ddee0] [c007dfbc] vfs_ioctl+0x8c/0x41c
 [c79ddf10] [c007e38c] sys_ioctl+0x40/0x74
 [c79ddf40] [c000d4c0] ret_from_syscall+0x0/0x3c
 Instruction dump:
 8163 800b0030 2f80 419e0010 7c0803a6 4e800021 7c691b78
 80010014 7d234b78 38210010 7c0803a6 4e800020 8003014c 7c6b1b78
 3860 90040004
 
 Signed-off-by: Anton Vorontsov [EMAIL PROTECTED]
Acked-by: Vitaly Bordug [EMAIL PROTECTED]

Jeff: this fix is important and should be merged if possible.
 ---
  drivers/net/fs_enet/fs_enet-main.c |   11 +--
  1 files changed, 9 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/net/fs_enet/fs_enet-main.c
 b/drivers/net/fs_enet/fs_enet-main.c index f2a4d39..23fddc3 100644
 --- a/drivers/net/fs_enet/fs_enet-main.c
 +++ b/drivers/net/fs_enet/fs_enet-main.c
 @@ -897,14 +897,21 @@ static void fs_get_regs(struct net_device *dev,
 struct ethtool_regs *regs, static int fs_get_settings(struct
 net_device *dev, struct ethtool_cmd *cmd) {
   struct fs_enet_private *fep = netdev_priv(dev);
 +
 + if (!fep->phydev)
 + return -ENODEV;
 +
   return phy_ethtool_gset(fep->phydev, cmd);
  }
  
  static int fs_set_settings(struct net_device *dev, struct
 ethtool_cmd *cmd) {
   struct fs_enet_private *fep = netdev_priv(dev);
 - phy_ethtool_sset(fep->phydev, cmd);
 - return 0;
 +
 + if (!fep->phydev)
 + return -ENODEV;
 +
 + return phy_ethtool_sset(fep->phydev, cmd);
  }
  
  static int fs_nway_reset(struct net_device *dev)


-- 
Sincerely, Vitaly

Re: TSO trimming question

2007-12-20 Thread John Heffner

David Miller wrote:

From: Ilpo_Järvinen [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)

[PATCH] [TCP]: Fix TSO deferring

I'd say that most of what tcp_tso_should_defer had in between
there was dead code because of this.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]

Yikes!

John, we've been living a lie for more than a year. :-/

On the bright side this explains a lot of small TSO frames I've been
seeing in traces over the past year but never got a chance to
investigate.

Ouch.  This fix may improve some benchmarks.

Re-checking this function was on my list of things to do because I had 
also noticed some TSO frames that seemed a bit small.  This clearly 
explains it.

  -John

Re: After many hours all outbound connections get stuck in SYN_SENT

2007-12-20 Thread James Nichols

 I still dont understand.

 tcpdump -p -n -s 1600 -c 1 doesnt reveal User data at all.

 Without any exact data from you, I am afraid nobody can help.

Oh, I didn't see that you specified specific options.  I'll still have
to anonymize 2000+ IP addresses, but I think there is an open source
tool that will do this for you.



  2) Are you sure you are not using connection tracking, and hit a limit on 
  it ?
 
  I'm using ip_conntrack, but the limit I have for max entries is 65K.
  The most I've seen in there are a couple thousand- that was one of the
  first things I monitored very closely.

 Now please try without conn tracking module. I saw many failures in the past
 that were triggered by conntrack.

 Do you have some firewall rules, using some netfilter modules like hashlimit ?

I will have to look into this.

Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly

2007-12-20 Thread Satoru SATOH

I see. HZ can be < 1000, so I must be wrong.

however, i got the following,

[root iproute2.org]# ./ip/ip route change 192.168.140.0/24 dev eth1 rto_min 4s
[root iproute2.org]# gdb -q ./ip/ip
Using host libthread_db library /lib/libthread_db.so.1.
(gdb) br iproute.c:512
Breakpoint 1 at 0x804fc8d: file iproute.c, line 512.
(gdb) r route show dev eth1
Starting program: /root/iproute2.org/ip/ip route show dev eth1

Breakpoint 1, print_route (who=0xbfb9854c, n=0xbfb94528, arg=0x6404c0)
at iproute.c:512
512 unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
(gdb) l 512,522
512 unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
513
514 val *= 1000;
515 if (i == RTAX_RTT)
516 val /= 8;
517 else if (i == RTAX_RTTVAR)
518 val /= 4;
519 if (val >= hz)
520 fprintf(fp, " %ums", val/hz);
521 else
522 fprintf(fp, " %.2fms", (float)val/hz);
(gdb) p hz
$1 = 1000000000
(gdb) n
514 val *= 1000;
(gdb) p val
$2 = 4000000000
(gdb) p val/ (hz / 1000)
$3 = 4000
(gdb) n
515 if (i == RTAX_RTT)
(gdb) p val
$4 = 1385447424
(gdb) c
Continuing.
192.168.140.0/24  scope link  rto_min lock 1ms

Program exited normally.
(gdb)
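
The wrap-around seen in the session can be reproduced outside gdb. The
sketch below assumes a 32-bit unsigned value of 4000000000 and hz of
1000000000; those inputs are assumptions chosen because they reproduce
the 1385447424 and 4000 printed above. to_ms_wide() is a hypothetical
alternative and is not part of the posted patch.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the listing: scale first, then divide. The multiply wraps
 * modulo 2^32 for large inputs. */
static uint32_t to_ms_scale_first(uint32_t val, uint32_t hz)
{
	assert(hz > 0);
	val *= 1000;
	return val / hz;
}

/* Shape of the posted workaround: shrink hz first. Breaks down when
 * hz < 1000, because hz / 1000 is then 0. */
static uint32_t to_ms_divide_first(uint32_t val, uint32_t hz)
{
	assert(hz >= 1000);
	return val / (hz / 1000);
}

/* Hypothetical alternative: widen before multiplying, so neither the
 * overflow nor a small hz is a problem. */
static uint64_t to_ms_wide(uint32_t val, uint32_t hz)
{
	assert(hz > 0);
	return (uint64_t)val * 1000 / hz;
}
```

With the assumed inputs the first function yields 1, which matches the
"1ms" printed by ip, while the other two yield 4000.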


Thanks,
Satoru SATOH

2007/12/20, Jarek Poplawski [EMAIL PROTECTED]:
 On 20-12-2007 04:31, Satoru SATOH wrote:
  ip route show does not print correct value when larger rto_min is
  set (e.g. 3sec).
 
  This problem is because of overflow in print_route() and
  the patch below is a workaround fix for that.
 
 ...
  --- a/ip/iproute.c
  +++ b/ip/iproute.c
  @@ -510,16 +510,16 @@ int print_route(const struct sockaddr_nl *who,
  struct nlmsghdr *n, void *arg)
  fprintf(fp, " %u",
  *(unsigned*)RTA_DATA(mxrta[i]));
  else {
  unsigned val = 
  *(unsigned*)RTA_DATA(mxrta[i]);
  +   unsigned hz1 = hz / 1000;
 ...
  +   if (val >= hz1)
  +   fprintf(fp, " %ums", val/hz1);
 ...

 Probably I miss something or my iproute sources are too old, but:
 does this work with hz < 1000?

 Regards,
 Jarek P.

Please pull 'fixes-jgarzik' branch of wireless-2.6

2007-12-20 Thread John W. Linville

Jeff,

Here are a few more for 2.6.24...please let me know if there are any
problems!

Thanks,

John

P.S.  The rtl8187 USB ID is already in your upstream branch -- I'm sure
it would seem like a fix if it was the ID for your wireless stick. :-)

---

Individual patches are available here:


http://www.kernel.org//pub/linux/kernel/people/linville/wireless-2.6/fixes-jgarzik

---

The following changes since commit 82d29bf6dc7317aeb0a3a13c2348ca8591965875:
  Linus Torvalds (1):
Linux 2.6.24-rc5

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git 
fixes-jgarzik

Matthias Mueller (1):
  rtl8187: Add USB ID for Sitecom WL-168 v1 001

Michael Wu (1):
  p54: add Kconfig description

Reinette Chatre (1):
  ipw2200: prevent alloc of unspecified size on stack

Zhu Yi (1):
  iwlwifi: fix possible priv-mutex deadlock during suspend

 drivers/net/wireless/Kconfig|   51 +++
 drivers/net/wireless/ipw2200.c  |   13 ++-
 drivers/net/wireless/iwlwifi/iwl3945-base.c |   18 +++---
 drivers/net/wireless/iwlwifi/iwl4965-base.c |   18 +++---
 drivers/net/wireless/rtl8187_dev.c  |2 +
 5 files changed, 75 insertions(+), 27 deletions(-)

diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig
index 2b733c5..7bdf9da 100644
--- a/drivers/net/wireless/Kconfig
+++ b/drivers/net/wireless/Kconfig
@@ -586,15 +586,66 @@ config ADM8211
 config P54_COMMON
tristate "Softmac Prism54 support"
depends on MAC80211 && WLAN_80211 && FW_LOADER && EXPERIMENTAL
+   ---help---
+ This is common code for isl38xx based cards.
+ This module does nothing by itself - the USB/PCI frontends
+ also need to be enabled in order to support any devices.
+
+ These devices require softmac firmware which can be found at
+ http://prism54.org/
+
+ If you choose to build a module, it'll be called p54common.
 
 config P54_USB
tristate "Prism54 USB support"
depends on P54_COMMON && USB
select CRC32
+   ---help---
+ This driver is for USB isl38xx based wireless cards.
+ These are USB based adapters found in devices such as:
+
+ 3COM 3CRWE254G72
+ SMC 2862W-G
+ Accton 802.11g WN4501 USB
+ Siemens Gigaset USB
+ Netgear WG121
+ Netgear WG111
+ Medion 40900, Roper Europe
+ Shuttle PN15, Airvast WM168g, IOGear GWU513
+ Linksys WUSB54G
+ Linksys WUSB54G Portable
+ DLink DWL-G120 Spinnaker
+ DLink DWL-G122
+ Belkin F5D7050 ver 1000
+ Cohiba Proto board
+ SMC 2862W-G version 2
+ U.S. Robotics U5 802.11g Adapter
+ FUJITSU E-5400 USB D1700
+ Sagem XG703A
+ DLink DWL-G120 Cohiba
+ Spinnaker Proto board
+ Linksys WUSB54AG
+ Inventel UR054G
+ Spinnaker DUT
+
+ These devices require softmac firmware which can be found at
+ http://prism54.org/
+
+ If you choose to build a module, it'll be called p54usb.
 
 config P54_PCI
tristate "Prism54 PCI support"
depends on P54_COMMON && PCI
+   ---help---
+ This driver is for PCI isl38xx based wireless cards.
+ This driver supports most devices that are supported by the
+ fullmac prism54 driver plus many devices which are not
+ supported by the fullmac driver/firmware.
+
+ This driver requires softmac firmware which can be found at
+ http://prism54.org/
+
+ If you choose to build a module, it'll be called p54pci.
 
 source "drivers/net/wireless/iwlwifi/Kconfig"
 source "drivers/net/wireless/hostap/Kconfig"
diff --git a/drivers/net/wireless/ipw2200.c b/drivers/net/wireless/ipw2200.c
index 54f44e5..38ce8ee 100644
--- a/drivers/net/wireless/ipw2200.c
+++ b/drivers/net/wireless/ipw2200.c
@@ -1233,9 +1233,19 @@ static ssize_t show_event_log(struct device *d,
 {
struct ipw_priv *priv = dev_get_drvdata(d);
u32 log_len = ipw_get_event_log_len(priv);
-   struct ipw_event log[log_len];
+   u32 log_size;
+   struct ipw_event *log;
u32 len = 0, i;
 
+   /* not using min() because of its strict type checking */
+   log_size = PAGE_SIZE / sizeof(*log) > log_len ?
+   sizeof(*log) * log_len : PAGE_SIZE;
+   log = kzalloc(log_size, GFP_KERNEL);
+   if (!log) {
+   IPW_ERROR("Unable to allocate memory for log\n");
+   return 0;
+   }
+   log_len = log_size / sizeof(*log);
ipw_capture_event_log(priv, log_len, log);
 
len += snprintf(buf + len, PAGE_SIZE - len, "%08X", log_len);
@@ -1244,6 +1254,7 @@ static ssize_t show_event_log(struct device *d,
"\n%08X%08X%08X",
log[i].time, log[i].event, log[i].data);
len += snprintf(buf + len,

Please pull 'fixes-davem' branch of wireless-2.6

2007-12-20 Thread John W. Linville

Dave,

A few more stragglers for 2.6.24...let me know if there are any
problems!

Thanks,

John

---

Individual patches available here:


http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/fixes-davem

---

The following changes since commit 82d29bf6dc7317aeb0a3a13c2348ca8591965875:
  Linus Torvalds (1):
Linux 2.6.24-rc5

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git 
fixes-davem

Johannes Berg (2):
  mac80211: round station cleanup timer
  mac80211: warn when receiving frames with unaligned data

 net/mac80211/rx.c   |   13 +
 net/mac80211/sta_info.c |7 +--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
index 00f908d..a7263fc 100644
--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -1443,6 +1443,7 @@ void __ieee80211_rx(struct ieee80211_hw *hw, struct sk_buff *skb,
struct ieee80211_sub_if_data *prev = NULL;
struct sk_buff *skb_new;
u8 *bssid;
+   int hdrlen;
 
/*
 * key references and virtual interfaces are protected using RCU
@@ -1472,6 +1473,18 @@ void __ieee80211_rx(struct ieee80211_hw *hw, struct sk_buff *skb,
rx.fc = le16_to_cpu(hdr->frame_control);
type = rx.fc & IEEE80211_FCTL_FTYPE;
 
+   /*
+* Drivers are required to align the payload data to a four-byte
+* boundary, so the last two bits of the address where it starts
+* may not be set. The header is required to be directly before
+* the payload data, padding like atheros hardware adds which is
+* inbetween the 802.11 header and the payload is not supported,
+* the driver is required to move the 802.11 header further back
+* in that case.
+*/
+   hdrlen = ieee80211_get_hdrlen(rx.fc);
+   WARN_ON_ONCE(((unsigned long)(skb->data + hdrlen)) & 3);
+
if (type == IEEE80211_FTYPE_DATA || type == IEEE80211_FTYPE_MGMT)
local->dot11ReceivedFragmentCount++;
 
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index e849155..cfd8ee9 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <linux/skbuff.h>
 #include <linux/if_arp.h>
+#include <linux/timer.h>
 
 #include <net/mac80211.h>
 #include "ieee80211_i.h"
@@ -306,7 +307,8 @@ static void sta_info_cleanup(unsigned long data)
}
read_unlock_bh(&local->sta_lock);
 
-   local->sta_cleanup.expires = jiffies + STA_INFO_CLEANUP_INTERVAL;
+   local->sta_cleanup.expires =
+   round_jiffies(jiffies + STA_INFO_CLEANUP_INTERVAL);
add_timer(&local->sta_cleanup);
 }
 
@@ -345,7 +347,8 @@ void sta_info_init(struct ieee80211_local *local)
INIT_LIST_HEAD(&local->sta_list);
 
init_timer(&local->sta_cleanup);
-   local->sta_cleanup.expires = jiffies + STA_INFO_CLEANUP_INTERVAL;
+   local->sta_cleanup.expires =
+   round_jiffies(jiffies + STA_INFO_CLEANUP_INTERVAL);
local->sta_cleanup.data = (unsigned long) local;
local->sta_cleanup.function = sta_info_cleanup;
 
-- 
John W. Linville
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
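
The alignment rule described in the rx.c comment of the pull request above can be tried out in isolation. What follows is only a minimal user-space sketch: `payload_is_aligned` is a hypothetical helper name, not a mac80211 function, and it simply applies the same `& 3` test that the patch wraps in WARN_ON_ONCE().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of mac80211: returns true when the
 * payload that follows an 802.11 header of hdrlen bytes starts on a
 * four-byte boundary, i.e. the low two bits of its address are clear.
 */
static bool payload_is_aligned(const uint8_t *data, size_t hdrlen)
{
	return (((uintptr_t)(data + hdrlen)) & 3) == 0;
}
```

With a buffer that itself starts on a four-byte boundary, a 24-byte header leaves the payload aligned while a 26-byte header does not, unless the driver places the frame two bytes further into the buffer.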

Please pull 'upstream-davem' branch of wireless-2.6

2007-12-20 Thread John W. Linville

Dave,

These are destined for 2.6.25.  The patches fall mostly into two
categories: a new rate control algorithm for mac80211, and some
cfg80211 enhancements (including mac80211 patches to use them).

Also there are some small hits in the iwlwifi drivers related to
rate control.  I'll CC Jeff since his tree has a lot of iwlwifi symbol
renames and those patches will conflict (or break the build, or both)
when your tree and his finally come together.

Let me know if there are any problems!

John

P.S.  I have a few more related to the cfg80211 changes, but the
patches are cross-dependent on both your tree and Jeff's.  I will
probably send those to akpm in the meantime, and push them after
Linus has pulled both your tree and Jeff's in the 2.6.25 merge window.

---

Individual patches are available here:


http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/upstream-davem

---

The following changes since commit adc292d3280278282d7b0e0813ccda711e739b5f:
  Herbert Xu (1):
[IPSEC]: Do xfrm_state_check_space before encapsulation

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream-davem

Johannes Berg (13):
  mac80211: clean up eapol frame handling/port control
  mac80211: clean up eapol handling in TX path
  mac80211: make ieee80211_rx_mgmt_action static
  mac80211: allow easier multicast/broadcast buffering in hardware
  cfg80211/nl80211: introduce key handling
  mac80211: support adding/removing keys via cfg80211
  mac80211: support getting key sequence counters via cfg80211
  cfg80211/nl80211: add beacon settings
  cfg80211/nl80211: station handling
  cfg80211/nl80211: implement station attribute retrieval
  mac80211: implement station stats retrieval
  mac80211: move tx crypto decision
  mac80211: don't read ERP information from (re)association response

Mattias Nissler (4):
  mac80211: clean up rate selection
  mac80211: add PID controller based rate control algorithm
  rc80211-pid: add debugging
  rc80211-pid: export tuning parameters through debugfs

Ron Rindjunsky (1):
  mac80211: pass in PS_POLL frames

Stefano Brivio (4):
  mac80211: make PID rate control algorithm the default
  rc80211-pid: add rate behaviour learning algorithm
  rc80211-pid: add sharpening factor
  doc: fix typo in feature-removal-schedule

 Documentation/feature-removal-schedule.txt |   10 +-
 drivers/net/wireless/iwlwifi/iwl-3945-rs.c |   44 +--
 drivers/net/wireless/iwlwifi/iwl-4965-rs.c |   46 +--
 include/linux/nl80211.h|  154 ++
 include/net/cfg80211.h |  167 +++
 include/net/mac80211.h |   17 +-
 net/mac80211/Kconfig   |   63 +++-
 net/mac80211/Makefile  |   16 +-
 net/mac80211/cfg.c |  202 -
 net/mac80211/debugfs_netdev.c  |   27 +-
 net/mac80211/ieee80211.c   |   21 +-
 net/mac80211/ieee80211_i.h |   24 +-
 net/mac80211/ieee80211_iface.c |1 -
 net/mac80211/ieee80211_rate.c  |   59 +++-
 net/mac80211/ieee80211_rate.h  |   76 ++--
 net/mac80211/ieee80211_sta.c   |   35 +-
 net/mac80211/rc80211_pid.h |  261 ++
 net/mac80211/rc80211_pid_algo.c|  510 +++
 net/mac80211/rc80211_pid_debugfs.c |  223 +
 net/mac80211/rc80211_simple.c  |   64 +--
 net/mac80211/rx.c  |  144 +++---
 net/mac80211/tx.c  |  171 ---
 net/mac80211/util.c|   24 +-
 net/mac80211/wep.c |   10 -
 net/mac80211/wpa.c |   14 -
 net/wireless/core.c|3 +
 net/wireless/nl80211.c |  737 
 27 files changed, 2692 insertions(+), 431 deletions(-)
 create mode 100644 net/mac80211/rc80211_pid.h
 create mode 100644 net/mac80211/rc80211_pid_algo.c
 create mode 100644 net/mac80211/rc80211_pid_debugfs.c

Omnibus patch attached as 'upstream-davem.patch.bz2' due to size concerns.
-- 
John W. Linville
[EMAIL PROTECTED]


upstream-davem.patch.bz2
Description: BZip2 compressed data

Please pull 'upstream-jgarzik' branch of wireless-2.6

2007-12-20 Thread John W. Linville

Jeff,

More for 2.6.25...Mr. Woodhouse continues his savage assault on
libertas, the b43legacy version of the rfkill led patch is here
(b43legacy rfkill stuff is not in 2.6.24), and there are a couple of
iwlwifi patches as well.

Let me know if there are problems!

Thanks,

John

---

Individual patches are available here:


http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/upstream-jgarzik

---

The following changes since commit b503d38b01bf313e4f1250c4ded89fc10a1d3da0:
  Ramkrishna Vepa (1):
S2io: Fixes to enable multiple transmit fifos

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream-jgarzik

David Woodhouse (38):
  libertas: don't exit worker thread until kthread_stop() is called
  libertas: stop attempting to reset devices on unload
  libertas: clean up if_usb driver
  libertas: kill whitespace at end of lines
  libertas: kill unused wait_option field in struct cmd_ctrl_node
  libertas: rename and clean up DownloadcommandToStation
  libertas: don't use __lbs_cmd() with empty callback in if_usb.c
  libertas: remove some pointless checks for cmdnode buffer being present
  libertas: introduce and use lbs_complete_command() for command completion
  libertas: don't re-initialise cmdnode when taking it off the free queue
  libertas: kill cleanup_cmdnode()
  libertas: let __lbs_cmd() free its own cmdnode
  libertas: kill pdata_buf member of struct cmd_ctrl_node
  libertas: store command result in cmdnode instead of priv->cur_cmd_retcode
  libertas: add __lbs_cmd_async() for asynchronous command submission
  libertas: ensure response buffer size is always set for lbs_cmd_with_response
  libertas: handle command timeout in main thread instead of directly in timer
  libertas: kill 'addtail' argument to lbs_queue_cmd() and make it static
  libertas: fix return from lbs_update_channel()
  libertas: add SLEEP_PERIOD and FW_WAKE_METHOD command definitions
  libertas: fix buffer handling of PS_MODE commands and responses
  libertas: don't clear priv->dnld_sent after sending sleep confirm
  libertas: handle HOST_AWAKE event by sending WAKEUP_CONFIRM command
  libertas: allow for PS mode to be disabled when firmware doesn't support it
  libertas: Check for PS mode support on USB devices
  libertas: reduce explicit references to priv->cur_cmd->cmdbuf
  libertas: use priv->upld_buf for command responses
  libertas: discard DEFER responses to commands; let the timeout trigger
  libertas: make lbs_submit_command always 'succeed' and set command timer
  libertas: submit RSSI command on tx timeout, to check whether module is dead
  libertas: convert RADIO_CONTROL to a direct command
  libertas: convert INACTIVITY_TIMEOUT to a direct command
  libertas: convert SLEEP_PARAMS to a direct command
  libertas: convert SET_WEP to a direct command
  libertas: convert ENABLE_RSN to a direct command
  libertas: change inference about buffer size in lbs_cmd()
  libertas: convert SUBSCRIBE_EVENT to a direct command
  libertas: remove check for driver_lock in lbs_interrupt()

Larry Finger (1):
  b43legacy: Fix rfkill radio LED

Zhu Yi (2):
  iwlwifi: proper monitor support
  iwlwifi: skip mac80211 conf during a hardware scan and replay it afterwards

 drivers/net/wireless/b43legacy/leds.c   |4 +
 drivers/net/wireless/b43legacy/main.c   |   20 +-
 drivers/net/wireless/b43legacy/rfkill.c |  133 ---
 drivers/net/wireless/iwlwifi/iwl-3945.c |  120 +-
 drivers/net/wireless/iwlwifi/iwl-3945.h |   38 +--
 drivers/net/wireless/iwlwifi/iwl-4965.c |  120 ++-
 drivers/net/wireless/iwlwifi/iwl-4965.h |   26 +--
 drivers/net/wireless/iwlwifi/iwl3945-base.c |  139 +--
 drivers/net/wireless/iwlwifi/iwl4965-base.c |  122 +--
 drivers/net/wireless/libertas/assoc.c   |   61 ++--
 drivers/net/wireless/libertas/cmd.c |  565 +++
 drivers/net/wireless/libertas/cmd.h |   29 ++-
 drivers/net/wireless/libertas/cmdresp.c |  162 +++-
 drivers/net/wireless/libertas/debugfs.c |  350 -
 drivers/net/wireless/libertas/decl.h|9 +-
 drivers/net/wireless/libertas/dev.h |   19 +-
 drivers/net/wireless/libertas/host.h|8 +
 drivers/net/wireless/libertas/hostcmd.h |   47 ++-
 drivers/net/wireless/libertas/if_cs.c   |   10 +-
 drivers/net/wireless/libertas/if_sdio.c |   10 +-
 drivers/net/wireless/libertas/if_usb.c  |  470 ++-
 drivers/net/wireless/libertas/if_usb.h  |   95 ++---
 drivers/net/wireless/libertas/main.c|   92 +++--
 drivers/net/wireless/libertas/tx.c  |4 +-
 drivers/net/wireless/libertas/wext.c|7 +
 25 files changed, 1200 insertions(+), 1460 deletions(-)

Omnibus

Re: After many hours all outbound connections get stuck in SYN_SENT

2007-12-20 Thread James Nichols

 But I'd be very surprised if the router is acting as anything more
 than a network-layer device. It might perhaps have some soft connection
 state being used for generating accounting records.  Being Cisco
 it's probably a switch-router, so it might carry some per-port hard
 state for validating source IP addresses and ARPs on each port.

 The firewall is much more likely to be carrying per-flow Sack
 state. The Cisco PIX had a bug with SACK handling (CSCse14419,
 fixed in 7.0(7), 7.1(2.34), 7.2(2.2), 8.0(0.141) but perhaps it
 has regressed). A simple trace either side of the firewall will
 show the inconsistency between the TCP sequence number (which
 gets randomised) and the Sack sequence number (which didn't).
 You could disable the TCP Sequence Number Randomisation feature
 and see if the fault reoccurs.

I do have TCP Sequence # Randomization enabled on my router.  However,
if this was causing an issue, wouldn't it always occur and cause
connection issues, not just after 38 hours of correct operation?  I
can look into turning this off, but I'll likely have to jump through
several hoops which will be challenging if I don't have a very clear
definitive reason why this is causing this issue.  Plus, I've had this
problem with at least 2 other sets of network switches over the past 4
years.  I'm actually running 7.0(6), which doesn't have the fix you
mentioned.  If it really is possible that this issue wouldn't always
cause problems, but only after hours of successful operation, then I
could probably motivate the upgrade.  I can try to set up a trace, but
this is a lot of work for other people in my organization, so it will
take quite some time.


 You should probably also investigate the Linux kernel,
 especially the size and locks of the components of the Sack data
 structures and what happens to those data structures after Sack is
 disabled (presumably the Sack data structure is in some unhappy
 circumstance, and disabling Sack allows the data to be discarded,
 magically unclogging the box).

 In the absence of the reporter wanting to dump the kernel's
 core, how about a patch to print the Sack datastructure when
 the command to disable Sack is received by the kernel?
 Maybe just print the last 16b of the IP address?

Given the fact that I've had this problem for so long, over a variety
of networking hardware vendors and colo-facilities, this really sounds
good to me.  It will be challenging for me to justify a kernel core
dump, but a simple patch to dump the Sack data would be do-able.
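
The firewall theory quoted in the message above (sequence numbers rewritten by randomisation, SACK blocks left alone) can be illustrated with a small plausibility check. This is only a sketch under stated assumptions: `seq_before` and `sack_block_plausible` are hypothetical names, and the check models a sender that only accepts SACK blocks lying inside its own window of unacknowledged data; it is not a copy of the kernel's validation code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe comparison in 32-bit sequence space: true if a is before b. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * Hypothetical check: a SACK block [start, end) is plausible only if it
 * is non-empty and lies within [snd_una, snd_nxt], the data the sender
 * has transmitted but not yet seen acknowledged.
 */
static bool sack_block_plausible(uint32_t snd_una, uint32_t snd_nxt,
				 uint32_t start, uint32_t end)
{
	return seq_before(start, end) &&
	       !seq_before(start, snd_una) &&
	       !seq_before(snd_nxt, end);
}
```

If a middlebox shifts the sequence space by some offset but passes SACK blocks through untranslated, the blocks arrive expressed in the wrong space and fail a check of this kind, which is the inconsistency a trace on either side of the firewall would show.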

Re: [PATCH 2/2] net: neighbor timer power saving

2007-12-20 Thread Stephen Hemminger

On Wed, 19 Dec 2007 08:23:43 +0100
Eric Dumazet [EMAIL PROTECTED] wrote:

 Stephen Hemminger a écrit :
  The neighbor GC timer runs once a second, but it doesn't need to wake
  up the machine.
  
  Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
  
  --- a/net/core/neighbour.c  2007-12-18 07:46:07.0 -0800
  +++ b/net/core/neighbour.c  2007-12-18 07:47:36.0 -0800
  @@ -270,7 +270,7 @@ static struct neighbour *neigh_alloc(str
  n->nud_state  = NUD_NONE;
  n->output = neigh_blackhole;
  n->parms  = neigh_parms_clone(&tbl->parms);
  -   init_timer(&n->timer);
  +   init_timer_deferrable(&n->timer);
  n->timer.function = neigh_timer_handler;
  n->timer.data = (unsigned long)n;
   
  @@ -740,7 +740,7 @@ static void neigh_timer_handler(unsigned
   
  state = neigh->nud_state;
  now = jiffies;
  -   next = now + HZ;
  +   next = round_jiffies(now + HZ);
   
  if (!(state & NUD_IN_TIMER)) {
   #ifndef CONFIG_SMP
  @@ -1372,7 +1372,7 @@ void neigh_table_init_no_netlink(struct 
  get_random_bytes(&tbl->hash_rnd, sizeof(tbl->hash_rnd));
   
  rwlock_init(&tbl->lock);
  -   init_timer(&tbl->gc_timer);
  +   init_timer_deferrable(&tbl->gc_timer);
  tbl->gc_timer.data = (unsigned long)tbl;
  tbl->gc_timer.function = neigh_periodic_timer;
  tbl->gc_timer.expires  = now + 1;
 
 I wonder if this deferrable timer thing is the right way to go.
 
 (like read_mostly thing if you want :) )
 
 We are going to convert 99% timers to deferrable.
 
 Maybe the right move should be to have the reverse attribute, to mark a timer
 as non deferrable...
 
 Also, why use round_jiffies() on a deferrable timer ? That sounds unnecessary ?

Thinking about it more, this looks like a case for just using round_jiffies().
The GC timer needs to run to clean up under DoS attack, and deferring it
probably isn't a good idea.

-- 
Stephen Hemminger [EMAIL PROTECTED]

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Stephen Hemminger

On Tue, 18 Dec 2007 20:13:28 -0500 (EST)
Parag Warudkar [EMAIL PROTECTED] wrote:

 
 sky2 can use deferrable timer for watchdog - reduces wakeups from idle per 
 second.
 
 Signed-off-by: Parag Warudkar [EMAIL PROTECTED]
 
 --- linux-2.6/drivers/net/sky2.c  2007-12-07 10:04:39.0 -0500
 +++ linux-2.6-work/drivers/net/sky2.c 2007-12-18 20:07:58.0 -0500
 @@ -4230,7 +4230,10 @@
   sky2_show_addr(dev1);
   }
 
 - setup_timer(&hw->watchdog_timer, sky2_watchdog, (unsigned long) hw);
 + hw->watchdog_timer.function = sky2_watchdog;
 + hw->watchdog_timer.data = (unsigned long) hw;
 + init_timer_deferrable(&hw->watchdog_timer);
 +
   INIT_WORK(&hw->restart_work, sky2_restart);
 
   pci_set_drvdata(pdev, hw);

Does it really reduce the wakeups, or only change who gets charged by powertop?
The system is going to wake up once a second anyway. It looks to me that if the
timer is using round_jiffies(), setting deferrable just changes the accounting.

My interpretation of the API is:
   * round_jiffies() - timer wants to wake up but isn't precise about when, so
     schedule it on the next second, when the system will wake up anyway;
     e.g. why meetings are usually scheduled on the hour

   * deferrable - timer doesn't really have to wake up but wants to happen
     near a particular time; e.g. "I'll meet you at the pub around 8pm"

Therefore deferrable is unnecessary for timers using round_jiffies() unless the
system is so good at doing timers that it is going to skip doing a timer once
per second.

-- 
Stephen Hemminger [EMAIL PROTECTED]
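
The "schedule on the next second" reading of round_jiffies() in the message above can be sketched in user space. This is a simplified model, not the kernel source: the HZ value is assumed, the per-CPU skew that the real helper applies is left out, and the quarter-second round-down threshold is stated here as an assumption about kernels of this era.

```c
#define HZ 1000UL	/* assumed tick rate for this sketch */

/*
 * Simplified model of round_jiffies(): move an expiry time j to a
 * whole-second boundary so that unrelated timers tend to fire together.
 * 'now' stands in for the kernel's global jiffies counter.
 */
static unsigned long round_jiffies_model(unsigned long j, unsigned long now)
{
	unsigned long rem = j % HZ;
	unsigned long rounded;

	if (rem < HZ / 4)
		rounded = j - rem;		/* just past a boundary: round down */
	else
		rounded = j - rem + HZ;		/* otherwise round up */

	/* Never hand back a time that has already passed. */
	return rounded <= now ? j : rounded;
}
```

Note that rounding only changes when the timer fires; the timer still forces a wakeup at the rounded time, which is the distinction from a deferrable timer that the thread goes on to discuss.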

Re: [PATCH] e1000e: Use deferrable timer for watchdog

2007-12-20 Thread Kok, Auke

Parag Warudkar wrote:
 
 Reduce wakeups from idle per second.
 
 Signed-off-by: Parag Warudkar [EMAIL PROTECTED]
 
 --- linux-2.6/drivers/net/e1000e/netdev.c    2007-12-07 10:04:39.0 -0500
 +++ linux-2.6-work/drivers/net/e1000e/netdev.c    2007-12-18 20:45:59.0 -0500
 @@ -3899,7 +3899,7 @@
  goto err_eeprom;
  }
 
 -init_timer(&adapter->watchdog_timer);
 +init_timer_deferrable(&adapter->watchdog_timer);
  adapter->watchdog_timer.function = e1000_watchdog;
  adapter->watchdog_timer.data = (unsigned long) adapter;


I can't even apply this patch and the e1000 one... not only is it whitespace
damaged, it is also not properly formatted as a patch at all. If you want me to
take these patches seriously, then please fix the formatting issues.

Auke

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Stephen Hemminger

On Thu, 20 Dec 2007 17:29:23 +

[EMAIL PROTECTED] wrote:

 NO_HZ kernels don't do timers every second - if you do round_jiffies() the 
 kernel will wakeup and run the timer at that time no matter what. 

 The reason deferrable was introduced is to avoid waking up the kernel just 
 for this one timer that can be called when the CPU is not idle for some 
 reason other than this timer.

 In other words let's say there were two timers - one non-deferrable expiring 
 in 3 seconds and other deferrable, expiring in 1.5 seconds. The kernel will 
 not wake up twice - once for 1.5 second and other for 3 second - it will wake 
 up once at expiry of 3 second timer and execute both the 1.5 second and 3 
 second timers.

 And this is not just powertop accounting thing - like I said the total num of 
 wakeups per second go down with this patch.

 Parag

 Sent via BlackBerry from T-Mobile

Quit top-posting!

If this is the case then the whole usage of round_jiffies() is bogus. All users
of round_jiffies() should just be converted to deferrable??  I am a bit
concerned that if deferrable gets used everywhere then a strange situation would
occur where all timers were waiting for some other timer to finally happen, kind
of a weird timelock situation. Like the old chip/dale cartoon:
 you first, no you first, after you mister chip, no after you mister dale,...

-- 
Stephen Hemminger [EMAIL PROTECTED]
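
The two-timer example from the quoted reply above (a non-deferrable timer at 3 seconds and a deferrable one at 1.5 seconds) can be put into a toy model. This is only a sketch of the behaviour as described in the thread, not kernel code; `struct model_timer` and `count_idle_wakeups` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_timer {
	unsigned long expires_ms;	/* requested expiry */
	bool deferrable;		/* may wait for some other wakeup */
};

/*
 * Count how many times an otherwise idle CPU must wake up to service
 * the timers. Only non-deferrable timers force a wakeup; deferrable
 * timers that have already expired run during that same wakeup.
 * Deferrable timers with nothing to piggyback on are not counted,
 * because this model contains nothing else that would wake the CPU.
 */
static unsigned int count_idle_wakeups(const struct model_timer *t, size_t n)
{
	unsigned int wakeups = 0;
	size_t i, k;

	for (i = 0; i < n; i++) {
		bool seen = false;

		if (t[i].deferrable)
			continue;
		/* Count each distinct non-deferrable expiry only once. */
		for (k = 0; k < i; k++)
			if (!t[k].deferrable &&
			    t[k].expires_ms == t[i].expires_ms)
				seen = true;
		if (!seen)
			wakeups++;
	}
	return wakeups;
}
```

With the 3000 ms timer non-deferrable and the 1500 ms timer deferrable, the model reports one wakeup; with both non-deferrable it reports two. The last case in the comment, deferrable timers with no other wakeup source, is exactly the "you first, no you first" worry raised in the reply.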

Re: [PATCH] e1000e: Use deferrable timer for watchdog

2007-12-20 Thread Parag Warudkar

On Dec 20, 2007 12:05 PM, Kok, Auke [EMAIL PROTECTED] wrote:

 I can't even apply this patch and the e1000 one... not only is it whitespace
 damaged it is also not properly formatted as patch at all. If you want me to 
 take
 these patches seriously, then please fix the formatting issues.

Sigh - I use Pine, follow Documentation/email-clients.txt for the
recommended settings and obviously the patches are not generated with
whitespace damage at my end as I test those before sending out.

So although I hate to see this happen there is nothing at this moment
that I can do - except for attaching the patch instead of inlining it.
Since they have already been reviewed inline, please see if the
attached patches work for you.

[EMAIL PROTECTED] linux-2.6]$ scripts/checkpatch.pl --no-tree ../../Patches/e1000_main.c.patch
total: 0 errors, 0 warnings, 8 lines checked

Your patch has no obvious style problems and is ready for submission.
[EMAIL PROTECTED] linux-2.6]$
[EMAIL PROTECTED] linux-2.6]$ vim drivers/net/e1000/e1000_main.c
[EMAIL PROTECTED] linux-2.6]$ patch -p1 < ../../Patches/e1000_main.c.patch
patching file drivers/net/e1000/e1000_main.c

[EMAIL PROTECTED] linux-2.6]$ scripts/checkpatch.pl --no-tree ../../Patches/e1000e-netdev.c.patch
total: 0 errors, 0 warnings, 8 lines checked

Your patch has no obvious style problems and is ready for submission.
[EMAIL PROTECTED] linux-2.6]$ patch -p1 < ../../Patches/e1000e-netdev.c.patch
patching file drivers/net/e1000e/netdev.c

Thanks

Parag


e1000_main.c.patch
Description: Binary data


e1000e-netdev.c.patch
Description: Binary data

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Parag Warudkar

On Dec 20, 2007 12:51 PM, Stephen Hemminger
[EMAIL PROTECTED] wrote:

 Quit top-posting!

 If this is the case then the whole usage of round_jiffies() is bogus. All 
 users of round_jiffies()
 should just be converted to deferrable??  I am a bit concerned that if 
 deferrable gets used everywhere
 then a strange situation would occur where all timers were waiting for some 
 other timer to finally
 happen, kind of a weird timelock situation. Like the old chip/dale cartoon:
  you first, no you first, after you mister chip, no after you mister 
 dale,...


Haha - I thought about this too. I think there should be a mechanism
where the machine does not idle infinitely even if there are no
non-deferrable timers. Something like an affordable QoS for
non-deferrable timers - the kernel wakes up after that interval and runs
all deferrable timers even if nothing non-deferrable is set to run.
So we still get the advantage of not having to wake individually for each
timer, and the non-deferrable timers do all get run in a reasonable
amount of time.

Who knows, maybe Thomas/Ingo already built in something of that nature or effect?!

Parag
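
The bounded-deferral idea floated in the message above can be sketched as a rule for choosing the next wakeup. Everything here is hypothetical: `struct qos_timer`, `next_wakeup_ms` and the `max_defer_ms` bound are illustration names for the proposal, and nothing in the thread says the kernel implements it this way.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

struct qos_timer {
	unsigned long expires_ms;	/* requested expiry */
	bool deferrable;		/* may be delayed past its expiry */
};

/*
 * Hypothetical policy: wake at the earliest non-deferrable expiry, but
 * never let a deferrable timer slip more than max_defer_ms past its
 * own expiry. Returns ULONG_MAX when there are no timers at all.
 */
static unsigned long next_wakeup_ms(const struct qos_timer *t, size_t n,
				    unsigned long max_defer_ms)
{
	unsigned long next = ULONG_MAX;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long deadline = t[i].expires_ms;

		if (t[i].deferrable) {
			/* Saturating add so the deadline cannot wrap. */
			if (deadline > ULONG_MAX - max_defer_ms)
				deadline = ULONG_MAX;
			else
				deadline += max_defer_ms;
		}
		if (deadline < next)
			next = deadline;
	}
	return next;
}
```

Under this rule a lone deferrable timer still runs, at worst max_defer_ms late, while a deferrable timer that expires before a non-deferrable one simply rides along with that wakeup.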

Re: [PATCH 2/2] net: neighbor timer power saving

2007-12-20 Thread Parag Warudkar

On Dec 20, 2007 12:10 PM, Stephen Hemminger
[EMAIL PROTECTED] wrote:

 Thinking about it more, this looks like a case for just using round_jiffies().
 The GC timer needs to run to clean up under DoS attack, and deferring it 
 probably
 isn't a good idea.

But what are the chances that a DoSed machine will be idling, which
would prevent the GC timer from running? I would think there would be a lot of
other activities going on (including non-deferrable timers running)
that will avoid this situation?

Parag

[PATCH 2/2] e1000: Use deferrable timer for watchdog

2007-12-20 Thread Auke Kok

From: Parag Warudkar [EMAIL PROTECTED]

Reduces wakeups from idle per second.

Signed-off-by: Parag Warudkar [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e1000/e1000_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 599153d..6af86fa 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1066,7 +1066,7 @@ e1000_probe(struct pci_dev *pdev,
adapter->tx_fifo_stall_timer.function = e1000_82547_tx_fifo_stall;
adapter->tx_fifo_stall_timer.data = (unsigned long) adapter;
 
-   init_timer(&adapter->watchdog_timer);
+   init_timer_deferrable(&adapter->watchdog_timer);
adapter->watchdog_timer.function = e1000_watchdog;
adapter->watchdog_timer.data = (unsigned long) adapter;
 


Re: [PATCH net-2.6.25 3/3] Uninline the inet_twsk_put function

2007-12-20 Thread Ingo Oeser

Pavel Emelyanov schrieb:
 This one is not that big, but is widely used: saves 1200 bytes 
 from net/ipv4/built-in.o
 +void inet_twsk_put(struct inet_timewait_sock *tw)
 +{
 + if (atomic_dec_and_test(&tw->tw_refcnt)) {
 + struct module *owner = tw->tw_prot->owner;
 + twsk_destructor((struct sock *)tw);
 +#ifdef SOCK_REFCNT_DEBUG
 + printk(KERN_DEBUG "%s timewait_sock %p released\n",
 +tw->tw_prot->name, tw);
 +#endif
 + kmem_cache_free(tw->tw_prot->twsk_prot->twsk_slab, tw);
 + module_put(owner);
 + }
 +}
 +EXPORT_SYMBOL_GPL(inet_twsk_put);

A more correct fix seems to be conversion to kref.

Just create an out-of-line inet_twsk_release() containing
something similar to the code inside these braces and modify
inet_twsk_put() to something like this:

static inline void inet_twsk_put(struct inet_timewait_sock *tw)
{
	kref_put(&tw->kref, inet_twsk_release);
}

David, can you see any reason (e.g. some crazy lock stuff) NOT to do this?


Best Regards

Ingo Oeser
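
The kref pattern suggested in the message above (a counter embedded in the object, with a release callback run when the last reference is dropped) can be modelled in user space. This is a minimal sketch, not the kernel's kref implementation: `struct ref`, `ref_put` and `struct tw_model` are hypothetical stand-ins, and C11 atomics replace the kernel's atomic_t.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct kref. */
struct ref {
	atomic_int count;
};

static void ref_init(struct ref *r)
{
	atomic_init(&r->count, 1);
}

static void ref_get(struct ref *r)
{
	atomic_fetch_add(&r->count, 1);
}

/* Drop one reference; run release() and return 1 if it was the last. */
static int ref_put(struct ref *r, void (*release)(struct ref *))
{
	if (atomic_fetch_sub(&r->count, 1) == 1) {
		release(r);
		return 1;
	}
	return 0;
}

/* Hypothetical object embedding the counter, like tw->kref in the mail. */
struct tw_model {
	struct ref ref;
	int released;
};

/* Out-of-line release: recover the containing object from the counter. */
static void tw_model_release(struct ref *r)
{
	struct tw_model *tw =
		(struct tw_model *)((char *)r - offsetof(struct tw_model, ref));

	tw->released = 1;	/* real code would free the object here */
}
```

The design point is the one made in the mail: the put path shrinks to a single call that can stay inline, while the bulky teardown lives in one out-of-line release function.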

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Kok, Auke

Stephen Hemminger wrote:

 If this is the case then the whole usage of round_jiffies() is bogus. All 
 users of round_jiffies()
 should just be converted to deferrable??  I am a bit concerned that if 
 deferrable gets used everywhere
 then a strange situation would occur where all timers were waiting for some 
 other timer to finally
 happen, kind of a weird timelock situation. Like the old chip/dale cartoon:
  you first, no you first, after you mister chip, no after you mister 
 dale,...

that's a dangerous situation indeed and I'd really like to know what the limits
are for deferring deferrable timers. Arjan, do you know? Anyone?

I don't see a danger just yet on normal systems - I get something like 10 wakeups
per second from just the kernel (acpi, ahci, usb) on most of my systems, which
guarantees that the watchdog runs often enough, but for embedded systems and
critical timers in other drivers this may quickly become an issue.

Auke

Re: [PATCH] e1000e: Use deferrable timer for watchdog

2007-12-20 Thread Kok, Auke

Parag Warudkar wrote:
 On Dec 20, 2007 12:05 PM, Kok, Auke [EMAIL PROTECTED] wrote:
 I can't even apply this patch and the e1000 one... not only is it whitespace
 damaged, it is also not properly formatted as a patch at all. If you want me
 to take these patches seriously, then please fix the formatting issues.
 
 Sigh - I use Pine, follow Documentation/email-clients.txt for the
 recommended settings and obviously the patches are not generated with
 whitespace damage at my end as I test those before sending out.
 
 So although I hate to see this happen there is nothing at this moment
 that I can do - except for attaching the patch instead of inlining it.
 Since they have already been reviewed inline, please see if the
 attached patches work for you.

here's what the files in my Maildir spool look like in vim (my vim displays a
'»' char for tabs and a '¶' for EOL):

 76 --- linux-2.6/drivers/net/e1000e/netdev.c»  2007-12-07 10:04:39.
 77 +++ linux-2.6-work/drivers/net/e1000e/netdev.c» 2007-12-18 20:45:59.
 78 @@ -3899,7 +3899,7 @@¶
 79   » »   goto err_eeprom;¶
 80   » }¶
 81 ¶
 82 -»  init_timer(&adapter->watchdog_timer);¶
 83 +»  init_timer_deferrable(&adapter->watchdog_timer);¶
 84   » adapter->watchdog_timer.function = &e1000_watchdog;¶
 85   » adapter->watchdog_timer.data = (unsigned long) adapter;¶
 86 ¶
 87 --¶

notice that there are two spaces instead of one. Also there's no line heading
the diff with 'diff a/foo b/foo', which is what throws off stg. And the -p
option is missing.


as for content, the patch looks OK to me. I ran the numbers and although there
was a slight average delay in the link up detection time, it is negligible (less
than 0.2sec difference over a bunch of measurements), and I confirmed your
powertop numbers are correct. As for the timer interval, the watchdog may
already be delayed up to 3 seconds safely; this doesn't change that.

I'll forward the patch. Care to make one for e100? Plenty of laptops with those
are still around! The embedded guys would love it, I think.

Thanks,

Auke


[PATCH 1/2] e1000e: Use deferrable timer for watchdog

2007-12-20 Thread Auke Kok

From: Parag Warudkar [EMAIL PROTECTED]

Reduce wakeups from idle per second.

Signed-off-by: Parag Warudkar [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e1000e/netdev.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 2422d16..59960d2 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -3931,7 +3931,7 @@ static int __devinit e1000_probe(struct pci_dev *pdev,
 		goto err_eeprom;
 	}
 
-	init_timer(&adapter->watchdog_timer);
+	init_timer_deferrable(&adapter->watchdog_timer);
 	adapter->watchdog_timer.function = &e1000_watchdog;
 	adapter->watchdog_timer.data = (unsigned long) adapter;
 


Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Arjan van de Ven




My interpretation of the api is:
   * round_jiffies()  - timer wants to wakeup but isn't precise about when, so
                        schedule on next second when system will wake up anyway;
                        e.g. why meetings are usually scheduled on the hour

   * deferrable       - timer doesn't have to really wakeup but wants to happen
                        near a particular time, e.g. "I'll meet you at the pub
                        around 8pm"


this is not correct.

deferrable means "if you're busy, wake me up at this time. But if not, don't
bother waking up for me, get to it later".

The "later" can be a LONG time later, several seconds easily, if not more.
(timers are on a per-cpu basis, and you may end up with a several-core system
where the common timers are all on another cpu than this one)
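
A minimal user-space sketch of this behaviour, using the two-timer example from earlier in the thread (a non-deferrable timer at 3 s and a deferrable one at 1.5 s). The `sim_timer` struct and both helpers are hypothetical models for illustration, not the kernel's `struct timer_list` or its timer wheel, and the model assumes a CPU that is otherwise idle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a pending timer; not the kernel's timer_list. */
struct sim_timer {
	unsigned long expires_ms;	/* requested expiry time */
	int deferrable;			/* non-zero: never forces a wakeup */
};

/*
 * Find the earliest time an idle CPU must wake up: the soonest
 * non-deferrable expiry. Returns 0 if only deferrable timers are
 * pending, i.e. nothing bounds how long they may be deferred.
 */
static int next_forced_wakeup(const struct sim_timer *t, size_t n,
			      unsigned long *when_ms)
{
	int found = 0;
	size_t i;

	assert(t != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (t[i].deferrable)
			continue;
		if (!found || t[i].expires_ms < *when_ms) {
			*when_ms = t[i].expires_ms;
			found = 1;
		}
	}
	return found;
}

/* Count the timers of either kind that are due when the CPU wakes. */
static size_t timers_run_at(const struct sim_timer *t, size_t n,
			    unsigned long now_ms)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (t[i].expires_ms <= now_ms)
			count++;
	return count;
}
```

With the two timers above, the model wakes once at 3000 ms and runs both; with only the deferrable timer pending it reports no forced wakeup at all, which is the "no limit" case.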




If this is the case then the whole usage of round_jiffies() is bogus. All users
of round_jiffies() should just be converted to deferrable?? I am a bit
concerned that if deferrable gets used everywhere then a strange situation
would occur where all timers were waiting for some other timer to finally
happen, kind of a weird timelock situation. Like the old chip/dale cartoon:
"you first, no you first, after you mister chip, no after you mister dale,..."




that's a dangerous situation indeed and I'd really like to know what the limits
are for deferring deferrable timers. Arjan, do you know? Anyone?


there is NO limit to deferring a timer. Do NOT use a deferrable timer if you
can't afford the timer to not happen within.. 10 to 100 seconds! (or more)
They are really meant for things where you CAN afford for it to not happen when
you're idle.




I don't see a danger just yet on normal systems - I get something like 10
wakeups per second from just the kernel (acpi, ahci, usb) on most of my systems,
which guarantees that the watchdog runs often enough, but for embedded systems
and critical timers in other drivers this may quickly become an issue


on my work desktop test box the average time between cpu wakeups is 1.4 seconds
(and that's single core). It would be higher if it wasn't for some hpet limit
issues.



Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Kok, Auke

Arjan van de Ven wrote:
 
 My interpretation of the api is:
    * round_jiffies()  - timer wants to wakeup but isn't precise about when, so
                         schedule on next second when system will wake up
                         anyway; e.g. why meetings are usually scheduled on
                         the hour

    * deferrable       - timer doesn't have to really wakeup but wants to
                         happen near a particular time, e.g. "I'll meet you at
                         the pub around 8pm"
 
 this is not correct.
 
 deferrable means "if you're busy, wake me up at this time. But if not,
 don't bother waking up for me, get to it later".
 
 The "later" can be a LONG time later, several seconds easily, if not more.
 (timers are on a per-cpu basis, and you may end up with a several-core
 system where the common timers are all on another cpu than this one)
 
 
 
 If this is the case then the whole usage of round_jiffies() is bogus.
 All users of round_jiffies() should just be converted to deferrable?? I am
 a bit concerned that if deferrable gets used everywhere then a strange
 situation would occur where all timers were waiting for some other timer to
 finally happen, kind of a weird timelock situation. Like the old chip/dale
 cartoon:
 "you first, no you first, after you mister chip, no after you mister dale,..."



 that's a dangerous situation indeed and I'd really like to know what
 the limits are for deferring deferrable timers. Arjan, do you know? Anyone?
 
 there is NO limit to deferring a timer. Do NOT use a deferrable timer if
 you can't afford the timer to not happen within.. 10 to 100 seconds! (or more)
 They are really meant for things where you CAN afford for it to not
 happen when you're idle.

ok, that's just bad and if there's no user-definable limit to the deferral I
definitely don't like this change.

Can I safely assume that any irq will cause all deferred timers to run?

If this is the case then for e1000 this patch is still OK since the watchdog
needs to run (1) after a link up/down interrupt or (2) to update statistics.
Those statistics won't increase if there is no traffic of course...

Auke

[PATCH 0/9][BNX2]: Add MSIX support.

2007-12-20 Thread Michael Chan

David, this patchset lays the foundation for supporting multiple MSIX
IRQs.  Only 1 additional MSIX is added to handle TX separately from RX
at the moment.  Multiple TX and RX rings will be added in the future.
Please review for 2.6.25.  Thanks.


[PATCH 1/9][BNX2]: Add function to fetch hardware tx index.

2007-12-20 Thread Michael Chan

[BNX2]: Add function to fetch hardware tx index.

This makes the code cleaner and easier to support different tx rings.
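
As a rough stand-alone illustration of what the new helper does with the raw index, the sketch below reproduces the "skip the last slot" adjustment in user space. The `SIM_*` names and the value 256 are assumptions for illustration (the value is taken from the "256 indices for 255 entries" comment in the driver); this is not the driver code itself.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_TX_DESC_CNT		256			/* indices per ring page (assumed) */
#define SIM_MAX_TX_DESC_CNT	(SIM_TX_DESC_CNT - 1)	/* usable entries per page */

/*
 * Adjust a raw hardware consumer index: the last index of each page is
 * not a usable entry, so an index that lands on it is bumped to the
 * first index of the next page.
 */
static uint16_t sim_get_hw_tx_cons(uint16_t raw)
{
	uint16_t cons = raw;

	/* the mask test below only works for a power-of-two page size */
	assert((SIM_TX_DESC_CNT & (SIM_TX_DESC_CNT - 1)) == 0);

	if ((cons & SIM_MAX_TX_DESC_CNT) == SIM_MAX_TX_DESC_CNT)
		cons++;
	return cons;
}
```

Keeping this adjustment in one helper means every reader of the hardware index sees the same, already-corrected value.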

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 469d259..f19a1e9 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2323,17 +2323,25 @@ bnx2_phy_int(struct bnx2 *bp)
 
 }
 
+static inline u16
+bnx2_get_hw_tx_cons(struct bnx2 *bp)
+{
+	u16 cons;
+
+	cons = bp->status_blk->status_tx_quick_consumer_index0;
+
+	if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
+		cons++;
+	return cons;
+}
+
 static void
 bnx2_tx_int(struct bnx2 *bp)
 {
-	struct status_block *sblk = bp->status_blk;
 	u16 hw_cons, sw_cons, sw_ring_cons;
 	int tx_free_bd = 0;
 
-	hw_cons = bp->hw_tx_cons = sblk->status_tx_quick_consumer_index0;
-	if ((hw_cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT) {
-		hw_cons++;
-	}
+	hw_cons = bnx2_get_hw_tx_cons(bp);
 	sw_cons = bp->tx_cons;
 
 	while (sw_cons != hw_cons) {
@@ -2385,14 +2393,10 @@ bnx2_tx_int(struct bnx2 *bp)
 
 		dev_kfree_skb(skb);
 
-		hw_cons = bp->hw_tx_cons =
-			sblk->status_tx_quick_consumer_index0;
-
-		if ((hw_cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT) {
-			hw_cons++;
-		}
+		hw_cons = bnx2_get_hw_tx_cons(bp);
 	}
 
+	bp->hw_tx_cons = hw_cons;
 	bp->tx_cons = sw_cons;
 	/* Need to make the tx_cons update visible to bnx2_start_xmit()
 	 * before checking for netif_queue_stopped().  Without the
@@ -2822,7 +2826,7 @@ bnx2_has_work(struct bnx2 *bp)
 	struct status_block *sblk = bp->status_blk;
 
 	if ((bnx2_get_hw_rx_cons(bp) != bp->rx_cons) ||
-	    (sblk->status_tx_quick_consumer_index0 != bp->hw_tx_cons))
+	    (bnx2_get_hw_tx_cons(bp) != bp->hw_tx_cons))
 		return 1;
 
 	if ((sblk->status_attn_bits & STATUS_ATTN_EVENTS) !=
@@ -2851,7 +2855,7 @@ static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget)
 		REG_RD(bp, BNX2_HC_COMMAND);
 	}
 
-	if (sblk->status_tx_quick_consumer_index0 != bp->hw_tx_cons)
+	if (bnx2_get_hw_tx_cons(bp) != bp->hw_tx_cons)
 		bnx2_tx_int(bp);
 
 	if (bnx2_get_hw_rx_cons(bp) != bp->rx_cons)
@@ -4917,7 +4921,7 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	REG_RD(bp, BNX2_HC_COMMAND);
 
 	udelay(5);
-	rx_start_idx = bp->status_blk->status_rx_quick_consumer_index0;
+	rx_start_idx = bnx2_get_hw_rx_cons(bp);
 
 	num_pkts = 0;
 
@@ -4947,11 +4951,10 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	pci_unmap_single(bp->pdev, map, pkt_size, PCI_DMA_TODEVICE);
 	dev_kfree_skb(skb);
 
-	if (bp->status_blk->status_tx_quick_consumer_index0 != bp->tx_prod) {
+	if (bnx2_get_hw_tx_cons(bp) != bp->tx_prod)
 		goto loopback_test_done;
-	}
 
-	rx_idx = bp->status_blk->status_rx_quick_consumer_index0;
+	rx_idx = bnx2_get_hw_rx_cons(bp);
 	if (rx_idx != rx_start_idx + num_pkts) {
 		goto loopback_test_done;
 	}



[PATCH 2/9][BNX2]: Restructure IRQ datastructures.

2007-12-20 Thread Michael Chan

[BNX2]: Restructure IRQ datastructures.

Add a table to keep track of multiple IRQs and restructure the IRQ
request and free functions so that they can be easily expanded to
handle multiple IRQs.
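
The idea of the table can be sketched outside the kernel roughly as follows: the handler is chosen once at setup time and stored next to the vector, so the request and free paths only read the table. All `sim_*` names are hypothetical stand-ins for illustration; the sketch uses a bounded copy for the name where the patch uses strcpy().

```c
#include <assert.h>
#include <string.h>

typedef int (*sim_irq_handler_t)(int irq, void *dev);

/* Hypothetical user-space stand-in for one IRQ table entry. */
struct sim_irq {
	sim_irq_handler_t handler;
	unsigned short vector;
	char name[16];
};

enum sim_int_mode { SIM_INTX, SIM_MSI, SIM_MSI_1SHOT };

/* Dummy handlers that just report which one was called. */
static int sim_interrupt(int irq, void *dev) { (void)irq; (void)dev; return SIM_INTX; }
static int sim_msi(int irq, void *dev) { (void)irq; (void)dev; return SIM_MSI; }
static int sim_msi_1shot(int irq, void *dev) { (void)irq; (void)dev; return SIM_MSI_1SHOT; }

/* Pick the handler once, at setup time, and record it in the table. */
static void sim_setup_int_mode(struct sim_irq *irq, enum sim_int_mode mode,
			       unsigned short vector, const char *name)
{
	assert(irq != NULL && name != NULL);

	irq->handler = sim_interrupt;		/* default: legacy INTx */
	if (mode == SIM_MSI_1SHOT)
		irq->handler = sim_msi_1shot;
	else if (mode == SIM_MSI)
		irq->handler = sim_msi;

	irq->vector = vector;
	strncpy(irq->name, name, sizeof(irq->name) - 1);
	irq->name[sizeof(irq->name) - 1] = '\0';	/* always terminate */
}
```

Growing this to several vectors then becomes a loop over table entries rather than more branches in the request and free functions.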

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index f19a1e9..83cdbde 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -5234,18 +5234,15 @@ static int
 bnx2_request_irq(struct bnx2 *bp)
 {
 	struct net_device *dev = bp->dev;
-	int rc = 0;
-
-	if (bp->flags & USING_MSI_FLAG) {
-		irq_handler_t	fn = bnx2_msi;
-
-		if (bp->flags & ONE_SHOT_MSI_FLAG)
-			fn = bnx2_msi_1shot;
+	unsigned long flags;
+	struct bnx2_irq *irq = &bp->irq_tbl[0];
+	int rc;
 
-		rc = request_irq(bp->pdev->irq, fn, 0, dev->name, dev);
-	} else
-		rc = request_irq(bp->pdev->irq, bnx2_interrupt,
-				 IRQF_SHARED, dev->name, dev);
+	if (bp->flags & USING_MSI_FLAG)
+		flags = 0;
+	else
+		flags = IRQF_SHARED;
+	rc = request_irq(irq->vector, irq->handler, flags, dev->name, dev);
 	return rc;
 }
 
@@ -5254,12 +5251,31 @@ bnx2_free_irq(struct bnx2 *bp)
 {
 	struct net_device *dev = bp->dev;
 
+	free_irq(bp->irq_tbl[0].vector, dev);
 	if (bp->flags & USING_MSI_FLAG) {
-		free_irq(bp->pdev->irq, dev);
 		pci_disable_msi(bp->pdev);
 		bp->flags &= ~(USING_MSI_FLAG | ONE_SHOT_MSI_FLAG);
-	} else
-		free_irq(bp->pdev->irq, dev);
+	}
+}
+
+static void
+bnx2_setup_int_mode(struct bnx2 *bp, int dis_msi)
+{
+	bp->irq_tbl[0].handler = bnx2_interrupt;
+	strcpy(bp->irq_tbl[0].name, bp->dev->name);
+
+	if ((bp->flags & MSI_CAP_FLAG) && !dis_msi) {
+		if (pci_enable_msi(bp->pdev) == 0) {
+			bp->flags |= USING_MSI_FLAG;
+			if (CHIP_NUM(bp) == CHIP_NUM_5709) {
+				bp->flags |= ONE_SHOT_MSI_FLAG;
+				bp->irq_tbl[0].handler = bnx2_msi_1shot;
+			} else
+				bp->irq_tbl[0].handler = bnx2_msi;
+		}
+	}
+
+	bp->irq_tbl[0].vector = bp->pdev->irq;
 }
 
 /* Called with rtnl_lock */
@@ -5278,15 +5294,8 @@ bnx2_open(struct net_device *dev)
 	if (rc)
 		return rc;
 
+	bnx2_setup_int_mode(bp, disable_msi);
 	napi_enable(&bp->napi);
-
-	if ((bp->flags & MSI_CAP_FLAG) && !disable_msi) {
-		if (pci_enable_msi(bp->pdev) == 0) {
-			bp->flags |= USING_MSI_FLAG;
-			if (CHIP_NUM(bp) == CHIP_NUM_5709)
-				bp->flags |= ONE_SHOT_MSI_FLAG;
-		}
-	}
 	rc = bnx2_request_irq(bp);
 
 	if (rc) {
@@ -5325,6 +5334,8 @@ bnx2_open(struct net_device *dev)
 			bnx2_disable_int(bp);
 			bnx2_free_irq(bp);
 
+			bnx2_setup_int_mode(bp, 1);
+
 			rc = bnx2_init_nic(bp);
 
 			if (!rc)
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 1f244fa..1accf00 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6494,6 +6494,15 @@ struct flash_spec {
 	u8	*name;
 };
 
+#define BNX2_MAX_MSIX_HW_VEC	9
+#define BNX2_MAX_MSIX_VEC	1
+
+struct bnx2_irq {
+	irq_handler_t	handler;
+	u16		vector;
+	char		name[16];
+};
+
 struct bnx2 {
 	/* Fields used in the tx and intr/napi performance paths are grouped */
 	/* together in the beginning of the structure. */
@@ -6721,6 +6730,9 @@ struct bnx2 {
 	u32			flash_size;
 
 	int			status_stats_size;
+
+	struct bnx2_irq		irq_tbl[BNX2_MAX_MSIX_VEC];
+	int			irq_nvecs;
 };
 
 static u32 bnx2_reg_rd_ind(struct bnx2 *bp, u32 offset);



[PATCH 3/9][BNX2]: Introduce new bnx2_napi structure.

2007-12-20 Thread Michael Chan

[BNX2]: Introduce new bnx2_napi structure.

Introduce a bnx2_napi structure that will hold a napi_struct and
other fields to handle NAPI polling for the napi_struct.  Various tx
and rx indexes and status block pointers will be moved from the main
bnx2 structure to this bnx2_napi structure.

Most NAPI path functions are modified to be passed this bnx2_napi
struct pointer.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 83cdbde..3f754e6 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -407,12 +407,14 @@ bnx2_disable_int(struct bnx2 *bp)
 static void
 bnx2_enable_int(struct bnx2 *bp)
 {
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
 	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bp->last_status_idx);
+	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bnapi->last_status_idx);
 
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | bp->last_status_idx);
+	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | bnapi->last_status_idx);
 
 	REG_WR(bp, BNX2_HC_COMMAND, bp->hc_cmd | BNX2_HC_COMMAND_COAL_NOW);
 }
@@ -426,11 +428,23 @@ bnx2_disable_int_sync(struct bnx2 *bp)
 }
 
 static void
+bnx2_napi_disable(struct bnx2 *bp)
+{
+	napi_disable(&bp->bnx2_napi.napi);
+}
+
+static void
+bnx2_napi_enable(struct bnx2 *bp)
+{
+	napi_enable(&bp->bnx2_napi.napi);
+}
+
+static void
 bnx2_netif_stop(struct bnx2 *bp)
 {
 	bnx2_disable_int_sync(bp);
 	if (netif_running(bp->dev)) {
-		napi_disable(&bp->napi);
+		bnx2_napi_disable(bp);
 		netif_tx_disable(bp->dev);
 		bp->dev->trans_start = jiffies; /* prevent tx timeout */
 	}
@@ -442,7 +456,7 @@ bnx2_netif_start(struct bnx2 *bp)
 	if (atomic_dec_and_test(&bp->intr_sem)) {
 		if (netif_running(bp->dev)) {
 			netif_wake_queue(bp->dev);
-			napi_enable(&bp->napi);
+			bnx2_napi_enable(bp);
 			bnx2_enable_int(bp);
 		}
 	}
@@ -555,6 +569,8 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
 	memset(bp->status_blk, 0, bp->status_stats_size);
 
+	bp->bnx2_napi.status_blk = bp->status_blk;
+
 	bp->stats_blk = (void *) ((unsigned long) bp->status_blk +
 				  status_blk_size);
 
@@ -2291,9 +2307,9 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 index)
 }
 
 static int
-bnx2_phy_event_is_set(struct bnx2 *bp, u32 event)
+bnx2_phy_event_is_set(struct bnx2 *bp, struct bnx2_napi *bnapi, u32 event)
 {
-	struct status_block *sblk = bp->status_blk;
+	struct status_block *sblk = bnapi->status_blk;
 	u32 new_link_state, old_link_state;
 	int is_set = 1;
 
@@ -2311,24 +2327,24 @@ bnx2_phy_event_is_set(struct bnx2 *bp, u32 event)
 }
 
 static void
-bnx2_phy_int(struct bnx2 *bp)
+bnx2_phy_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
-	if (bnx2_phy_event_is_set(bp, STATUS_ATTN_BITS_LINK_STATE)) {
+	if (bnx2_phy_event_is_set(bp, bnapi, STATUS_ATTN_BITS_LINK_STATE)) {
 		spin_lock(&bp->phy_lock);
 		bnx2_set_link(bp);
 		spin_unlock(&bp->phy_lock);
 	}
-	if (bnx2_phy_event_is_set(bp, STATUS_ATTN_BITS_TIMER_ABORT))
+	if (bnx2_phy_event_is_set(bp, bnapi, STATUS_ATTN_BITS_TIMER_ABORT))
 		bnx2_set_remote_link(bp);
 
 }
 
 static inline u16
-bnx2_get_hw_tx_cons(struct bnx2 *bp)
+bnx2_get_hw_tx_cons(struct bnx2_napi *bnapi)
 {
 	u16 cons;
 
-	cons = bp->status_blk->status_tx_quick_consumer_index0;
+	cons = bnapi->status_blk->status_tx_quick_consumer_index0;
 
 	if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
 		cons++;
@@ -2336,12 +2352,12 @@ bnx2_get_hw_tx_cons(struct bnx2 *bp)
 }
 
 static void
-bnx2_tx_int(struct bnx2 *bp)
+bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
 	u16 hw_cons, sw_cons, sw_ring_cons;
 	int tx_free_bd = 0;
 
-	hw_cons = bnx2_get_hw_tx_cons(bp);
+	hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	sw_cons = bp->tx_cons;
 
 	while (sw_cons != hw_cons) {
@@ -2393,7 +2409,7 @@ bnx2_tx_int(struct bnx2 *bp)
 
 		dev_kfree_skb(skb);
 
-		hw_cons = bnx2_get_hw_tx_cons(bp);
+		hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	}
 
 	bp->hw_tx_cons = hw_cons;
@@ -2584,9 +2600,9 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 }
 
 static inline u16
-bnx2_get_hw_rx_cons(struct bnx2 *bp)
+bnx2_get_hw_rx_cons(struct bnx2_napi *bnapi)
 {
-	u16 cons = bp->status_blk->status_rx_quick_consumer_index0;
+	u16 cons = bnapi->status_blk->status_rx_quick_consumer_index0;
 
 	if (unlikely((cons & MAX_RX_DESC_CNT) == MAX_RX_DESC_CNT))
 		cons++;
@@ -2594,13 +2610,13 @@ bnx2_get_hw_rx_cons(struct bnx2 *bp)
 }
 
 static int

[PATCH 4/9][BNX2]: Move tx indexes into bnx2_napi struct.

2007-12-20 Thread Michael Chan

[BNX2]: Move tx indexes into bnx2_napi struct.

Tx related fields used in NAPI polling are moved from the main
bnx2 struct to the bnx2_napi struct.
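
Since bnx2_tx_avail() now takes the consumer index from the bnx2_napi struct, a small user-space sketch of the arithmetic it performs may help when reading the hunks below. The `SIM_*` names and the function signature are assumptions for illustration; the ring constants follow the "256 indices for 255 entries" comment in the driver.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_TX_DESC_CNT		256u
#define SIM_MAX_TX_DESC_CNT	(SIM_TX_DESC_CNT - 1u)

/*
 * Free tx descriptors, given 16-bit producer and consumer indices and
 * the configured ring size (assumed to be at most 255 in this model).
 */
static uint32_t sim_tx_avail(uint16_t tx_prod, uint16_t tx_cons,
			     uint32_t ring_size)
{
	uint32_t diff = (uint32_t)tx_prod - (uint32_t)tx_cons;

	assert(ring_size <= SIM_MAX_TX_DESC_CNT);

	if (diff >= SIM_TX_DESC_CNT) {
		diff &= 0xffff;			/* undo 16-bit index wrap */
		if (diff == SIM_TX_DESC_CNT)
			diff = SIM_MAX_TX_DESC_CNT;	/* one index is skipped */
	}
	return ring_size - diff;
}
```

The wrap case is the interesting one: with the producer at 2 and the consumer at 0xfffe, four descriptors are in flight even though the plain subtraction is far out of range.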

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 3f754e6..0300a75 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -226,7 +226,7 @@ static struct flash_spec flash_5709 = {
 
 MODULE_DEVICE_TABLE(pci, bnx2_pci_tbl);
 
-static inline u32 bnx2_tx_avail(struct bnx2 *bp)
+static inline u32 bnx2_tx_avail(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
 	u32 diff;
 
@@ -235,7 +235,7 @@ static inline u32 bnx2_tx_avail(struct bnx2 *bp)
 	/* The ring uses 256 indices for 255 entries, one of them
 	 * needs to be skipped.
 	 */
-	diff = bp->tx_prod - bp->tx_cons;
+	diff = bp->tx_prod - bnapi->tx_cons;
 	if (unlikely(diff >= TX_DESC_CNT)) {
 		diff &= 0xffff;
 		if (diff == TX_DESC_CNT)
@@ -2358,7 +2358,7 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 	int tx_free_bd = 0;
 
 	hw_cons = bnx2_get_hw_tx_cons(bnapi);
-	sw_cons = bp->tx_cons;
+	sw_cons = bnapi->tx_cons;
 
 	while (sw_cons != hw_cons) {
 		struct sw_bd *tx_buf;
@@ -2412,8 +2412,8 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 		hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	}
 
-	bp->hw_tx_cons = hw_cons;
-	bp->tx_cons = sw_cons;
+	bnapi->hw_tx_cons = hw_cons;
+	bnapi->tx_cons = sw_cons;
 	/* Need to make the tx_cons update visible to bnx2_start_xmit()
 	 * before checking for netif_queue_stopped().  Without the
 	 * memory barrier, there is a small possibility that bnx2_start_xmit()
@@ -2422,10 +2422,10 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 	smp_mb();
 
 	if (unlikely(netif_queue_stopped(bp->dev)) &&
-		     (bnx2_tx_avail(bp) > bp->tx_wake_thresh)) {
+		     (bnx2_tx_avail(bp, bnapi) > bp->tx_wake_thresh)) {
 		netif_tx_lock(bp->dev);
 		if ((netif_queue_stopped(bp->dev)) &&
-		    (bnx2_tx_avail(bp) > bp->tx_wake_thresh))
+		    (bnx2_tx_avail(bp, bnapi) > bp->tx_wake_thresh))
 			netif_wake_queue(bp->dev);
 		netif_tx_unlock(bp->dev);
 	}
@@ -2846,7 +2846,7 @@ bnx2_has_work(struct bnx2_napi *bnapi)
 	struct status_block *sblk = bp->status_blk;
 
 	if ((bnx2_get_hw_rx_cons(bnapi) != bp->rx_cons) ||
-	    (bnx2_get_hw_tx_cons(bnapi) != bp->hw_tx_cons))
+	    (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons))
 		return 1;
 
 	if ((sblk->status_attn_bits & STATUS_ATTN_EVENTS) !=
@@ -2876,7 +2876,7 @@ static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
 		REG_RD(bp, BNX2_HC_COMMAND);
 	}
 
-	if (bnx2_get_hw_tx_cons(bnapi) != bp->hw_tx_cons)
+	if (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons)
 		bnx2_tx_int(bp, bnapi);
 
 	if (bnx2_get_hw_rx_cons(bnapi) != bp->rx_cons)
@@ -4381,6 +4381,7 @@ bnx2_init_tx_ring(struct bnx2 *bp)
 {
 	struct tx_bd *txbd;
 	u32 cid;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
 
 	bp->tx_wake_thresh = bp->tx_ring_size / 2;
 
@@ -4390,8 +4391,8 @@ bnx2_init_tx_ring(struct bnx2 *bp)
 	txbd->tx_bd_haddr_lo = (u64) bp->tx_desc_mapping & 0xffffffff;
 
 	bp->tx_prod = 0;
-	bp->tx_cons = 0;
-	bp->hw_tx_cons = 0;
+	bnapi->tx_cons = 0;
+	bnapi->hw_tx_cons = 0;
 	bp->tx_prod_bseq = 0;
 
 	cid = TX_CID;
@@ -5440,8 +5441,10 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	u32 len, vlan_tag_flags, last_frag, mss;
 	u16 prod, ring_prod;
 	int i;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
 
-	if (unlikely(bnx2_tx_avail(bp) < (skb_shinfo(skb)->nr_frags + 1))) {
+	if (unlikely(bnx2_tx_avail(bp, bnapi) <
+	    (skb_shinfo(skb)->nr_frags + 1))) {
 		netif_stop_queue(dev);
 		printk(KERN_ERR PFX "%s: BUG! Tx ring full when queue awake!\n",
 			dev->name);
@@ -5556,9 +5559,9 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	bp->tx_prod = prod;
 	dev->trans_start = jiffies;
 
-	if (unlikely(bnx2_tx_avail(bp) <= MAX_SKB_FRAGS)) {
+	if (unlikely(bnx2_tx_avail(bp, bnapi) <= MAX_SKB_FRAGS)) {
 		netif_stop_queue(dev);
-		if (bnx2_tx_avail(bp) > bp->tx_wake_thresh)
+		if (bnx2_tx_avail(bp, bnapi) > bp->tx_wake_thresh)
 			netif_wake_queue(dev);
 	}
 
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 345b6db..958fdda 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6509,6 +6509,9 @@ struct bnx2_napi {
 	struct status_block	*status_blk;
 	u32			last_status_idx;
 	u32			int_num;
+
+	u16			tx_cons;

[PATCH 5/9][BNX2]: Move rx indexes into bnx2_napi struct.

2007-12-20 Thread Michael Chan

[BNX2]: Move rx indexes into bnx2_napi struct.

Rx related fields used in NAPI polling are moved from the main
bnx2 struct to the bnx2_napi struct.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 0300a75..ecfaad1 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2276,7 +2276,7 @@ bnx2_free_rx_page(struct bnx2 *bp, u16 index)
 }
 
 static inline int
-bnx2_alloc_rx_skb(struct bnx2 *bp, u16 index)
+bnx2_alloc_rx_skb(struct bnx2 *bp, struct bnx2_napi *bnapi, u16 index)
 {
 	struct sk_buff *skb;
 	struct sw_bd *rx_buf = &bp->rx_buf_ring[index];
@@ -2301,7 +2301,7 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 index)
 	rxbd->rx_bd_haddr_hi = (u64) mapping >> 32;
 	rxbd->rx_bd_haddr_lo = (u64) mapping & 0xffffffff;
 
-	bp->rx_prod_bseq += bp->rx_buf_use_size;
+	bnapi->rx_prod_bseq += bp->rx_buf_use_size;
 
 	return 0;
 }
@@ -2432,14 +2432,15 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 }
 
 static void
-bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct sk_buff *skb, int count)
+bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct bnx2_napi *bnapi,
+			struct sk_buff *skb, int count)
 {
 	struct sw_pg *cons_rx_pg, *prod_rx_pg;
 	struct rx_bd *cons_bd, *prod_bd;
 	dma_addr_t mapping;
 	int i;
-	u16 hw_prod = bp->rx_pg_prod, prod;
-	u16 cons = bp->rx_pg_cons;
+	u16 hw_prod = bnapi->rx_pg_prod, prod;
+	u16 cons = bnapi->rx_pg_cons;
 
 	for (i = 0; i < count; i++) {
 		prod = RX_PG_RING_IDX(hw_prod);
@@ -2476,12 +2477,12 @@ bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct sk_buff *skb, int count)
 		cons = RX_PG_RING_IDX(NEXT_RX_BD(cons));
 		hw_prod = NEXT_RX_BD(hw_prod);
 	}
-	bp->rx_pg_prod = hw_prod;
-	bp->rx_pg_cons = cons;
+	bnapi->rx_pg_prod = hw_prod;
+	bnapi->rx_pg_cons = cons;
 }
 
 static inline void
-bnx2_reuse_rx_skb(struct bnx2 *bp, struct sk_buff *skb,
+bnx2_reuse_rx_skb(struct bnx2 *bp, struct bnx2_napi *bnapi, struct sk_buff *skb,
 	u16 cons, u16 prod)
 {
 	struct sw_bd *cons_rx_buf, *prod_rx_buf;
@@ -2494,7 +2495,7 @@ bnx2_reuse_rx_skb(struct bnx2 *bp, struct sk_buff *skb,
 		pci_unmap_addr(cons_rx_buf, mapping),
 		bp->rx_offset + RX_COPY_THRESH, PCI_DMA_FROMDEVICE);
 
-	bp->rx_prod_bseq += bp->rx_buf_use_size;
+	bnapi->rx_prod_bseq += bp->rx_buf_use_size;
 
 	prod_rx_buf->skb = skb;
 
@@ -2511,20 +2512,21 @@ bnx2_reuse_rx_skb(struct bnx2 *bp, struct sk_buff *skb,
 }
 
 static int
-bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
-	    unsigned int hdr_len, dma_addr_t dma_addr, u32 ring_idx)
+bnx2_rx_skb(struct bnx2 *bp, struct bnx2_napi *bnapi, struct sk_buff *skb,
+	    unsigned int len, unsigned int hdr_len, dma_addr_t dma_addr,
+	    u32 ring_idx)
 {
 	int err;
 	u16 prod = ring_idx & 0xffff;
 
-	err = bnx2_alloc_rx_skb(bp, prod);
+	err = bnx2_alloc_rx_skb(bp, bnapi, prod);
 	if (unlikely(err)) {
-		bnx2_reuse_rx_skb(bp, skb, (u16) (ring_idx >> 16), prod);
+		bnx2_reuse_rx_skb(bp, bnapi, skb, (u16) (ring_idx >> 16), prod);
 		if (hdr_len) {
 			unsigned int raw_len = len + 4;
 			int pages = PAGE_ALIGN(raw_len - hdr_len) >> PAGE_SHIFT;
 
-			bnx2_reuse_rx_skb_pages(bp, NULL, pages);
+			bnx2_reuse_rx_skb_pages(bp, bnapi, NULL, pages);
 		}
 		return err;
 	}
@@ -2539,8 +2541,8 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 	} else {
 		unsigned int i, frag_len, frag_size, pages;
 		struct sw_pg *rx_pg;
-		u16 pg_cons = bp->rx_pg_cons;
-		u16 pg_prod = bp->rx_pg_prod;
+		u16 pg_cons = bnapi->rx_pg_cons;
+		u16 pg_prod = bnapi->rx_pg_prod;
 
 		frag_size = len + 4 - hdr_len;
 		pages = PAGE_ALIGN(frag_size) >> PAGE_SHIFT;
@@ -2551,9 +2553,10 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 			if (unlikely(frag_len <= 4)) {
 				unsigned int tail = 4 - frag_len;
 
-				bp->rx_pg_cons = pg_cons;
-				bp->rx_pg_prod = pg_prod;
-				bnx2_reuse_rx_skb_pages(bp, NULL, pages - i);
+				bnapi->rx_pg_cons = pg_cons;
+				bnapi->rx_pg_prod = pg_prod;
+				bnx2_reuse_rx_skb_pages(bp, bnapi, NULL,
+							pages - i);
 				skb->len -= tail;
 				if (i == 0) {
 					skb->tail -= tail;
@@ -2579,9 +2582,10 @@ bnx2_rx_skb(struct bnx2 *bp,

[PATCH 6/9][BNX2]: Support multiple MSIX IRQs.

2007-12-20 Thread Michael Chan

[BNX2]: Support multiple MSIX IRQs.

Change bnx2_napi struct into an array and add code to manage multiple
IRQs.  MSIX hardware structures and new registers are also added.
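
The per-vector bookkeeping added here boils down to two small computations: where each vector's status block sits inside the shared allocation, and which value identifies the vector in the interrupt ack register. The sketch below models both in user space. The `SIM_*` names are hypothetical, and the 128-byte spacing is an assumption (BNX2_SBLK_MSIX_ALIGN_SIZE is not shown in this excerpt; the value is only suggested by the BNX2_HC_CONFIG_SB_ADDR_INC_128B flag used in patch 7/9).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_MAX_MSIX_HW_VEC		9u	/* BNX2_MAX_MSIX_HW_VEC in patch 2/9 */
#define SIM_SBLK_MSIX_ALIGN_SIZE	128u	/* assumed spacing between blocks */

/* Byte offset of vector 'vec's status block from the start of the area. */
static size_t sim_sblk_offset(unsigned int vec)
{
	assert(vec < SIM_MAX_MSIX_HW_VEC);
	return (size_t)SIM_SBLK_MSIX_ALIGN_SIZE * vec;
}

/* Value OR-ed into the ack command to select vector 'vec' (vec << 24). */
static uint32_t sim_int_num(unsigned int vec)
{
	assert(vec < SIM_MAX_MSIX_HW_VEC);
	return (uint32_t)vec << 24;
}
```

Under these assumptions vector 0 keeps offset 0 and int_num 0, so the existing single-vector paths behave as before.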

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index ecfaad1..196d053 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -399,44 +399,65 @@ bnx2_write_phy(struct bnx2 *bp, u32 reg, u32 val)
 static void
 bnx2_disable_int(struct bnx2 *bp)
 {
-	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT);
+	int i;
+	struct bnx2_napi *bnapi;
+
+	for (i = 0; i < bp->irq_nvecs; i++) {
+		bnapi = &bp->bnx2_napi[i];
+		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+		       BNX2_PCICFG_INT_ACK_CMD_MASK_INT);
+	}
 	REG_RD(bp, BNX2_PCICFG_INT_ACK_CMD);
 }
 
 static void
 bnx2_enable_int(struct bnx2 *bp)
 {
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	int i;
+	struct bnx2_napi *bnapi;
 
-	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bnapi->last_status_idx);
+	for (i = 0; i < bp->irq_nvecs; i++) {
+		bnapi = &bp->bnx2_napi[i];
 
-	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | bnapi->last_status_idx);
+		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+		       BNX2_PCICFG_INT_ACK_CMD_MASK_INT |
+		       bnapi->last_status_idx);
 
+		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+		       bnapi->last_status_idx);
+	}
 	REG_WR(bp, BNX2_HC_COMMAND, bp->hc_cmd | BNX2_HC_COMMAND_COAL_NOW);
 }
 
 static void
 bnx2_disable_int_sync(struct bnx2 *bp)
 {
+	int i;
+
 	atomic_inc(&bp->intr_sem);
 	bnx2_disable_int(bp);
-	synchronize_irq(bp->pdev->irq);
+	for (i = 0; i < bp->irq_nvecs; i++)
+		synchronize_irq(bp->irq_tbl[i].vector);
 }
 
 static void
 bnx2_napi_disable(struct bnx2 *bp)
 {
-	napi_disable(&bp->bnx2_napi.napi);
+	int i;
+
+	for (i = 0; i < bp->irq_nvecs; i++)
+		napi_disable(&bp->bnx2_napi[i].napi);
 }
 
 static void
 bnx2_napi_enable(struct bnx2 *bp)
 {
-	napi_enable(&bp->bnx2_napi.napi);
+	int i;
+
+	for (i = 0; i < bp->irq_nvecs; i++)
+		napi_enable(&bp->bnx2_napi[i].napi);
 }
 
 static void
@@ -559,6 +580,9 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
 	/* Combine status and statistics blocks into one allocation. */
 	status_blk_size = L1_CACHE_ALIGN(sizeof(struct status_block));
+	if (bp->flags & MSIX_CAP_FLAG)
+		status_blk_size = L1_CACHE_ALIGN(BNX2_MAX_MSIX_HW_VEC *
+						 BNX2_SBLK_MSIX_ALIGN_SIZE);
 	bp->status_stats_size = status_blk_size +
 				sizeof(struct statistics_block);
 
@@ -569,7 +593,17 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
 	memset(bp->status_blk, 0, bp->status_stats_size);
 
-	bp->bnx2_napi.status_blk = bp->status_blk;
+	bp->bnx2_napi[0].status_blk = bp->status_blk;
+	if (bp->flags & MSIX_CAP_FLAG) {
+		for (i = 1; i < BNX2_MAX_MSIX_VEC; i++) {
+			struct bnx2_napi *bnapi = &bp->bnx2_napi[i];
+
+			bnapi->status_blk = (void *)
+				((unsigned long) bp->status_blk +
+				 BNX2_SBLK_MSIX_ALIGN_SIZE * i);
+			bnapi->int_num = i << 24;
+		}
+	}
 
 	bp->stats_blk = (void *) ((unsigned long) bp->status_blk +
 				  status_blk_size);
@@ -2767,7 +2801,7 @@ bnx2_msi(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
 	prefetch(bnapi->status_blk);
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
@@ -2788,7 +2822,7 @@ bnx2_msi_1shot(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
 	prefetch(bnapi->status_blk);
 
@@ -2806,7 +2840,7 @@ bnx2_interrupt(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 	struct status_block *sblk = bnapi->status_blk;
 
 	/* When using INTx, it is possible for the interrupt to arrive
@@ -2911,7 +2945,7 @@ static int bnx2_poll(struct napi_struct *napi, int budget)

[PATCH 7/9][BNX2]: Add support for a new tx ring.

2007-12-20 Thread Michael Chan

[BNX2]: Add support for a new tx ring.

To separate TX IRQs into a different MSIX vector, we need to
support a new tx ring.  The original tx ring will still be used
when not using MSIX.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 196d053..a4ed6ca 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2378,7 +2378,10 @@ bnx2_get_hw_tx_cons(struct bnx2_napi *bnapi)
 {
u16 cons;
 
-   cons = bnapi->status_blk->status_tx_quick_consumer_index0;
+   if (bnapi->int_num == 0)
+           cons = bnapi->status_blk->status_tx_quick_consumer_index0;
+   else
+           cons = bnapi->status_blk_msix->status_tx_quick_consumer_index;
 
    if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
            cons++;
@@ -2389,7 +2392,6 @@ static void
 bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
u16 hw_cons, sw_cons, sw_ring_cons;
-   int tx_free_bd = 0;
 
hw_cons = bnx2_get_hw_tx_cons(bnapi);
    sw_cons = bnapi->tx_cons;
@@ -2439,8 +2441,6 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 
sw_cons = NEXT_TX_BD(sw_cons);
 
-   tx_free_bd += last + 1;
-
dev_kfree_skb(skb);
 
hw_cons = bnx2_get_hw_tx_cons(bnapi);
@@ -4369,6 +4369,24 @@ bnx2_init_chip(struct bnx2 *bp)
  BNX2_HC_CONFIG_COLLECT_STATS;
}
 
+   if (bp->flags & USING_MSIX_FLAG) {
+           REG_WR(bp, BNX2_HC_MSIX_BIT_VECTOR,
+                  BNX2_HC_MSIX_BIT_VECTOR_VAL);
+
+           REG_WR(bp, BNX2_HC_SB_CONFIG_1,
+                  BNX2_HC_SB_CONFIG_1_TX_TMR_MODE |
+                  BNX2_HC_SB_CONFIG_1_ONE_SHOT);
+
+           REG_WR(bp, BNX2_HC_TX_QUICK_CONS_TRIP_1,
+                  (bp->tx_quick_cons_trip_int << 16) |
+                   bp->tx_quick_cons_trip);
+
+           REG_WR(bp, BNX2_HC_TX_TICKS_1,
+                  (bp->tx_ticks_int << 16) | bp->tx_ticks);
+
+           val |= BNX2_HC_CONFIG_SB_ADDR_INC_128B;
+   }
+
    if (bp->flags & ONE_SHOT_MSI_FLAG)
            val |= BNX2_HC_CONFIG_ONE_SHOT;
 
@@ -4401,6 +4419,25 @@ bnx2_init_chip(struct bnx2 *bp)
 }
 
 static void
+bnx2_clear_ring_states(struct bnx2 *bp)
+{
+   struct bnx2_napi *bnapi;
+   int i;
+
+   for (i = 0; i < BNX2_MAX_MSIX_VEC; i++) {
+           bnapi = &bp->bnx2_napi[i];
+
+           bnapi->tx_cons = 0;
+           bnapi->hw_tx_cons = 0;
+           bnapi->rx_prod_bseq = 0;
+           bnapi->rx_prod = 0;
+           bnapi->rx_cons = 0;
+           bnapi->rx_pg_prod = 0;
+           bnapi->rx_pg_cons = 0;
+   }
+}
+
+static void
 bnx2_init_tx_context(struct bnx2 *bp, u32 cid)
 {
u32 val, offset0, offset1, offset2, offset3;
@@ -4433,8 +4470,17 @@ static void
 bnx2_init_tx_ring(struct bnx2 *bp)
 {
struct tx_bd *txbd;
-   u32 cid;
-   struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
+   u32 cid = TX_CID;
+   struct bnx2_napi *bnapi;
+
+   bp->tx_vec = 0;
+   if (bp->flags & USING_MSIX_FLAG) {
+           cid = TX_TSS_CID;
+           bp->tx_vec = BNX2_TX_VEC;
+           REG_WR(bp, BNX2_TSCH_TSS_CFG, BNX2_TX_INT_NUM |
+                  (TX_TSS_CID << 7));
+   }
+   bnapi = &bp->bnx2_napi[bp->tx_vec];
 
    bp->tx_wake_thresh = bp->tx_ring_size / 2;
 
@@ -4444,11 +4490,8 @@ bnx2_init_tx_ring(struct bnx2 *bp)
    txbd->tx_bd_haddr_lo = (u64) bp->tx_desc_mapping & 0xffffffff;
 
    bp->tx_prod = 0;
-   bnapi->tx_cons = 0;
-   bnapi->hw_tx_cons = 0;
    bp->tx_prod_bseq = 0;
 
-   cid = TX_CID;
    bp->tx_bidx_addr = MB_GET_CID_ADDR(cid) + BNX2_L2CTX_TX_HOST_BIDX;
    bp->tx_bseq_addr = MB_GET_CID_ADDR(cid) + BNX2_L2CTX_TX_HOST_BSEQ;
 
@@ -4487,12 +4530,6 @@ bnx2_init_rx_ring(struct bnx2 *bp)
u32 val, rx_cid_addr = GET_CID_ADDR(RX_CID);
    struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
-   bnapi->rx_prod = 0;
-   bnapi->rx_cons = 0;
-   bnapi->rx_prod_bseq = 0;
-   bnapi->rx_pg_prod = 0;
-   bnapi->rx_pg_cons = 0;
-
    bnx2_init_rxbd_rings(bp->rx_desc_ring, bp->rx_desc_mapping,
                         bp->rx_buf_use_size, bp->rx_max_ring);
 
@@ -4694,6 +4731,7 @@ bnx2_reset_nic(struct bnx2 *bp, u32 reset_code)
if ((rc = bnx2_init_chip(bp)) != 0)
return rc;
 
+   bnx2_clear_ring_states(bp);
bnx2_init_tx_ring(bp);
bnx2_init_rx_ring(bp);
return 0;
@@ -4965,7 +5003,11 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
struct sw_bd *rx_buf;
struct l2_fhdr *rx_hdr;
int ret = -ENODEV;
-   struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
+   struct bnx2_napi *bnapi = &bp->bnx2_napi[0], *tx_napi;
+
+   tx_napi = bnapi;
+   if (bp->flags & USING_MSIX_FLAG)
+           tx_napi = &bp->bnx2_napi[BNX2_TX_VEC];
 
if (loopback_mode == BNX2_MAC_LOOPBACK) {

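The consumer-index adjustment in bnx2_get_hw_tx_cons() can be sketched on its own. This assumes 256 descriptors per ring page, with the last slot of each page reserved as a link to the next page; both the count and the helper name adjust_tx_cons are illustrative assumptions, not taken from this patch.

```c
#include <stdint.h>

/* Assumed ring geometry: 256 descriptors per page, last one is a link. */
#define TX_DESC_CNT     256u
#define MAX_TX_DESC_CNT (TX_DESC_CNT - 1u)

/* If the hardware index points at the reserved last slot of a page,
 * step past it so software never treats the link slot as a packet. */
static uint16_t adjust_tx_cons(uint16_t cons)
{
	if ((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT)
		cons++;
	return cons;
}
```

The patch only changes where the raw index is read from (the base status block for vector 0, the MSIX status block otherwise); the adjustment itself is the same for both.
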
[PATCH 8/9][BNX2]: Enable new tx ring.

2007-12-20 Thread Michael Chan

[BNX2]: Enable new tx ring.

Enable new tx ring and add new MSIX handler and NAPI poll function
for the new tx ring.  Enable MSIX when the hardware supports it.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index a4ed6ca..3745fc8 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -598,7 +598,7 @@ bnx2_alloc_mem(struct bnx2 *bp)
            for (i = 1; i < BNX2_MAX_MSIX_VEC; i++) {
                    struct bnx2_napi *bnapi = &bp->bnx2_napi[i];
 
-                   bnapi->status_blk = (void *)
+                   bnapi->status_blk_msix = (void *)
                            ((unsigned long) bp->status_blk +
                             BNX2_SBLK_MSIX_ALIGN_SIZE * i);
                    bnapi->int_num = i << 24;
@@ -2388,10 +2388,11 @@ bnx2_get_hw_tx_cons(struct bnx2_napi *bnapi)
return cons;
 }
 
-static void
-bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
+static int
+bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 {
u16 hw_cons, sw_cons, sw_ring_cons;
+   int tx_pkt = 0;
 
hw_cons = bnx2_get_hw_tx_cons(bnapi);
    sw_cons = bnapi->tx_cons;
@@ -2442,6 +2443,9 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
sw_cons = NEXT_TX_BD(sw_cons);
 
dev_kfree_skb(skb);
+   tx_pkt++;
+   if (tx_pkt == budget)
+   break;
 
hw_cons = bnx2_get_hw_tx_cons(bnapi);
}
@@ -2463,6 +2467,7 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
                    netif_wake_queue(bp->dev);
            netif_tx_unlock(bp->dev);
}
+   return tx_pkt;
 }
 
 static void
@@ -2875,6 +2880,23 @@ bnx2_interrupt(int irq, void *dev_instance)
return IRQ_HANDLED;
 }
 
+static irqreturn_t
+bnx2_tx_msix(int irq, void *dev_instance)
+{
+   struct net_device *dev = dev_instance;
+   struct bnx2 *bp = netdev_priv(dev);
+   struct bnx2_napi *bnapi = &bp->bnx2_napi[BNX2_TX_VEC];
+
+   prefetch(bnapi->status_blk_msix);
+
+   /* Return here if interrupt is disabled. */
+   if (unlikely(atomic_read(&bp->intr_sem) != 0))
+           return IRQ_HANDLED;
+
+   netif_rx_schedule(dev, &bnapi->napi);
+   return IRQ_HANDLED;
+}
+
 #define STATUS_ATTN_EVENTS (STATUS_ATTN_BITS_LINK_STATE | \
 STATUS_ATTN_BITS_TIMER_ABORT)
 
@@ -2895,6 +2917,29 @@ bnx2_has_work(struct bnx2_napi *bnapi)
return 0;
 }
 
+static int bnx2_tx_poll(struct napi_struct *napi, int budget)
+{
+   struct bnx2_napi *bnapi = container_of(napi, struct bnx2_napi, napi);
+   struct bnx2 *bp = bnapi->bp;
+   int work_done = 0;
+   struct status_block_msix *sblk = bnapi->status_blk_msix;
+
+   do {
+           work_done += bnx2_tx_int(bp, bnapi, budget - work_done);
+           if (unlikely(work_done >= budget))
+                   return work_done;
+
+           bnapi->last_status_idx = sblk->status_idx;
+           rmb();
+   } while (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons);
+
+   netif_rx_complete(bp->dev, napi);
+   REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+          BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+          bnapi->last_status_idx);
+   return work_done;
+}
+
 static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
  int work_done, int budget)
 {
@@ -2916,7 +2961,7 @@ static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
    }
 
    if (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons)
-           bnx2_tx_int(bp, bnapi);
+           bnx2_tx_int(bp, bnapi, 0);
 
    if (bnx2_get_hw_rx_cons(bnapi) != bnapi->rx_cons)
            work_done += bnx2_rx_int(bp, bnapi, budget - work_done);
@@ -5399,10 +5444,35 @@ bnx2_free_irq(struct bnx2 *bp)
 static void
 bnx2_enable_msix(struct bnx2 *bp)
 {
+   int i, rc;
+   struct msix_entry msix_ent[BNX2_MAX_MSIX_VEC];
+
bnx2_setup_msix_tbl(bp);
REG_WR(bp, BNX2_PCI_MSIX_CONTROL, BNX2_MAX_MSIX_HW_VEC - 1);
REG_WR(bp, BNX2_PCI_MSIX_TBL_OFF_BIR, BNX2_PCI_GRC_WINDOW2_BASE);
REG_WR(bp, BNX2_PCI_MSIX_PBA_OFF_BIT, BNX2_PCI_GRC_WINDOW3_BASE);
+
+   for (i = 0; i < BNX2_MAX_MSIX_VEC; i++) {
+           msix_ent[i].entry = i;
+           msix_ent[i].vector = 0;
+   }
+
+   rc = pci_enable_msix(bp->pdev, msix_ent, BNX2_MAX_MSIX_VEC);
+   if (rc != 0)
+           return;
+
+   bp->irq_tbl[BNX2_BASE_VEC].handler = bnx2_msi_1shot;
+   bp->irq_tbl[BNX2_TX_VEC].handler = bnx2_tx_msix;
+
+   strcpy(bp->irq_tbl[BNX2_BASE_VEC].name, bp->dev->name);
+   strcat(bp->irq_tbl[BNX2_BASE_VEC].name, "-base");
+   strcpy(bp->irq_tbl[BNX2_TX_VEC].name, bp->dev->name);
+   strcat(bp->irq_tbl[BNX2_TX_VEC].name, "-tx");
+
+   bp->irq_nvecs = BNX2_MAX_MSIX_VEC;
+   bp->flags |= USING_MSIX_FLAG |
[PATCH 9/9][BNX2]: Update version to 1.7.1.

2007-12-20 Thread Michael Chan

[BNX2]: Update version to 1.7.1.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 08b0349..69a3ce3 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -56,8 +56,8 @@
 
 #define DRV_MODULE_NAME    "bnx2"
 #define PFX DRV_MODULE_NAME ": "
-#define DRV_MODULE_VERSION "1.7.0"
-#define DRV_MODULE_RELDATE "December 11, 2007"
+#define DRV_MODULE_VERSION "1.7.1"
+#define DRV_MODULE_RELDATE "December 19, 2007"
 
 #define RUN_AT(x) (jiffies + (x))
 



Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Parag Warudkar

On Dec 20, 2007 2:22 PM, Kok, Auke [EMAIL PROTECTED] wrote:
> ok, that's just bad and if there's no user-definable limit to the deferral I
> definitely don't like this change.
>
> Can I safely assume that any irq will cause all deferred timers to run?

I think even other causes for wakeup like process related ones will
cause the CPU to go busy and run the timers.
This, coupled with the fact that no one is yet able to reach 0 wakeups
per second makes it pretty unlikely that deferrable timers will be
deferred indefinitely.


> If this is the case then for e1000 this patch is still OK since the watchdog
> needs to run (1) after a link up/down interrupt or (2) to update statistics.
> Those statistics won't increase if there is no traffic of course...


I think it is reasonable for Network driver watchdogs to use a
deferrable timer - if the machine is 100% IDLE there is no one needing
the network to be up. If there is something running even on the other
CPU - that is going to cause an IPI, reschedule, TLB invalidation etc.
which will make it very likely in practice that each CPU will be
interrupted in reasonable amount of time.

Of course there are theoretical cases where we could land into a
situation where a CPU in a multiprocessor machine is IDLE infinitely
and that causes the watchdog that happens to be bound to run on the
same CPU to not run. To take care of these unlikely cases I think the
timer mechanism should have a reasonable limit on how long a CPU can
go IDLE if there are deferrable timers.

Parag

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Arjan van de Ven


Kok, Auke wrote:
> ok, that's just bad and if there's no user-definable limit to the deferral I
> definitely don't like this change.
>
> Can I safely assume that any irq will cause all deferred timers to run?

*on that cpu*. Timers are per cpu, as are interrupts. Just not per se the same
one ...


Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Arjan van de Ven


Parag Warudkar wrote:
> On Dec 20, 2007 2:22 PM, Kok, Auke [EMAIL PROTECTED] wrote:
> > ok, that's just bad and if there's no user-definable limit to the deferral I
> > definitely don't like this change.
> >
> > Can I safely assume that any irq will cause all deferred timers to run?
>
> I think even other causes for wakeup like process related ones will
> cause the CPU to go busy and run the timers.
> This, coupled with the fact that no one is yet able to reach 0 wakeups
> per second makes it pretty unlikely that deferrable timers will be
> deferred indefinitely.

0.8 is easy on single core today.
multicore just increases how idle you can be for a given core.

> > If this is the case then for e1000 this patch is still OK since the watchdog
> > needs to run (1) after a link up/down interrupt or (2) to update statistics.
> > Those statistics won't increase if there is no traffic of course...
>
> I think it is reasonable for Network driver watchdogs to use a
> deferrable timer - if the machine is 100% IDLE there is no one needing
> the network to be up. If there is something running even on the other
> CPU - that is going to cause an IPI, reschedule, TLB invalidation etc.
> which will make it very likely in practice that each CPU will be
> interrupted in reasonable amount of time.

this is not correct; many machines are idle waiting for network data. Think of
webservers...

> Of course there are theoretical cases where we could land into a
> situation where a CPU in a multiprocessor machine is IDLE infinitely
> and that causes the watchdog that happens to be bound to run on the
> same CPU to not run. To take care of these unlikely cases I think the
> timer mechanism should have a reasonable limit on how long a CPU can
> go IDLE if there are deferrable timers.

how about something else instead: a timer mechanism that takes a range instead..
that at least has defined semantics; the deferrable semantics really are
indefinite.
Let's keep at least the semantics clear and clean.

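The range idea suggested above can be illustrated with a small model. This is a sketch under stated assumptions, not an existing kernel interface: the function name range_timer_fire and the representation of CPU wakeups as a sorted array of timestamps are both hypothetical.

```c
#include <stddef.h>

/* Hypothetical model of a range timer: it may run at the first CPU
 * wakeup that falls inside [earliest, latest]; if no wakeup lands in
 * that window, it forces one at `latest`. `wakeups` must be sorted in
 * ascending order. Returns the time at which the timer runs. */
static long range_timer_fire(long earliest, long latest,
			     const long *wakeups, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (wakeups[i] < earliest)
			continue;
		if (wakeups[i] <= latest)
			return wakeups[i];	/* piggyback on an existing wakeup */
		break;				/* sorted: nothing later can fit */
	}
	return latest;				/* deferral is bounded */
}
```

Unlike a purely deferrable timer, the worst-case delay here is known in advance: it is latest minus earliest.
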

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Krzysztof Oledzki




On Thu, 20 Dec 2007, Parag Warudkar wrote:

> On Dec 20, 2007 2:22 PM, Kok, Auke [EMAIL PROTECTED] wrote:
> > ok, that's just bad and if there's no user-definable limit to the deferral I
> > definitely don't like this change.
> >
> > Can I safely assume that any irq will cause all deferred timers to run?
>
> I think even other causes for wakeup like process related ones will
> cause the CPU to go busy and run the timers.
> This, coupled with the fact that no one is yet able to reach 0 wakeups
> per second makes it pretty unlikely that deferrable timers will be
> deferred indefinitely.
>
> > If this is the case then for e1000 this patch is still OK since the watchdog
> > needs to run (1) after a link up/down interrupt or (2) to update statistics.
> > Those statistics won't increase if there is no traffic of course...
>
> I think it is reasonable for Network driver watchdogs to use a
> deferrable timer - if the machine is 100% IDLE there is no one needing
> the network to be up.

Please note that being connected to a network does not only mean to send
but also to receive.


Best regards,

Krzysztof Oledzki

Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly

2007-12-20 Thread Jarek Poplawski

Satoru SATOH wrote, On 12/20/2007 05:21 PM:

> i see. HZ can be  1000.. i should be wrong.
>
> however, i got the following,
>
> [root iproute2.org]# ./ip/ip route change 192.168.140.0/24 dev eth1 rto_min 4s
> [root iproute2.org]# gdb -q ./ip/ip

...

> (gdb) p hz
> $1 = 10

That's why I had some doubts! I didn't study this enough, but my
(older) version definitely showed hz == 100. Maybe I'm wrong, but
looking into lib/util.c it seems this could be set differently
depending on system's configuration (or even kernel version).

So, probably this patch could sometimes work even for HZ  1000,
but since it's your patch, I hope you do some additional checking
if it's always like this...

Cheers,
Jarek P.

Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly

2007-12-20 Thread Jarek Poplawski

Jarek Poplawski wrote, On 12/20/2007 09:24 PM:
...

> but since it's your patch, I hope you do some additional checking
> if it's always like this...


...or maybe only changing this all a little bit will make it look safer!

Jarek P.

[PATCH v2 2/4] [POWERPC][NET] ucc_geth_mii and users: get rid of device_type

2007-12-20 Thread Anton Vorontsov

device_type property is bogus, thus use proper compatible.

Also change compatible property to "fsl,ucc-mdio".

Per http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048388.html

Signed-off-by: Anton Vorontsov [EMAIL PROTECTED]
---
 arch/powerpc/boot/dts/mpc832x_mds.dts |3 +--
 arch/powerpc/boot/dts/mpc832x_rdb.dts |3 +--
 arch/powerpc/boot/dts/mpc836x_mds.dts |3 +--
 arch/powerpc/boot/dts/mpc8568mds.dts  |2 +-
 drivers/net/ucc_geth_mii.c|3 +++
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/boot/dts/mpc832x_mds.dts b/arch/powerpc/boot/dts/mpc832x_mds.dts
index 588d658..8844d30 100644
--- a/arch/powerpc/boot/dts/mpc832x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc832x_mds.dts
@@ -255,8 +255,7 @@
    #address-cells = <1>;
    #size-cells = <0>;
    reg = <2320 18>;
-   device_type = "mdio";
-   compatible = "ucc_geth_phy";
+   compatible = "fsl,ucc-mdio";
 
    phy3: [EMAIL PROTECTED] {
            interrupt-parent = < &ipic >;
diff --git a/arch/powerpc/boot/dts/mpc832x_rdb.dts b/arch/powerpc/boot/dts/mpc832x_rdb.dts
index 719f375..a7a2e45 100644
--- a/arch/powerpc/boot/dts/mpc832x_rdb.dts
+++ b/arch/powerpc/boot/dts/mpc832x_rdb.dts
@@ -236,8 +236,7 @@
    #address-cells = <1>;
    #size-cells = <0>;
    reg = <3120 18>;
-   device_type = "mdio";
-   compatible = "ucc_geth_phy";
+   compatible = "fsl,ucc-mdio";
 
    phy00:[EMAIL PROTECTED] {
            interrupt-parent = <&pic>;
diff --git a/arch/powerpc/boot/dts/mpc836x_mds.dts b/arch/powerpc/boot/dts/mpc836x_mds.dts
index 8d7124e..5f0b427 100644
--- a/arch/powerpc/boot/dts/mpc836x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc836x_mds.dts
@@ -288,8 +288,7 @@
    #address-cells = <1>;
    #size-cells = <0>;
    reg = <2120 18>;
-   device_type = "mdio";
-   compatible = "ucc_geth_phy";
+   compatible = "fsl,ucc-mdio";
 
    phy0: [EMAIL PROTECTED] {
            interrupt-parent = < &ipic >;
diff --git a/arch/powerpc/boot/dts/mpc8568mds.dts b/arch/powerpc/boot/dts/mpc8568mds.dts
index 89add8d..ea70010 100644
--- a/arch/powerpc/boot/dts/mpc8568mds.dts
+++ b/arch/powerpc/boot/dts/mpc8568mds.dts
@@ -356,7 +356,7 @@
    #address-cells = <1>;
    #size-cells = <0>;
    reg = <2120 18>;
-   compatible = "ucc_geth_phy";
+   compatible = "fsl,ucc-mdio";
 
/* These are the same PHYs as on
 * gianfar's MDIO bus */
diff --git a/drivers/net/ucc_geth_mii.c b/drivers/net/ucc_geth_mii.c
index df884f0..e3ba14a 100644
--- a/drivers/net/ucc_geth_mii.c
+++ b/drivers/net/ucc_geth_mii.c
@@ -256,6 +256,9 @@ static struct of_device_id uec_mdio_match[] = {
            .type = "mdio",
            .compatible = "ucc_geth_phy",
    },
+   {
+           .compatible = "fsl,ucc-mdio",
+   },
{},
 };
 
-- 
1.5.2.2
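
Why the match table needs a second entry can be shown with a simplified model of the matching. The structures below are stand-ins, not the kernel's struct of_device_id or struct device_node, and the real code compares against a list of compatible strings rather than a single one; the names match_id, node, entry_matches and match_node are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for a match table entry and a device tree node. */
struct match_id { const char *type; const char *compatible; };
struct node     { const char *type; const char *compatible; };

/* An entry matches when every field it sets equals the node's value. */
static int entry_matches(const struct match_id *id, const struct node *np)
{
	if (id->type && (!np->type || strcmp(id->type, np->type) != 0))
		return 0;
	if (id->compatible &&
	    (!np->compatible || strcmp(id->compatible, np->compatible) != 0))
		return 0;
	return 1;
}

/* Walk a table terminated by an empty entry, like uec_mdio_match[]. */
static const struct match_id *match_node(const struct match_id *tbl,
					 const struct node *np)
{
	for (; tbl->type || tbl->compatible; tbl++)
		if (entry_matches(tbl, np))
			return tbl;
	return NULL;
}
```

In this model a node that has dropped device_type no longer satisfies the old entry, which still requires the type "mdio", so only the new compatible-only entry can bind it. Keeping the old entry lets older device trees continue to match.
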


Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Parag Warudkar

On Dec 20, 2007 3:04 PM, Arjan van de Ven [EMAIL PROTECTED] wrote:
> > I think it is reasonable for Network driver watchdogs to use a
> > deferrable timer - if the machine is 100% IDLE there is no one needing
> > the network to be up. If there is something running even on the other
> > CPU - that is going to cause an IPI, reschedule, TLB invalidation etc.
> > which will make it very likely in practice that each CPU will be
> > interrupted in reasonable amount of time.
>
> this is not correct; many machines are idle waiting for network data. Think
> of webservers...

Yes, I forgot the receive case. So if a server was 100% IDLE and a web
server was listening for network data and we reach 0 wakeups per
second on the CPU where the network watchdog timer is scheduled to run
deferred _and_ the network link went down, it would cause the watchdog
to not run and redo the link until some one else wakes up that CPU
later.
So as long as we make sure we don't convert every timer to deferrable
we should be ok - may be this can be resolved easily by having a
non-deferrable dont-allow-deferring-for-too-long timer on each CPU
that just causes at least one wake up in some reasonable time delta
from the previous wakeup (whoever caused that one.) It is still
beneficial in that all deferrable timers would run at once without
needing to have separate wakeup for each.


 
> > Of course there are theoretical cases where we could land into a
> > situation where a CPU in a multiprocessor machine is IDLE infinitely
> > and that causes the watchdog that happens to be bound to run on the
> > same CPU to not run. To take care of these unlikely cases I think the
> > timer mechanism should have a reasonable limit on how long a CPU can
> > go IDLE if there are deferrable timers.
>
> how about something else instead: a timer mechanism that takes a range
> instead..
> that at least has defined semantics; the deferrable semantics really are
> indefinite.
> Lets keep at least the semantics clear and clean.


Would not the simpler solution of installing a non-deferrable timer
per cpu which will not allow the CPU to go IDLE for more than x units
of time at once  (or something to that effect) work? Range would
complicate the thing and I am not sure how many cases will know
reasonably correct range for their normal operation. In this instance
of the e1000 watchdog what range could it give and be successful at
what it wants to do - bring up the link in reasonable amount of time,
while also realizing the power savings?

Perhaps depending on Server/Laptop/Desktop machine (may be based on
Preemption) we could have normal or deferrable timers but that'll
exclude Servers from power savings and I am not sure Data center folks
will like that :) .

Parag
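
The per-CPU cap proposed above can be put into numbers with a rough model. It assumes an otherwise idle CPU whose only wakeups come from a hypothetical heartbeat every `period` ticks, starting at time 0; the helper names are made up for illustration.

```c
#include <stddef.h>

/* Wakeups needed when every timer wakes the CPU itself: one per
 * distinct expiry time. `expiry` must be sorted in ascending order. */
static size_t wakeups_plain(const long *expiry, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (i == 0 || expiry[i] != expiry[i - 1])
			count++;
	return count;
}

/* With the heartbeat, a deferrable timer runs at the first heartbeat
 * at or after its expiry, so it is late by less than one period.
 * Assumes expiry >= 0 and period > 0. */
static long heartbeat_fire(long expiry, long period)
{
	return ((expiry + period - 1) / period) * period;
}
```

All timers that expire within the same period run together on one wakeup, which is the batching benefit described above; the cost is a delay of up to one period for each of them.
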

Re: After many hours all outbound connections get stuck in SYN_SENT

2007-12-20 Thread Ilpo Järvinen

On Thu, 20 Dec 2007, James Nichols wrote:

> > I still dont understand.
> >
> > tcpdump -p -n -s 1600 -c 1 doesnt reveal User data at all.
> >
> > Without any exact data from you, I am afraid nobody can help.
>
> Oh, I didn't see that you specified specific options.  I'll still have
> to anonymize 2000+ IP addresses, but I think there is an open source
> tool that will do this for you.

Even a simple for loop in shell can do that. It's not that hard and
there's very little need for manual work! Ingredients: for, cut, grep
and sed.


-- 
 i.

Re: After many hours all outbound connections get stuck in SYN_SENT

2007-12-20 Thread Ilpo Järvinen

On Thu, 20 Dec 2007, James Nichols wrote:

> > You'd probably should also investigate the Linux kernel,
> > especially the size and locks of the components of the Sack data
> > structures and what happens to those data structures after Sack is
> > disabled (presumably the Sack data structure is in some unhappy
> > circumstance, and disabling Sack allows the data to be discarded,
> > magically unclogging the box).

...Not sure if you want now to invent such structure. Yes, we have per skb
->sacked but again in SYN_SENT there are very few things that touch it at
all, and they just set it to zero (though it would not even be mandatory
for tcp_transmit_skb, IIRC, checked that just couple of days ago due to
other things).

Another thing is the rx_opt.sack_ok which is just couple flag bits that
tell the TCP variant in use (and it's mostly used only after SYN handshake
completes). The rest (the actual SACK blocks) is in the ack_skb but again
it has very little meaning in SYN_SENT state unless somebody is crazy
enough to add SACK blocks to SYN-ACKs :-).

> > In the absence of the reporter wanting to dump the kernel's
> > core, how about a patch to print the Sack datastructure when
> > the command to disable Sack is received by the kernel?
> > Maybe just print the last 16b of the IP address?
>
> Given the fact that I've had this problem for so long, over a variety
> of networking hardware vendors and colo-facilities, this really sounds
> good to me.  It will be challenging for me to justify a kernel core
> dump, but a simple patch to dump the Sack data would be do-able.

If your symptoms really are: SYNs leaving (if they show up in tcpdump, for
sure they've left TCP code already) and SYN-ACK not showing up even in
something as early as in tcpdump (for sure TCP side code didn't execute at
that point yet), there's very little chance that Linux' TCP code has some
bug in it, only things that do something in such scenario are the SYN
generation and retransmitting SYNs (and those are trivially verifiable
from tcpdump).


-- 
 i.

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Stephen Hemminger

On Thu, 20 Dec 2007 15:36:13 -0500
Parag Warudkar [EMAIL PROTECTED] wrote:

> On Dec 20, 2007 3:04 PM, Arjan van de Ven [EMAIL PROTECTED] wrote:
> > > I think it is reasonable for Network driver watchdogs to use a
> > > deferrable timer - if the machine is 100% IDLE there is no one needing
> > > the network to be up. If there is something running even on the other
> > > CPU - that is going to cause an IPI, reschedule, TLB invalidation etc.
> > > which will make it very likely in practice that each CPU will be
> > > interrupted in reasonable amount of time.
> >
> > this is not correct; many machines are idle waiting for network data. Think
> > of webservers...
>
> Yes, I forgot the receive case. So if a server was 100% IDLE and a web
> server was listening for network data and we reach 0 wakeups per
> second on the CPU where the network watchdog timer is scheduled to run
> deferred _and_ the network link went down, it would cause the watchdog
> to not run and redo the link until some one else wakes up that CPU
> later.
> So as long as we make sure we don't convert every timer to deferrable
> we should be ok - may be this can be resolved easily by having a
> non-deferrable dont-allow-deferring-for-too-long timer on each CPU
> that just causes at least one wake up in some reasonable time delta
> from the previous wakeup (whoever caused that one.) It is still
> beneficial in that all deferrable timers would run at once without
> needing to have separate wakeup for each.
>
> > > Of course there are theoretical cases where we could land into a
> > > situation where a CPU in a multiprocessor machine is IDLE infinitely
> > > and that causes the watchdog that happens to be bound to run on the
> > > same CPU to not run. To take care of these unlikely cases I think the
> > > timer mechanism should have a reasonable limit on how long a CPU can
> > > go IDLE if there are deferrable timers.
> >
> > how about something else instead: a timer mechanism that takes a range
> > instead..
> > that at least has defined semantics; the deferrable semantics really are
> > indefinite.
> > Lets keep at least the semantics clear and clean.
>
> Would not the simpler solution of installing a non-deferrable timer
> per cpu which will not allow the CPU to go IDLE for more than x units
> of time at once  (or something to that effect) work? Range would
> complicate the thing and I am not sure how many cases will know
> reasonably correct range for their normal operation. In this instance
> of the e1000 watchdog what range could it give and be successful at
> what it wants to do - bring up the link in reasonable amount of time,
> while also realizing the power savings?
>
> Perhaps depending on Server/Laptop/Desktop machine (may be based on
> Preemption) we could have normal or deferrable timers but that'll
> exclude Servers from power savings and I am not sure Data center folks
> will like that :) .
>
> Parag


The problem is that on a server the receiver will go deaf if the chip
bug that the watchdog is looking for triggers.  Yes, no packets in
and it happily will just sit there.

So for now, I am not going to apply your simple patch and work on a 
two stage timer per arjan's suggestion for a later release.

-- 
Stephen Hemminger [EMAIL PROTECTED]

Re: After many hours all outbound connections get stuck in SYN_SENT

2007-12-20 Thread Justin Banks

James Nichols wrote
> > I still dont understand.
> >
> > tcpdump -p -n -s 1600 -c 1 doesnt reveal User data at all.
> >
> > Without any exact data from you, I am afraid nobody can help.
>
> Oh, I didn't see that you specified specific options.  I'll still have
> to anonymize 2000+ IP addresses, but I think there is an open source
> tool that will do this for you.


tcpdump -p -n -s 1600 -c 1 | perl -pe 's/(\d+\.\d+\.\d+\.\d+)/HIDE.THIS.IP.ADDR/g'

-justinb

-- 
Justin Banks
BakBone Software
[EMAIL PROTECTED]
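
The one-liner above replaces every address with the same token, so distinct hosts can no longer be told apart in the trace. A minimal sketch of an alternative, assuming each address should map to a stable placeholder such as host0, host1, and so on; all names here (ipmap, anonymize, quad_len) are hypothetical and the map is a simple linear table.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_HOSTS 4096
#define ADDR_LEN  16	/* "255.255.255.255" plus the terminator */

/* Table of addresses seen so far; declare it static, it is 64 KB. */
struct ipmap {
	char addr[MAX_HOSTS][ADDR_LEN];
	int  count;
};

/* Stable index for addr, adding it on first sight; -1 if the map is full. */
static int ipmap_id(struct ipmap *m, const char *addr)
{
	int i;

	for (i = 0; i < m->count; i++)
		if (strcmp(m->addr[i], addr) == 0)
			return i;
	if (m->count == MAX_HOSTS)
		return -1;
	strcpy(m->addr[m->count], addr);	/* caller ensures it fits */
	return m->count++;
}

/* Length of a dotted quad (\d+\.\d+\.\d+\.\d+) starting at s, else 0. */
static size_t quad_len(const char *s)
{
	const char *p = s;
	int group;

	for (group = 0; group < 4; group++) {
		if (!isdigit((unsigned char)*p))
			return 0;
		while (isdigit((unsigned char)*p))
			p++;
		if (group < 3 && *p++ != '.')
			return 0;
	}
	return (size_t)(p - s);
}

/* Copy `in` to `out`, replacing each IPv4 address with "hostN". */
static void anonymize(struct ipmap *m, const char *in, char *out, size_t outsz)
{
	const char *p = in;
	size_t o = 0;

	if (outsz == 0)
		return;
	while (*p && o + 1 < outsz) {
		/* Only try to match at the start of a run of digits. */
		int at_start = (p == in) || !isdigit((unsigned char)p[-1]);
		size_t n = at_start ? quad_len(p) : 0;

		if (n > 0 && n < ADDR_LEN) {
			char addr[ADDR_LEN];
			int w;

			memcpy(addr, p, n);
			addr[n] = '\0';
			w = snprintf(out + o, outsz - o, "host%d",
				     ipmap_id(m, addr));
			if (w < 0)
				break;
			if ((size_t)w >= outsz - o)
				return;	/* truncated; already terminated */
			o += (size_t)w;
			p += n;
		} else {
			out[o++] = *p++;
		}
	}
	out[o] = '\0';
}
```

As with the regular expression, the port that tcpdump -n appends after the address is left untouched.
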

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Kok, Auke

Stephen Hemminger wrote:
 On Thu, 20 Dec 2007 15:36:13 -0500
 Parag Warudkar [EMAIL PROTECTED] wrote:
 
 On Dec 20, 2007 3:04 PM, Arjan van de Ven [EMAIL PROTECTED] wrote:
 I think it is reasonable for Network driver watchdogs to use a
 deferrable timer - if the machine is 100% IDLE there is no one needing
 the network to be up. If there is something running even on the other
 CPU - that is going to cause an IPI, reschedule, TLB invalidation etc.
 which will make it very likely in practice that each CPU will be
 interrupted in reasonable amount of time.
 this is not correct; many machines are idle waiting for network data. Think 
 of webservers...
 Yes, I forgot the receive case. So if a server was 100% IDLE and a web
 server was listening for network data and we reach 0 wakeups per
 second on the CPU where the network watchdog timer is scheduled to run
 deferred _and_ the network link went down, it would cause the watchdog
 to not run and redo the link until some one else wakes up that CPU
 later.
 So as long as we make sure we don't convert every timer to deferrable
 we should be ok - may be this can be resolved easily by having a
 non-deferrable dont-allow-deferring-for-too-long timer on each CPU
 that just causes at least one wake up in some reasonable time delta
 from the previous wakeup (whoever caused that one.) It is still
 beneficial in that all deferrable timers would run at once without
 needing to have separate wakeup for each.

 Of course there are theoretical cases where we could land into a
 situation where a CPU in a multiprocessor machine is IDLE infinitely
 and that causes the watchdog that happens to be bound to run on the
 same CPU to not run. To take care of these unlikely cases I think the
 timer mechanism should have a reasonable limit on how long a CPU can
 go IDLE if there are deferrable timers.
 how about something else instead: a timer mechanism that takes a range 
 instead..
 that at least has defined semantics; the deferrable semantics really are 
 indefinite.
 Lets keep at least the semantics clear and clean.

 Would not the simpler solution of installing a non-deferrable timer
 per cpu which will not allow the CPU to go IDLE for more than x units
 of time at once  (or something to that effect) work? Range would
 complicate the thing and I am not sure how many cases will know
 reasonably correct range for their normal operation. In this instance
 of the e1000 watchdog what range could it give and be successful at
 what it wants to do - bring up the link in a reasonable amount of time,
 while also realizing the power savings?

 Perhaps depending on Server/Laptop/Desktop machine (maybe based on
 Preemption) we could have normal or deferrable timers but that'll
 exclude Servers from power savings and I am not sure Data center folks
 will like that :) .

 Parag
 
 
 The problem is that on a server the receiver will go deaf if the chip
 bug that the watchdog is looking for triggers.  Yes, no packets in
 and it happily will just sit there.
 
 So for now, I am not going to apply your simple patch and work on a 
 two stage timer per arjan's suggestion for a later release.

I also think that's the right way to go for now. I'll ask Jeff to hold off on
the two patches for now.

Auke
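
For readers following the thread, the "don't let a CPU stay idle too long while deferrable timers are pending" idea can be sketched in plain userspace C. This is only an illustration of the proposal as discussed, not kernel code: the struct, the NO_TIMER constant, next_wakeup() and the tick values are all hypothetical, and the cap is applied in the simplest possible way (sleep at most max_idle ticks whenever any deferrable timer is queued).

```c
#include <stdint.h>

/* Hypothetical marker meaning "no timer of this kind is queued". */
#define NO_TIMER UINT64_MAX

/* Hypothetical per-CPU summary of pending timers; all times are in ticks. */
struct cpu_timer_view {
	uint64_t next_hard;       /* earliest non-deferrable expiry, or NO_TIMER */
	uint64_t next_deferrable; /* earliest deferrable expiry, or NO_TIMER */
	uint64_t max_idle;        /* longest sleep allowed while deferrable timers wait */
};

/* Pick the tick at which a CPU that goes idle at 'now' has to wake up. */
uint64_t next_wakeup(const struct cpu_timer_view *v, uint64_t now)
{
	uint64_t wake = v->next_hard;
	uint64_t cap;

	/* With no deferrable timer queued, only hard timers matter. */
	if (v->next_deferrable == NO_TIMER)
		return wake;

	/* Otherwise cap the sleep at max_idle ticks (saturating add). */
	cap = (now > UINT64_MAX - v->max_idle) ? UINT64_MAX : now + v->max_idle;
	return cap < wake ? cap : wake;
}
```

With max_idle = 100 and a deferrable watchdog queued, a CPU that goes idle at tick 1000 with no other timers would be woken at tick 1100 at the latest, at which point all expired deferrable timers could run together.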


Re: [PATCH 1/2] e1000e: Use deferrable timer for watchdog

2007-12-20 Thread Kok, Auke

Auke Kok wrote:
 From: Parag Warudkar [EMAIL PROTECTED]
 
 Reduce wakeups from idle per second.
 
 Signed-off-by: Parag Warudkar [EMAIL PROTECTED]
 Signed-off-by: Auke Kok [EMAIL PROTECTED]
 ---


Jeff,

given the discussion with Stephen I'd like to skip merging this patch and the
e1000 one for now. The unforeseen implications of this are just not controlled
enough and we need to guarantee some limit of deferral first.

Auke
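
The alternative raised in the discussion, a timer that takes a range and therefore has a guaranteed limit of deferral, might look roughly like the following userspace sketch. Nothing here is an existing kernel interface: struct range_timer, both function names and the tick values in the note afterwards are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range timer: run no earlier than 'earliest', no later than 'latest'. */
struct range_timer {
	uint64_t earliest; /* first tick at which the callback may run */
	uint64_t latest;   /* tick by which the callback must have run */
};

/*
 * Decide whether the timer should run at tick 'now'.
 * cpu_awake says whether the CPU is already awake for some other reason.
 */
bool range_timer_should_fire(const struct range_timer *t, uint64_t now,
			     bool cpu_awake)
{
	/* The upper bound is hard: past it the timer runs regardless. */
	if (now >= t->latest)
		return true;
	/* Inside the range, piggyback only on an existing wakeup. */
	return cpu_awake && now >= t->earliest;
}

/* Tick at which an idle CPU must wake: the smallest 'latest' of all timers. */
uint64_t range_timer_idle_deadline(const struct range_timer *timers, size_t n)
{
	uint64_t deadline = UINT64_MAX; /* no timers: sleep indefinitely */
	size_t i;

	for (i = 0; i < n; i++)
		if (timers[i].latest < deadline)
			deadline = timers[i].latest;
	return deadline;
}
```

A watchdog registered with the range [2000, 2500] would run at tick 2000 if the CPU happens to be awake then, and would force a wakeup at tick 2500 if it is not, which gives the deferral a defined upper bound.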

[PATCH 0/3] XFRM audit fixes/additions for net-2.6.25

2007-12-20 Thread Paul Moore

Three patches backed against net-2.6.25 from today.  Some of the audit
messages are a little difficult to test by their nature but I've verified
that I'm still able to send/receive IPsec protected traffic with the patches
applied.

The first patch was posted before but David decided it best to split the
patch so some parts could be pulled into 2.6.24; the patch was split and
the 2.6.24 bits were accepted (the SPI byteorder fix) so patch #1 in the
series is what is left for 2.6.25.

The second patch was posted before as an RFC patch without anyone complaining
too loudly.  Eric Paris made some suggestions about better handling of the
op= audit field and I've tried to take that into account with this patch.
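
As a rough picture of the op= handling: in patch 2/3 the start helper takes the operation name and writes the op= field first, and a separate helper appends the auid= and subj= fields. The userspace sketch below only mimics that ordering with a fixed-size string buffer; the demo_* names, the buffer size and the sample values are hypothetical and are not the kernel audit API.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an audit buffer: a bounded text record. */
struct demo_audit_buf {
	char text[256];
	size_t len;
};

/* Append printf-style text to the record, truncating if it is full. */
void demo_audit_format(struct demo_audit_buf *ab, const char *fmt, ...)
{
	size_t room;
	va_list ap;
	int n;

	if (ab->len >= sizeof(ab->text) - 1)
		return;
	room = sizeof(ab->text) - ab->len;
	va_start(ap, fmt);
	n = vsnprintf(ab->text + ab->len, room, fmt, ap);
	va_end(ap);
	if (n < 0)
		return;
	/* vsnprintf reports the untruncated length; clamp to what fit. */
	ab->len = ((size_t)n >= room) ? sizeof(ab->text) - 1 : ab->len + (size_t)n;
}

/* Start a record; the op= field always comes first. */
void demo_audit_start(struct demo_audit_buf *ab, const char *op)
{
	ab->len = 0;
	ab->text[0] = '\0';
	demo_audit_format(ab, "op=%s", op);
}

/* Append the user-related fields; subj may be NULL if no label is known. */
void demo_audit_usrinfo(struct demo_audit_buf *ab, unsigned int auid,
			const char *subj)
{
	demo_audit_format(ab, " auid=%u", auid);
	if (subj != NULL)
		demo_audit_format(ab, " subj=%s", subj);
}
```

Calling demo_audit_start(&ab, "SPD-add"), then demo_audit_usrinfo(&ab, 500, "demo_t"), then demo_audit_format(&ab, " res=%u", 1u) builds a single record with the fields in that order.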

The final patch is the audit replay counter overflow issue fix that has been
talked about on netdev.  This sounded like the best course of action from the
discussion but if I'm wrong, just drop this patch and I'll cook up something
else to solve the problem.

Thanks.

--
paul moore
linux security @ hp


[PATCH 3/3] XFRM: Drop packets when replay counter would overflow

2007-12-20 Thread Paul Moore

According to RFC4303, section 3.3.3 we need to drop outgoing packets which
cause the replay counter to overflow:

   3.3.3.  Sequence Number Generation

   The sender's counter is initialized to 0 when an SA is established.
   The sender increments the sequence number (or ESN) counter for this
   SA and inserts the low-order 32 bits of the value into the Sequence
   Number field.  Thus, the first packet sent using a given SA will
   contain a sequence number of 1.

   If anti-replay is enabled (the default), the sender checks to ensure
   that the counter has not cycled before inserting the new value in the
   Sequence Number field.  In other words, the sender MUST NOT send a
   packet on an SA if doing so would cause the sequence number to cycle.
   An attempt to transmit a packet that would result in sequence number
   overflow is an auditable event.  The audit log entry for this event
   SHOULD include the SPI value, current date/time, Source Address,
   Destination Address, and (in IPv6) the cleartext Flow ID.

Signed-off-by: Paul Moore [EMAIL PROTECTED]
---

 net/xfrm/xfrm_output.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index ebb..284eeef 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -57,8 +57,11 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
 
 		if (x->type->flags & XFRM_TYPE_REPLAY_PROT) {
 			XFRM_SKB_CB(skb)->seq = ++x->replay.oseq;
-			if (unlikely(x->replay.oseq == 0))
+			if (unlikely(x->replay.oseq == 0)) {
+				x->replay.oseq--;
 				xfrm_audit_state_replay_overflow(x, skb);
+				goto error;
+			}
 			if (xfrm_aevent_is_on())
 				xfrm_replay_notify(x, XFRM_REPLAY_UPDATE);
 		}
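
The counter behaviour this change aims for can be modelled outside the kernel as follows. This is a minimal sketch, assuming a 32-bit counter as in the hunk above; struct demo_sa, demo_sa_next_seq() and the overflow_events field (a stand-in for the audit call) are hypothetical names and not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one SA's outbound replay state. */
struct demo_sa {
	uint32_t oseq;                /* last sequence number used, 0 on a new SA */
	unsigned int overflow_events; /* stand-in for the audit log entry */
};

/*
 * Reserve the next sequence number for an outgoing packet.
 * Returns true and stores it in *seq, or returns false if sending
 * would make the counter cycle, in which case the packet must be dropped.
 */
bool demo_sa_next_seq(struct demo_sa *sa, uint32_t *seq)
{
	if (++sa->oseq == 0) {
		/* Undo the wrap so every later packet is refused as well. */
		sa->oseq--;
		sa->overflow_events++;
		return false;
	}
	*seq = sa->oseq;
	return true;
}
```

Stepping the counter back matters: without it the counter would sit at 0 after the wrap and the next packet would go out with sequence number 1 again, which is exactly the cycling that section 3.3.3 forbids.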


[PATCH 1/3] XFRM: Assorted IPsec fixups

2007-12-20 Thread Paul Moore

This patch fixes a number of small but potentially troublesome things in the
XFRM/IPsec code:

 * Use the 'audit_enabled' variable already in include/linux/audit.h
   Removed the need for extern declarations local to each XFRM audit function

 * Convert 'sid' to 'secid' everywhere we can
   The 'sid' name is specific to SELinux, 'secid' is the common naming
   convention used by the kernel when referring to tokenized LSM labels,
   unfortunately we have to leave 'ctx_sid' in 'struct xfrm_sec_ctx' otherwise
   we risk breaking userspace

 * Convert address display to use standard NIP* macros
   Similar to what was recently done with the SPD audit code, this also
   includes the removal of some unnecessary memcpy() calls

 * Move common code to xfrm_audit_common_stateinfo()
   Code consolidation from the less is more book on software development

 * Proper spacing around commas in function arguments
   Minor style tweak since I was already touching the code
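
To show why address display macros can make a memcpy() unnecessary, here is a small userspace sketch that prints an IPv4 address straight from its network-order bytes. The DEMO_NIPQUAD macros and demo_format_ipv4() are hypothetical stand-ins written for this note; they are not the kernel's own NIP* definitions.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for a dotted-quad format string and argument list. */
#define DEMO_NIPQUAD_FMT "%u.%u.%u.%u"
#define DEMO_NIPQUAD(addr) \
	(unsigned int)((const unsigned char *)&(addr))[0], \
	(unsigned int)((const unsigned char *)&(addr))[1], \
	(unsigned int)((const unsigned char *)&(addr))[2], \
	(unsigned int)((const unsigned char *)&(addr))[3]

/*
 * Format a network-byte-order IPv4 address into buf.
 * The bytes are read in place, so no temporary copy is needed.
 * Returns what snprintf() returns.
 */
int demo_format_ipv4(char *buf, size_t len, uint32_t be_addr)
{
	return snprintf(buf, len, DEMO_NIPQUAD_FMT, DEMO_NIPQUAD(be_addr));
}
```

Because the macro indexes the bytes of the stored value directly, the output is the same on little-endian and big-endian hosts as long as the value is kept in network byte order.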

Signed-off-by: Paul Moore [EMAIL PROTECTED]
---

 include/net/xfrm.h |   14 ++---
 net/xfrm/xfrm_policy.c |   15 ++
 net/xfrm/xfrm_state.c  |   53 
 3 files changed, 36 insertions(+), 46 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 32b99e2..ac6cf09 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -548,7 +548,7 @@ struct xfrm_audit
 };
 
 #ifdef CONFIG_AUDITSYSCALL
-static inline struct audit_buffer *xfrm_audit_start(u32 auid, u32 sid)
+static inline struct audit_buffer *xfrm_audit_start(u32 auid, u32 secid)
 {
 	struct audit_buffer *audit_buf = NULL;
 	char *secctx;
@@ -561,8 +561,8 @@ static inline struct audit_buffer *xfrm_audit_start(u32 auid, u32 sid)
 
 	audit_log_format(audit_buf, "auid=%u", auid);
 
-	if (sid != 0 &&
-	    security_secid_to_secctx(sid, &secctx, &secctx_len) == 0) {
+	if (secid != 0 &&
+	    security_secid_to_secctx(secid, &secctx, &secctx_len) == 0) {
 		audit_log_format(audit_buf, " subj=%s", secctx);
 		security_release_secctx(secctx, secctx_len);
 	} else
@@ -571,13 +571,13 @@ static inline struct audit_buffer *xfrm_audit_start(u32 auid, u32 sid)
 }
 
 extern void xfrm_audit_policy_add(struct xfrm_policy *xp, int result,
-				  u32 auid, u32 sid);
+				  u32 auid, u32 secid);
 extern void xfrm_audit_policy_delete(struct xfrm_policy *xp, int result,
-				     u32 auid, u32 sid);
+				     u32 auid, u32 secid);
 extern void xfrm_audit_state_add(struct xfrm_state *x, int result,
-				 u32 auid, u32 sid);
+				 u32 auid, u32 secid);
 extern void xfrm_audit_state_delete(struct xfrm_state *x, int result,
-				    u32 auid, u32 sid);
+				    u32 auid, u32 secid);
 #else
 #define xfrm_audit_policy_add(x, r, a, s)	do { ; } while (0)
 #define xfrm_audit_policy_delete(x, r, a, s)	do { ; } while (0)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d2084b1..c8f0656 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -24,6 +24,7 @@
 #include <linux/netfilter.h>
 #include <linux/module.h>
 #include <linux/cache.h>
+#include <linux/audit.h>
 #include <net/dst.h>
 #include <net/xfrm.h>
 #include <net/ip.h>
@@ -2317,15 +2318,14 @@ static inline void xfrm_audit_common_policyinfo(struct xfrm_policy *xp,
 	}
 }
 
-void
-xfrm_audit_policy_add(struct xfrm_policy *xp, int result, u32 auid, u32 sid)
+void xfrm_audit_policy_add(struct xfrm_policy *xp, int result,
+			   u32 auid, u32 secid)
 {
 	struct audit_buffer *audit_buf;
-	extern int audit_enabled;
 
 	if (audit_enabled == 0)
 		return;
-	audit_buf = xfrm_audit_start(sid, auid);
+	audit_buf = xfrm_audit_start(auid, secid);
 	if (audit_buf == NULL)
 		return;
 	audit_log_format(audit_buf, " op=SPD-add res=%u", result);
@@ -2334,15 +2334,14 @@ xfrm_audit_policy_add(struct xfrm_policy *xp, int result, u32 auid, u32 sid)
 }
 EXPORT_SYMBOL_GPL(xfrm_audit_policy_add);
 
-void
-xfrm_audit_policy_delete(struct xfrm_policy *xp, int result, u32 auid, u32 sid)
+void xfrm_audit_policy_delete(struct xfrm_policy *xp, int result,
+			      u32 auid, u32 secid)
 {
 	struct audit_buffer *audit_buf;
-	extern int audit_enabled;
 
 	if (audit_enabled == 0)
 		return;
-	audit_buf = xfrm_audit_start(sid, auid);
+	audit_buf = xfrm_audit_start(auid, secid);
 	if (audit_buf == NULL)
 		return;
 	audit_log_format(audit_buf, " op=SPD-delete res=%u", result);
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 95df01c..dd38e6f 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -19,6 +19,7 @@
 #include

[PATCH 2/3] XFRM: RFC4303 compliant auditing

2007-12-20 Thread Paul Moore

This patch adds a number of new IPsec audit events to meet the auditing
requirements of RFC4303.  This includes audit hooks for the following events:

 * Could not find a valid SA [sections 2.1, 3.4.2]
   . xfrm_audit_state_notfound()
   . xfrm_audit_state_notfound_simple()

 * Sequence number overflow [section 3.3.3]
   . xfrm_audit_state_replay_overflow()

 * Replayed packet [section 3.4.3]
   . xfrm_audit_state_replay()

 * Integrity check failure [sections 3.4.4.1, 3.4.4.2]
   . xfrm_audit_state_icvfail()

While RFC4303 deals only with ESP, most of the changes in this patch apply to
IPsec in general, i.e. both AH and ESP.  In the one case where ESP-specific
code had to be modified, integrity check failure, the same change was made to
the AH code for the sake of consistency.

Signed-off-by: Paul Moore [EMAIL PROTECTED]
---

 include/net/xfrm.h |   33 --
 net/ipv4/ah4.c |4 +
 net/ipv4/esp4.c|1 
 net/ipv6/ah6.c |2 -
 net/ipv6/esp6.c|1 
 net/ipv6/xfrm6_input.c |4 +
 net/xfrm/xfrm_input.c  |6 +-
 net/xfrm/xfrm_output.c |2 +
 net/xfrm/xfrm_policy.c |   14 ++--
 net/xfrm/xfrm_state.c  |  153 +++-
 10 files changed, 184 insertions(+), 36 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index ac6cf09..941d5cd 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -548,26 +548,33 @@ struct xfrm_audit
 };
 
 #ifdef CONFIG_AUDITSYSCALL
-static inline struct audit_buffer *xfrm_audit_start(u32 auid, u32 secid)
+static inline struct audit_buffer *xfrm_audit_start(const char *op)
 {
 	struct audit_buffer *audit_buf = NULL;
-	char *secctx;
-	u32 secctx_len;
 
+	if (audit_enabled == 0)
+		return NULL;
 	audit_buf = audit_log_start(current->audit_context, GFP_ATOMIC,
-			      AUDIT_MAC_IPSEC_EVENT);
+				    AUDIT_MAC_IPSEC_EVENT);
 	if (audit_buf == NULL)
 		return NULL;
+	audit_log_format(audit_buf, "op=%s", op);
+	return audit_buf;
+}
 
-	audit_log_format(audit_buf, "auid=%u", auid);
+static inline void xfrm_audit_helper_usrinfo(u32 auid, u32 secid,
+					     struct audit_buffer *audit_buf)
+{
+	char *secctx;
+	u32 secctx_len;
 
+	audit_log_format(audit_buf, " auid=%u", auid);
 	if (secid != 0 &&
 	    security_secid_to_secctx(secid, &secctx, &secctx_len) == 0) {
 		audit_log_format(audit_buf, " subj=%s", secctx);
 		security_release_secctx(secctx, secctx_len);
 	} else
 		audit_log_task_context(audit_buf);
-	return audit_buf;
 }
 
 extern void xfrm_audit_policy_add(struct xfrm_policy *xp, int result,
@@ -578,11 +585,22 @@ extern void xfrm_audit_state_add(struct xfrm_state *x, int result,
 				 u32 auid, u32 secid);
 extern void xfrm_audit_state_delete(struct xfrm_state *x, int result,
 				    u32 auid, u32 secid);
+extern void xfrm_audit_state_replay_overflow(struct xfrm_state *x,
+					     struct sk_buff *skb);
+extern void xfrm_audit_state_notfound_simple(struct sk_buff *skb, u16 family);
+extern void xfrm_audit_state_notfound(struct sk_buff *skb, u16 family,
+				      __be32 net_spi, __be32 net_seq);
+extern void xfrm_audit_state_icvfail(struct xfrm_state *x,
+				     struct sk_buff *skb, u8 proto);
 #else
 #define xfrm_audit_policy_add(x, r, a, s)	do { ; } while (0)
 #define xfrm_audit_policy_delete(x, r, a, s)	do { ; } while (0)
 #define xfrm_audit_state_add(x, r, a, s)	do { ; } while (0)
 #define xfrm_audit_state_delete(x, r, a, s)	do { ; } while (0)
+#define xfrm_audit_state_replay_overflow(x, s)	do { ; } while (0)
+#define xfrm_audit_state_notfound_simple(s, f)	do { ; } while (0)
+#define xfrm_audit_state_notfound(s, f, sp, sq)	do { ; } while (0)
+#define xfrm_audit_state_icvfail(x, s, p)	do { ; } while (0)
 #endif /* CONFIG_AUDITSYSCALL */
 
 static inline void xfrm_pol_hold(struct xfrm_policy *policy)
@@ -1193,7 +1211,8 @@ extern int xfrm_state_delete(struct xfrm_state *x);
 extern int xfrm_state_flush(u8 proto, struct xfrm_audit *audit_info);
 extern void xfrm_sad_getinfo(struct xfrmk_sadinfo *si);
 extern void xfrm_spd_getinfo(struct xfrmk_spdinfo *si);
-extern int xfrm_replay_check(struct xfrm_state *x, __be32 seq);
+extern int xfrm_replay_check(struct xfrm_state *x,
+struct sk_buff *skb, __be32 seq);
 extern void xfrm_replay_advance(struct xfrm_state *x, __be32 seq);
 extern void xfrm_replay_notify(struct xfrm_state *x, int event);
 extern int xfrm_state_mtu(struct xfrm_state *x, int mtu);
diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index d76803a..ec8de0a 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -179,8 +179,10 @@ static int ah_input(struct

Re: [IPSEC]: Rename tunnel-mode functions to avoid collisions with tunnels

2007-12-20 Thread David Miller

From: Herbert Xu [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 14:38:33 +0800

 [IPSEC]: Rename tunnel-mode functions to avoid collisions with tunnels

 It appears that I've managed to create two different functions both
 called xfrm6_tunnel_output.  This is because we have the plain tunnel
 encapsulation named xfrmX_tunnel as well as the tunnel-mode encapsulation
 which lives in the files xfrmX_mode_tunnel.c.

 This patch renames functions from the latter to use the xfrmX_mode_tunnel
 prefix to avoid name-space conflicts.

 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied, thanks Herbert.

Re: [PATCH] include/net/: Spelling fixes

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 11:40:25 -0800

 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied.

Re: [PATCH] net/dccp/: Spelling fixes

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 11:40:30 -0800

 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied.

Re: [PATCH] net/irda/: Spelling fixes

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 11:40:33 -0800

 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied.

Re: [PATCH] net/ipv6/: Spelling fixes

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 11:40:32 -0800

 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied.

Re: [PATCH] net/core/: Spelling fixes

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 11:40:29 -0800

 
 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied.

Re: [PATCH] net/sched/: Spelling fixes

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 11:40:36 -0800

 
 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied, thanks.

Re: [PATCH] net/netlabel/: Spelling fixes

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 11:40:35 -0800

 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied, thanks.

Re: [PATCH] net/sctp/: Spelling fixes

2007-12-20 Thread David Miller

From: Joe Perches [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 11:40:37 -0800

 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied.
