Re: Linux TCP in the presence of delays or drops...

2006-07-31 Thread Oumer Teyeb

Hi David,

My intention when I wrote the second mail was just to provide some more
examples that further elaborate my first question. But as you noticed, I
couldn't resist the temptation to slip a couple of new questions into
the new post :-(... Sorry, and I will take your advice into consideration
in my future postings.


Thanks for the tip!!

Regards,
Oumer

David Miller wrote:

> From: Oumer Teyeb <[EMAIL PROTECTED]>
> Date: Mon, 31 Jul 2006 19:49:28 +0200
>
> > It would be so great if some of you could spare a few minutes and take a
> > look at the traces I provided. See below for the original posting...
>
> If people are too backlogged and busy to reply to your original
> posting, you will only ensure that it will take even longer by
> bombarding the list with even more information and questions on
> top of your original large query.
>
> Just be patient.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-31 Thread Rusty Russell
On Mon, 2006-07-31 at 21:47 -0700, David Miller wrote:
> From: Rusty Russell <[EMAIL PROTECTED]>
> Date: Fri, 28 Jul 2006 15:54:04 +1000
> 
> > (1) I am imagining some Grand Unified Flow Cache (Olsson trie?) that
> > holds (some subset of?) flows.  A successful lookup immediately after
> > packet comes off NIC gives destiny for packet: what route, (optionally)
> > what socket, what filtering, what connection tracking (& what NAT), etc?
> > I don't know if this should be a general array of fn & data ptrs, or
> > specialized fields for each one, or a mix.  Maybe there's a "too hard,
> > do slow path" bit, or maybe hard cases just never get put in the cache.
> > Perhaps we need a separate one for locally-generated packets, a-la
> > ip_route_output().  Anyway, we trade slightly more expensive flow setup
> > for faster packet processing within flows.
> 
> So, specifically, one of the methods you are thinking about might
> be implemented by adding:
> 
>   void (*input)(struct sk_buff *, void *);
>   void *input_data;
> 
> to "struct flow_cache_entry" or whatever replaces it?

Probably needs a return value to indicate stop packet processing, and to
be completely general I think we'd want more than one, eg:

#define MAX_GUFC_INPUTS 5
unsigned int num_inputs;
int (*input[MAX_GUFC_INPUTS])(struct sk_buff *, void *);
void *input_data[MAX_GUFC_INPUTS];

> This way we don't need some kind of "type" information in
> the flow cache entry, since the input handler knows the type.

Some things may want to jam more than a pointer into the cache entry, so
we might do something clever later, but as a first cut this would seem
to work.
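
A minimal userspace sketch of the entry layout discussed above, assuming handlers return a "stop" verdict to end processing. The `gufc_entry` name, the `GUFC_CONTINUE`/`GUFC_STOP` values, the helper functions and the stub `struct sk_buff` are all illustrative assumptions, not existing kernel code:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GUFC_INPUTS 5

/* Stand-in for the kernel's struct sk_buff; only what this sketch needs. */
struct sk_buff {
	unsigned int len;
};

/* Hypothetical verdicts: the thread only says a return value is needed. */
enum gufc_verdict { GUFC_CONTINUE = 0, GUFC_STOP = 1 };

struct gufc_entry {
	unsigned int num_inputs;
	int (*input[MAX_GUFC_INPUTS])(struct sk_buff *, void *);
	void *input_data[MAX_GUFC_INPUTS];
};

/* Append a handler; returns 0 on success, -1 if the entry is full. */
int gufc_add_input(struct gufc_entry *e,
		   int (*fn)(struct sk_buff *, void *), void *data)
{
	if (e->num_inputs >= MAX_GUFC_INPUTS)
		return -1;
	e->input[e->num_inputs] = fn;
	e->input_data[e->num_inputs] = data;
	e->num_inputs++;
	return 0;
}

/* Run handlers in order until one says stop; returns how many ran. */
unsigned int gufc_run(const struct gufc_entry *e, struct sk_buff *skb)
{
	unsigned int i;

	assert(e->num_inputs <= MAX_GUFC_INPUTS);
	for (i = 0; i < e->num_inputs; i++) {
		if (e->input[i](skb, e->input_data[i]) == GUFC_STOP)
			return i + 1;
	}
	return i;
}

/* Example handler: account bytes, let processing continue. */
int count_bytes(struct sk_buff *skb, void *data)
{
	unsigned long *bytes = data;

	*bytes += skb->len;
	return GUFC_CONTINUE;
}

/* Example handler: terminate processing for this packet. */
int drop_skb(struct sk_buff *skb, void *data)
{
	(void)skb;
	(void)data;
	return GUFC_STOP;
}
```

With `count_bytes`, `drop_skb`, `count_bytes` registered in that order, `gufc_run()` executes only the first two handlers, which is the behaviour the return value is meant to allow.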

> > One way to do this is to add a "have_interest" callback into the
> > hook_ops, which takes each about-to-be-inserted GUFC entry and adds any
> > destinies this hook cares about.  In the case of packet filtering this
> > would do a traversal and append a fn/data ptr to the entry for each rule
> > which could affect it.
> 
> Can you give a concrete example of how the GUFC might make use
> of this?  Just some small abstract code snippets will do.

OK, I take it back.  I was thinking that on a miss, the GUFC called into
each subsystem to populate the new GUFC entry.  That would be a radical
departure from the current code, so forget it.

So, on a GUFC miss, we could create a new GUFC entry (on stack?), hang
it off the skb, then each subsystem adds to it as we go through.  At
some point (handwave?) we collect the skb->gufc and insert it into the
trie.

For iptables, as a first step we'd simply do (open-coded for now):

/* FIXME: Do acceleration properly */
struct gufc *gufc = skb->gufc;

if (!gufc || gufc->num_inputs == MAX_GUFC_INPUTS) {
	/* No entry, or no room left: give up on caching this flow. */
	skb->gufc = NULL;
} else {
	gufc->input[gufc->num_inputs] = traverse_entire_table;
	gufc->input_data[gufc->num_inputs++] = this_table;
}

Later we'd get funky:

/* Filtering code here */
...

if (num_rules_applied > 1 || !only_needed_flow_info) {
	gufc->input[gufc->num_inputs] = traverse_entire_table;
	gufc->input_data[gufc->num_inputs++] = this_table;
} else if (num_rules_applied == 1) {
	gufc->input[gufc->num_inputs] = traverse_one_rule;
	gufc->input_data[gufc->num_inputs++] = last_rule;
}

Note that this could be cleverer, too:

if (result == NF_DROP && only_needed_flow_info) {
	/* Who cares about other inputs, we're going to drop */
	gufc->input[0] = drop_skb;
	gufc->num_inputs = 1;
}

Two potential performance issues: 

1) When we change rules, iptables replaces entire table from userspace.
We need pkttables (which uses incremental rule updates) to flush
intelligently.

2) Every iptables rule currently keeps pkt/byte counters, meaning we
can't bypass rules even though they might have no effect on the packet
(eg. iptables -A INPUT -i eth0 -j ETH0_RULES).  We can address this by
having pkt/byte counters in the gufc entry and a method of pushing them
back to iptables when the gufc entry is pruned, and manually traversing
the trie to flush them when the user asks for counters.

> I had the idea of a lazy scheme.  When we create a GUFC entry, we
> tack it onto a DMA'able linked list the card uses.  We do not
> notify the card, we just entail the update onto the list.
> 
> Then, if the card misses its on-chip GUFC table on an incoming
> packet, it checks the DMA update list by reading it in from memory.
> It updates its GUFC table with whatever entries are found on this
> list, then it retries to classify the packet.

I had assumed we would simply do full lookup on non-hw-classified
packets, so async insertion is a non-issue.  Can we assume hardware will
cover entire GUFC trie?

> This seems like a possible good solution until we try to address GUFC
> entry deletion, which unfortunat

Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread Evgeniy Polyakov
On Mon, Jul 31, 2006 at 03:00:28PM -0700, David Miller ([EMAIL PROTECTED]) wrote:
> From: Evgeniy Polyakov <[EMAIL PROTECTED]>
> Date: Mon, 31 Jul 2006 23:41:43 +0400
> 
> > Since kevents are never generated by kernel, but only marked as ready,
> > length of the main queue performs as flow control, so we can create a
> > mapped buffer which will have space equal to the main queue length
> > multiplied by size of the copied to userspace structure plus 16 bits for
> > the start index of the kernel writing side, i.e. it will store offset
> > where the oldest event was placed.
> >
> > Since queue length is a limiting factor and thus no new events can be added
> > when the queue is full, that means that the buffer is full too and userspace
> > must read events. When the syscall is called to add a new kevent and the
> > offset provided there differs from what the kernel stored, that means that
> > all events from the kernel's offset to the provided index have been read and
> > new events can be added.
> > Thus we can even allow read-only mapping. Kernel's index is incremented
> > modulo queue length. If a kevent was removed after it was marked as
> > ready, its copy stays in the mapped buffer, but a special flag can be
> > assigned to show that the kevent is no longer valid.
> 
> This sounds reasonable.
> 
> However we must be mindful that the thread of control trying to
> add a new event might not be in a position to drain the queue
> of pending events when the queue is full.  Usually he will be
> trying to add an event in response to handling another event.
> 
> So we'd have cases like this, assume we start with a full event
> queue:
> 
>   thread A                  thread B
> 
>   dequeue event
>   aha, new connection
>   accept()
>                             register new kevent
>                             queue is now full again
>   add kevent on new
>   connection
> 
> At this point thread A doesn't have very many options when the kevent
> add fails.  You cannot force this thread to read more events, since he
> may not be in a state where he is easily able to do so.

By default kevents are not removed from the queue, so accept events
will stay in the queue and thread B will fail to register a new kevent.
To remove a kevent from the queue, the user should either set the
one-shot flag or do it with a special command.
So if the queue is full and none of the events are one-shot, the
control thread must decide what to do and remove some of them (and
next time add them with the one-shot flag).
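
The mapped-buffer scheme quoted above can be modelled in userspace roughly as follows. This is a sketch under assumptions: the struct layout, the names and the queue length of 4 are made up for illustration and are not the kevent patch's actual interface.

```c
#include <assert.h>
#include <stdint.h>

#define KEVENT_QUEUE_LEN 4		/* hypothetical main queue length */
#define KEVENT_SLOT_INVALID 0x1		/* hypothetical "no longer valid" flag */

struct ukevent_slot {
	uint32_t id;
	uint32_t flags;
};

struct kevent_ring {
	uint16_t kidx;		/* next slot the kernel writes, modulo queue length */
	uint16_t oldest;	/* offset of the oldest unread event */
	uint16_t used;		/* slots holding events userspace has not read */
	struct ukevent_slot slot[KEVENT_QUEUE_LEN];
};

/* Kernel side: copy a ready event into the ring; -1 if the ring is full. */
int kevent_ring_mark_ready(struct kevent_ring *r, uint32_t id)
{
	int idx;

	if (r->used == KEVENT_QUEUE_LEN)
		return -1;
	idx = r->kidx;
	r->slot[idx].id = id;
	r->slot[idx].flags = 0;
	r->kidx = (uint16_t)((r->kidx + 1) % KEVENT_QUEUE_LEN);
	r->used++;
	return idx;
}

/*
 * Userspace passes the offset it has read up to. If it differs from the
 * stored one, the slots in between are free again. Returns the number of
 * freed slots, or -1 for a bogus offset.
 */
int kevent_ring_ack(struct kevent_ring *r, uint16_t uoff)
{
	uint16_t freed;

	if (uoff >= KEVENT_QUEUE_LEN)
		return -1;
	freed = (uint16_t)((uoff + KEVENT_QUEUE_LEN - r->oldest) % KEVENT_QUEUE_LEN);
	if (freed > r->used)
		return -1;
	r->oldest = uoff;
	r->used = (uint16_t)(r->used - freed);
	return freed;
}

/* A kevent removed after being marked ready keeps its slot but is flagged. */
void kevent_ring_invalidate(struct kevent_ring *r, unsigned int idx)
{
	assert(idx < KEVENT_QUEUE_LEN);
	r->slot[idx].flags |= KEVENT_SLOT_INVALID;
}
```

Note that an offset equal to the stored one means "nothing read", so in this model a full ring can hand back at most `KEVENT_QUEUE_LEN - 1` slots per call.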

-- 
Evgeniy Polyakov


Re: FACK and CWND

2006-07-31 Thread David Miller
From: "Ma Lin" <[EMAIL PROTECTED]>
Date: Fri, 28 Jul 2006 18:37:15 +0800

> In FACK, the holes between SACK blocks are considered as loss. To a
> sender, when SACK comes in, lost_out would be non-zero. According to
> linux-2.6.17.7/net/ipv4/tcp_input.c, function tcp_time_to_recover(),
> this non-zero lost_out will send the sender into "Recovery" state,
> in which CWND could be reduced. In short, it seems that FACK
> would allow SACK holes to reduce CWND.

That's right, because when tp->lost_out is set we have some form
of absolute proof that packets were lost.

Note that even when not receiving SACK blocks, ie. pure Reno,
we emulate the SACK information the best we can.

So, if we have real SACK blocks, tcp_update_scoreboard() will
mark all packets in the retransmit queue up to "fackets_out"
minus "reordering" as lost.

Else, for non-SACK, only the head packet in the retransmit queue
will be marked as lost.

> However, in the paper "Congestion Control in Linux TCP", Section 3,
> subsection Recovery, it says that Recovery state is triggered by
> "sufficient amount of successive duplicate ACK", to my understanding,
> that means 3-dup.

Under Linux it has a more complicated definition.  We wait until
we see at least "tp->reordering" packets lost.

Dynamically we try to determine how deeply packets are being
reordered on the connection.

Using this value, we use "tp->fackets_out - tp->reordering"
as how many packets we think have been proven as lost.

You will note that any code path that falls through to the end of
tcp_fastretrans_alert() will retransmit one packet using a call
to tcp_xmit_retransmit_queue().  And one such code path is the
transition to TCP_CA_Recovery which is guarded by the
tcp_time_to_recover() check, which encapsulates the two tests
we've discussed as:

	if (tp->lost_out)
		return 1;
	if (tcp_fackets_out(tp) > tp->reordering)
		return 1;

The next few checks try to handle some fringe cases, such as the head
packet in the retransmit queue having been sent more than an RTO ago,
and also having so few packets in the retransmit queue that normal
recovery mechanisms cannot function properly:

	if (tcp_head_timedout(sk, tp))
		return 1;
	packets_out = tp->packets_out;
	if (packets_out <= tp->reordering &&
	    tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
	    !tcp_may_send_now(sk, tp)) {
		return 1;
	}
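
To experiment with these checks outside the kernel, a simplified model along the following lines may help. It is an assumption-laden sketch: the `tcp_model` struct is made up, the two booleans stand in for `tcp_head_timedout()` and `tcp_may_send_now()`, and `fackets_out` is used directly rather than through `tcp_fackets_out()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the fields of struct tcp_sock used above. */
struct tcp_model {
	uint32_t lost_out;
	uint32_t fackets_out;
	uint32_t sacked_out;
	uint32_t packets_out;
	uint32_t reordering;
	bool head_timedout;	/* stands in for tcp_head_timedout(sk, tp) */
	bool may_send_now;	/* stands in for tcp_may_send_now(sk, tp) */
};

uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* Packets considered proven lost: fackets_out - reordering, floored at 0. */
uint32_t proven_lost(const struct tcp_model *tp)
{
	assert(tp != NULL);
	return tp->fackets_out > tp->reordering ?
	       tp->fackets_out - tp->reordering : 0;
}

/* Model of the four checks quoted above, in the same order. */
bool time_to_recover(const struct tcp_model *tp, uint32_t sysctl_reordering)
{
	assert(tp != NULL);
	if (tp->lost_out)
		return true;
	if (tp->fackets_out > tp->reordering)
		return true;
	if (tp->head_timedout)
		return true;
	if (tp->packets_out <= tp->reordering &&
	    tp->sacked_out >= max_u32(tp->packets_out / 2, sysctl_reordering) &&
	    !tp->may_send_now)
		return true;
	return false;
}
```

For example, with `fackets_out = 5` and `reordering = 3` the model reports two packets as proven lost and enters recovery, while `fackets_out = 3` with nothing else set does not.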

Hope this helps.



Re: [RFC] gre: transparent ethernet bridging

2006-07-31 Thread Stephen Hemminger
On Tue, 01 Aug 2006 11:15:29 +1000
Philip Craig <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger wrote:
> > On Mon, 31 Jul 2006 20:06:41 +1000
> > Philip Craig <[EMAIL PROTECTED]> wrote:
> > 
> >> This patch implements transparent ethernet bridging for gre tunnels.
> >> There are a few outstanding issues.
> > 
> > Why not use existing bridge code?
> 
> It does use the existing bridge code.  Perhaps the name is misleading.
> All it does is encapsulate the full ethernet header in a gre packet,
> rather than only layer 3.  That is, currently gre uses ARPHRD_IPGRE,
> but bridging requires ARPHRD_ETHER.
> 

I am not against making the bridge code smarter to handle other
encapsulation.

> >> Some routers set LLC_SAP_BSPAN in the gre protocol field, and then
> >> give the bpdu packet without any other ethernet/llc header. This patch
> >> currently tries to fake the ethernet/llc header before passing the
> >> packet up, but it is buggy (mac addresses are wrong at least). Maybe a
> >> better approach is to call directly into the bridging code. I didn't try
> >> that at first because it isn't modular, and may break other things that
> >> want to see the packet.
> > 
> > Existing bridge code already has spanning tree.
> 
> Yes, and I want to use that.  But this packet is a bit strange in
> that it does not have the ethernet header on it.   So what is the
> best way to pass it to existing code?  Either fake the ethernet
> header, or pass it directly?

Likewise if the bridge STP bpdu input code was smarter, it could
deal with it maybe?

> 
> >> +#if 0
> >>dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup);
> >> +#else
> >> +  dev = alloc_netdev(sizeof(*t), name, ipgre_ether_tunnel_setup);
> >> +#endif
> > 
> > "Do, or do not there is no try"
> 
> I am looking for comments as to whether adding a netlink interface
> to control this is appropriate.

If we make bridge code type aware, then the ipgre tunnel wouldn't have to 
change.

> >> +__be16 ipgre_type_trans(struct sk_buff *skb, int offset)
> >> +{
> >> +  u8 *h = skb->data;
> >> +  __be16 flags = *(__be16*)h;
> >> +  __be16 proto = *(__be16*)(h + 2);
> >> +
> >> +  /* WCCP version 1 and 2 protocol decoding.
> >> +   * - Change protocol to IP
> >> +   * - When dealing with WCCPv2, Skip extra 4 bytes in GRE header
> >> +   */
> >> +  if (flags == 0 &&
> >> +  proto == __constant_htons(ETH_P_WCCP)) {
> >> +  proto = __constant_htons(ETH_P_IP);
> >> +  if ((*(h + offset) & 0xF0) != 0x40)
> >> +  offset += 4;
> >> +  }
> > 
> > Don't use __constant_htons() except in initializers and switch cases
> > (where gcc is too stupid to optimize the macro).
> > 
> 
> This is a problem in the existing code, which I am simply moving
> around.  Should I fix it at the same time?

Usually if a diff touches some code, I try to make it use current practice.
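
For illustration, the WCCP check from the quoted hunk written with plain htons() might read as below. This is a userspace sketch, not the patch itself: the function name, the bounds assertion and returning the adjusted offset through a pointer are assumptions made so the fragment is self-contained.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifndef ETH_P_IP
#define ETH_P_IP   0x0800
#endif
#ifndef ETH_P_WCCP
#define ETH_P_WCCP 0x883E
#endif

/*
 * h points at the GRE header, *offset at the start of the payload.
 * Returns the protocol in network byte order; for WCCP it is rewritten
 * to IP and *offset skips the extra 4 bytes of a WCCPv2 header.
 */
uint16_t gre_proto_trans(const uint8_t *h, size_t len, int *offset)
{
	uint16_t flags, proto;

	assert(len >= 4 && *offset >= 4 && (size_t)*offset < len);
	memcpy(&flags, h, sizeof(flags));
	memcpy(&proto, h + 2, sizeof(proto));

	/* htons() on a constant is folded at compile time here. */
	if (flags == 0 && proto == htons(ETH_P_WCCP)) {
		proto = htons(ETH_P_IP);
		/* WCCPv1 payload starts with the IPv4 version nibble. */
		if ((h[*offset] & 0xF0) != 0x40)
			*offset += 4;
	}
	return proto;
}
```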


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-31 Thread David Miller
From: Rusty Russell <[EMAIL PROTECTED]>
Date: Fri, 28 Jul 2006 15:54:04 +1000

> (1) I am imagining some Grand Unified Flow Cache (Olsson trie?) that
> holds (some subset of?) flows.  A successful lookup immediately after
> packet comes off NIC gives destiny for packet: what route, (optionally)
> what socket, what filtering, what connection tracking (& what NAT), etc?
> I don't know if this should be a general array of fn & data ptrs, or
> specialized fields for each one, or a mix.  Maybe there's a "too hard,
> do slow path" bit, or maybe hard cases just never get put in the cache.
> Perhaps we need a separate one for locally-generated packets, a-la
> ip_route_output().  Anyway, we trade slightly more expensive flow setup
> for faster packet processing within flows.

So, specifically, one of the methods you are thinking about might
be implemented by adding:

void (*input)(struct sk_buff *, void *);
void *input_data;

to "struct flow_cache_entry" or whatever replaces it?

This way we don't need some kind of "type" information in
the flow cache entry, since the input handler knows the type.

> One way to do this is to add a "have_interest" callback into the
> hook_ops, which takes each about-to-be-inserted GUFC entry and adds any
> destinies this hook cares about.  In the case of packet filtering this
> would do a traversal and append a fn/data ptr to the entry for each rule
> which could affect it.

Can you give a concrete example of how the GUFC might make use
of this?  Just some small abstract code snippets will do.

> The other way is to have the hooks register what they are interested in
> into a general data structure which GUFC entry creation then looks up
> itself.  This general data structure will need to support wildcards
> though.

My gut reaction is that imposing a global data structure on all object
classes is not prudent.  When we take a GUFC miss, it seems better we
call into the subsystems to resolve things.  It can implement whatever
slow path lookup algorithm is most appropriate for its data.

> We also need efficient ways of reflecting rule changes into the GUFC.
> We can be pretty slack with conntrack timeouts, but we either need to
> flush or handle callbacks from GUFC on timed-out entries.  Packet
> filtering changes need to be synchronous, definitely.

This, I will remind, is similar to the problem of doing RCU locking
of the TCP hash tables.

> (3) Smart NICs that do some flowid work themselves can accelerate lookup
> implicitly (same flow goes to same CPU/thread) or explicitly (each
> CPU/thread maintains only part of GUFC which it needs, or even NIC
> returns flow cookie which is pointer to GUFC entry or subtree?).  AFAICT
> this will magnify the payoff from the GUFC.

I want to warn you about HW issues that I mentioned to Alexey the
other week.  If we are not careful, we can run into the same issues
TOE cards run into, performance wise.

Namely, it is important to be careful about how the GUFC table entries
get updated in the card.  If you add them synchronously, your
connection rates will deteriorate dramatically.

I had the idea of a lazy scheme.  When we create a GUFC entry, we
tack it onto a DMA'able linked list the card uses.  We do not
notify the card, we just entail the update onto the list.

Then, if the card misses its on-chip GUFC table on an incoming
packet, it checks the DMA update list by reading it in from memory.
It updates its GUFC table with whatever entries are found on this
list, then it retries to classify the packet.

This seems like a possible good solution until we try to address GUFC
entry deletion, which unfortunately cannot be evaluated in a lazy
fashion.  It must be synchronous.  This is because if, for example, we
just killed off a TCP socket we must make sure we don't hit the GUFC
entry for the TCP identity of that socket any longer.
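
A toy model of the lazy-insert, synchronous-delete scheme described above, to make the asymmetry concrete. Everything here is hypothetical: the card is just a struct, flows are plain integers, and `sync_ops` counts the operations that would require waiting on the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GUFC_SLOTS 8

struct flow_set {
	uint32_t id[GUFC_SLOTS];
	unsigned int n;
};

struct nic_model {
	struct flow_set chip;		/* on-chip GUFC table */
	struct flow_set pending;	/* DMA-able update list in host memory */
	unsigned int sync_ops;		/* synchronous host-to-card operations */
};

int set_find(const struct flow_set *s, uint32_t id)
{
	unsigned int i;

	assert(s->n <= GUFC_SLOTS);
	for (i = 0; i < s->n; i++)
		if (s->id[i] == id)
			return (int)i;
	return -1;
}

bool set_add(struct flow_set *s, uint32_t id)
{
	if (s->n == GUFC_SLOTS)
		return false;
	s->id[s->n++] = id;
	return true;
}

void set_del(struct flow_set *s, uint32_t id)
{
	int i = set_find(s, id);

	if (i >= 0)
		s->id[i] = s->id[--s->n];	/* move last entry into the hole */
}

/* Host: lazy insert, the card is not notified. */
bool host_add_flow(struct nic_model *nic, uint32_t id)
{
	return set_add(&nic->pending, id);
}

/* Card: on a table miss, pull in the update list and retry once. */
bool card_classify(struct nic_model *nic, uint32_t id)
{
	if (set_find(&nic->chip, id) >= 0)
		return true;
	while (nic->pending.n > 0 && nic->chip.n < GUFC_SLOTS)
		set_add(&nic->chip, nic->pending.id[--nic->pending.n]);
	return set_find(&nic->chip, id) >= 0;
}

/* Host: deletion must take effect before we return. */
void host_del_flow(struct nic_model *nic, uint32_t id)
{
	set_del(&nic->pending, id);
	set_del(&nic->chip, id);
	nic->sync_ops++;
}
```

Adding a flow costs nothing until the card first misses on it, while every delete bumps `sync_ops`, which is the cost the message above warns about.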

Just something to think about, when considering how to translate these
ideas into hardware.


Re: [parisc-linux] [git patches] tulip fixes from parisc-linux

2006-07-31 Thread Valerie Henson
On Sun, Jul 30, 2006 at 02:54:56PM -0400, Kyle McMartin wrote:
> On Sun, Jul 30, 2006 at 11:35:32AM -0700, Andrew Morton wrote:
> > hm.  A couple of those patches have been futzing around in -mm for over a
> > year and have been nacked by Jeff and are a regular source of grumpygrams. 
> > I've been sitting on them in the pathetic hope that someone will one day
> > get down and address the bugs which they fix in an acceptable fashion,
> > whatever that is.
> > 
> 
> Jeff/Val seemed willing to merge the fixes as they stood. parisc-linux
> merged Francois' tulip workqueue patch some time ago, and have been
> running with it since without issue. This defers the tulip_select_media
> work to process context, and so should be less of an issue.

Hey Kyle,

Thanks for splitting these out.  Could you do us a favor and post the
patches themselves?  I'm not the only one who doesn't use git, and it
will be a lot less confusing if we can directly ack the patches in
email instead of referring to them third-hand.  Thanks,

-VAL


Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-07-31 Thread Krzysztof Halasa
David Miller <[EMAIL PROTECTED]> writes:

>> hdlc_fr: logical PVC devices have no headers (plain IPv4 etc. as seen
>> by tcpdump), but they append FR headers (4 or 10 bytes long) just
>> before passing the skb to physical device.
>
> If you hooked up fr_hard_header into dev->hard_header instead of
> invoking it via pvc_xmit(), everything would be fine.

That would have to be master_device->hard_header(), but the network
stack (IP and friends) has to send packets to pvc_device.
I can't make the headers show up on pvc device - that would break
packet interface and Ethernet framing. The headers have to be
visible only on master (physical) device.

> The complexity of this function arises from the fact that it prepends
> headers of differing lengths depending upon the protocol type
> being encapsulated, and this is the problem you should aim to
> solve.

Actually I don't think there is a problem with different header
lengths. The driver indicates it wants 10 bytes and that's enough
for all cases (except Ethernet framing where it indicates and uses
14 bytes and reallocs before prepending another 10 bytes).
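
The headroom argument can be illustrated with a small userspace model. It is only a sketch: `struct pkt` and its helpers are made up here and loosely mirror what skb_headroom(), skb_push() and skb_realloc_headroom() do for a real skb.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy packet buffer: [head .. data) is headroom, data holds len bytes. */
struct pkt {
	uint8_t *head;
	uint8_t *data;
	size_t len;
};

/* Allocate a packet with the headroom the device advertised. */
int pkt_init(struct pkt *p, size_t headroom, const void *payload, size_t len)
{
	p->head = malloc(headroom + len);
	if (!p->head)
		return -1;
	p->data = p->head + headroom;
	p->len = len;
	memcpy(p->data, payload, len);
	return 0;
}

size_t pkt_headroom(const struct pkt *p)
{
	return (size_t)(p->data - p->head);
}

/* Prepend a header; reallocate first if the headroom is too small. */
int pkt_push(struct pkt *p, const void *hdr, size_t hlen)
{
	assert(p->head != NULL);
	if (pkt_headroom(p) < hlen) {
		uint8_t *n = malloc(hlen + p->len);

		if (!n)
			return -1;
		memcpy(n + hlen, p->data, p->len);
		free(p->head);
		p->head = n;
		p->data = n + hlen;
	}
	p->data -= hlen;
	p->len += hlen;
	memcpy(p->data, hdr, hlen);
	return 0;
}

void pkt_free(struct pkt *p)
{
	free(p->head);
	p->head = NULL;
}
```

With 10 bytes of advertised headroom a 4-byte or 10-byte header fits without copying; with none, the push has to take the reallocation path, which is the case a driver must handle explicitly.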

> Alexey, any suggestions on how to handle this kind of thing?

What's wrong with my patch?

If it can't be accepted I can just add an empty pvc->hard_header().
That won't make other drivers work reliably, though, and it's IMHO
hardly their author's fault. I don't think we've ever advertised
"hard_header_len is valid only with non-NULL hard_header".
-- 
Krzysztof Halasa
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Regarding offloading IPv6 addrconf and ndisc

2006-07-31 Thread Jamal Hadi Salim
On Tue, 2006-01-08 at 11:30 +1000, Herbert Xu wrote:

> 
> You can now disable the OOM killer on a per-process basis by
> 
> echo -17 > /proc/<pid>/oom_adj
> 

nice to know ;-> At least you can protect some apps if you need to.
Only racoon and quagga are important for me.
But what happens then if you have a beast that just chews memory
forever? I suppose other poor apps will just get shot.

My plan was just to write a simple daemon that uses the genetlink API
that Shailabh (IBM) and company wrote and just restart the app if i see
it disappear. 

cheers,
jamal



Re: Linville's L2 rant... -- Re: PATCH Fix bonding active-backup behavior for VLAN interfaces

2006-07-31 Thread Jamal Hadi Salim
On Mon, 2006-31-07 at 08:30 -0400, John W. Linville wrote:
> On Mon, Jul 31, 2006 at 10:15:40AM +0200, Christophe Devriese wrote:
> 
> > If you bond 2 vlan subinterfaces, the patch is not necessary at all. In
> > that case also the source device will be changed from eth0. to bond. So
> > that's correct behavior no ?
> > 
> > In the second case, you create vlan subifs on a bonding device, vlan
> > subinterfaces will be created on the slave interfaces. In that case the
> > vlan
> 
> (This is not directed at Christophe, or anyone in particular...)
> 
> 
> 
> Am I the only one that thinks that our handling of LAN L2 stuff
> is at best a little "too" flexible (and at worst a collection of
> nasty hacks)?
> 
> I mean, do we really need both the ability to bond multiple vlan
> interfaces AND the ability to have vlan interfaces on top of a bond?
> How many people really appreciate the subtle(?) differences?
> 
> Then throw bridging into the mix!  If I'm using VLANs and bonds in
> a bridged environment, do I bridge the bonds, or bond the bridges?
> Do the VLANs come before the bonds?  after the bridges?  or somewhere
> in-between?  Do all these combinations even work together?  Who has
> the definitive answer (besides the code itself)?
> 
> I have no doubt that there are plenty of opportunities for cleverness
> here (and no doubt dragons too).  I just doubt that most of them
> are worth the complexities introduced by our current collection of
> "transparently" stackable pseudo-drivers and strategically placed hacks
> (e.g. skb_bond).  All that, and it still isn't clear to me how we
> can cleanly accommodate 802.1s (which adds VLAN awareness to bridging).
> 
> Do we hold the view that our L2 code is on par with the rest of
> our code?  Is there an appetite for a clean-up?  Or is it just me?
> 
> 
> 
> If you made it this far, thanks for listening...I feel better now. :-)

Yes, I made it this far and you do make a good argument (or i may be
over-dosed ;->).
I have seen the following setups that are useful:

1) Vlans with bridges; in which one or more vlans exist per ethernet
port. Broadcast packets within such vlans are restricted to just those
vlans by the bridge.
2) complicate the above a little by having multiple spanning trees. 
3) Add to the above link layer HA (802.1ad or otherwise as presented
today by Bonding).

To answer your question; i think yes we need all 3.
Unfortunately the 3 above are all done by different people with
different intentions altogether. I think BGrears end goal was VLANs for
an end host. I think Lennert wrote the original Bridge code and for a
while had some VLAN code that worked well with bridging (that code died
as far as i know). Then bonding - there's some pre-historic relation to
it since D Becker days and then the good folks from Intel adding about
1M features to it. Yes, the fact all 3 need to work together is a
mess ;-> (but there are good pragmatic reasons for them to work
together)...
Hope that helps ;->

cheers,
jamal




Re: Regarding offloading IPv6 addrconf and ndisc

2006-07-31 Thread Herbert Xu
On Mon, Jul 31, 2006 at 09:24:27PM -0400, Jamal Hadi Salim wrote:
>  
> In regards to reliability: The thing that really fscks people using
> daemons from what i have seen is the oom killer policies and the lack of
> correlation by apps. I just watched quagga die horribly on a 256M
> machine on friday once we hit around 100K routes and a lot of route
> cache hits. So apps like that may need a total rewrite. I am not looking
> forward to trying to get racoon to do 50K SAs and 100K SPDs on the same
> machine ;->

You can now disable the OOM killer on a per-process basis by

echo -17 > /proc/<pid>/oom_adj

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: Regarding offloading IPv6 addrconf and ndisc

2006-07-31 Thread Jamal Hadi Salim
On Mon, 2006-31-07 at 17:49 -0700, Roland Dreier wrote:
> David> Why is this a relevant analogy?  Well, you have physical
> David> hard-disks in your computer today, but at some point that
> David> device becomes largely superfluous.  It makes more sense to
> David> have just a cpu with a 10-gigabit ethernet interface
> David> incorporated onto the cpu die, and the majority if not all
> David> of your disk access is remote.
> 
> Isn't most of the iSCSI control plane in userspace right now?

I know iscsi is supposed to integrate with ipsec as well (and SLP for
discovery) - does that happen in user space as well?

Dave (I am under heavy flu dose, so I may be incoherent ;->), but here's a
devil's advocate bit for you:
TCP FIN/SYN are just control packets - so move the connection
setup/teardown out to user space ;->. You can then add all sorts of funky
DoS detection/prevention schemes as needed - makes it easy to experiment with.
Actually move the slow path as well, SACK processing etc. (i know it is in
process context today, but that's in the kernel). Just leave VJ's fast path in
the kernel. Extend the user space bit to be the new VJ (channels stuff but
just for control) - async notification to carry the control/slow path
packets to user space.

In regards to ARP/NDISC being in user space: note people are talking
about secure DHCP or some form of initial pre-layer2 addressing over EAP
or something along those lines; i.e if you are not securely validated at
the L2 level you are not even getting an IP address. 
 
In regards to reliability: The thing that really fscks people using
daemons from what i have seen is the oom killer policies and the lack of
correlation by apps. I just watched quagga die horribly on a 256M
machine on friday once we hit around 100K routes and a lot of route
cache hits. So apps like that may need a total rewrite. I am not looking
forward to trying to get racoon to do 50K SAs and 100K SPDs on the same
machine ;->

I think I like what Hugo is saying ;-> I just hope he has time and
resources to produce code. 

cheers,
jamal





Re: [RFC] gre: transparent ethernet bridging

2006-07-31 Thread Philip Craig
Stephen Hemminger wrote:
> On Mon, 31 Jul 2006 20:06:41 +1000
> Philip Craig <[EMAIL PROTECTED]> wrote:
> 
>> This patch implements transparent ethernet bridging for gre tunnels.
>> There are a few outstanding issues.
> 
> Why not use existing bridge code?

It does use the existing bridge code.  Perhaps the name is misleading.
All it does is encapsulate the full ethernet header in a gre packet,
rather than only layer 3.  That is, currently gre uses ARPHRD_IPGRE,
but bridging requires ARPHRD_ETHER.

>> Some routers set LLC_SAP_BSPAN in the gre protocol field, and then
>> give the bpdu packet without any other ethernet/llc header. This patch
>> currently tries to fake the ethernet/llc header before passing the
>> packet up, but it is buggy (mac addresses are wrong at least). Maybe a
>> better approach is to call directly into the bridging code. I didn't try
>> that at first because it isn't modular, and may break other things that
>> want to see the packet.
> 
> Existing bridge code already has spanning tree.

Yes, and I want to use that.  But this packet is a bit strange in
that it does not have the ethernet header on it.   So what is the
best way to pass it to existing code?  Either fake the ethernet
header, or pass it directly?

>> +#if 0
>>  dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup);
>> +#else
>> +dev = alloc_netdev(sizeof(*t), name, ipgre_ether_tunnel_setup);
>> +#endif
> 
> "Do, or do not there is no try"

I am looking for comments as to whether adding a netlink interface
to control this is appropriate.

>> +__be16 ipgre_type_trans(struct sk_buff *skb, int offset)
>> +{
>> +u8 *h = skb->data;
>> +__be16 flags = *(__be16*)h;
>> +__be16 proto = *(__be16*)(h + 2);
>> +
>> +/* WCCP version 1 and 2 protocol decoding.
>> + * - Change protocol to IP
>> + * - When dealing with WCCPv2, Skip extra 4 bytes in GRE header
>> + */
>> +if (flags == 0 &&
>> +proto == __constant_htons(ETH_P_WCCP)) {
>> +proto = __constant_htons(ETH_P_IP);
>> +if ((*(h + offset) & 0xF0) != 0x40)
>> +offset += 4;
>> +}
> 
> Don't use __constant_htons() except in initializers and switch cases
> (where gcc is too stupid to optimize the macro).
> 

This is a problem in the existing code, which I am simply moving
around.  Should I fix it at the same time?


Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-07-31 Thread David Miller
From: Krzysztof Halasa <[EMAIL PROTECTED]>
Date: Tue, 01 Aug 2006 03:04:28 +0200

> hdlc_fr: logical PVC devices have no headers (plain IPv4 etc. as seen
> by tcpdump), but they append FR headers (4 or 10 bytes long) just
> before passing the skb to physical device.

If you hooked up fr_hard_header into dev->hard_header instead of
invoking it via pvc_xmit(), everything would be fine.

The complexity of this function arises from the fact that it prepends
headers of differing lengths depending upon the protocol type
being encapsulated, and this is the problem you should aim to
solve.

Alexey, any suggestions on how to handle this kind of thing?



Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread David Miller
From: Evgeniy Polyakov <[EMAIL PROTECTED]>
Date: Fri, 28 Jul 2006 09:23:12 +0400

> I completely agree that the existing kevent interface is not the best, so
> I'm open to any suggestions.
> Should kevent creation/removal/modification be separated too?

I do not think so; the object for these 3 operations is the same,
so there are no typing issues.


Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-07-31 Thread Krzysztof Halasa
David Miller <[EMAIL PROTECTED]> writes:

> Krzysztof, which device driver exactly creates this problem
> in the first place?

I have a report (not sure but I think it's that) with hdlc_fr (Frame
Relay).

Grepping through the tree there might be problems with:
- net/8021q/vlan.c (probably not with normal Ethernet, but there is
  a code path which could potentially be a problem with
  NETIF_F_HW_VLAN_TX)
- net/atm/clip.c
- net/appletalk/*
- drivers/net/gianfar.c
- drivers/net/wan/lapbether.c
- drivers/s390/net/netiucv.c will not oops but merely drop the packet
  and print a warning.

and possibly others, I haven't checked the whole tree.
Some (not all) of them might be false positives, though.

Fortunately most of the time skb comes with preallocated header space
(that common skb_reserve(2) I think) and thus the reports aren't
frequent (personally I have never seen that).

> If you have headers to prepend for your device, why do you set the
> header building function to NULL? :-)

hdlc_fr: logical PVC devices have no headers (plain IPv4 etc. as seen
by tcpdump), but they append FR headers (4 or 10 bytes long) just
before passing the skb to physical device.
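
To make the failure mode concrete, here is a toy userspace model of
header space.  struct toy_skb, toy_reserve(), toy_headroom() and
toy_push() are hypothetical names that only mimic the idea behind
skb_reserve() and skb_push(); the 4 and 10 byte header lengths are the
FR ones mentioned above.  Where this sketch returns NULL, the real
skb_push() would panic.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of an skb: data starts somewhere inside buf, and the gap
 * between buf and data is the headroom available for prepended headers. */
struct toy_skb {
	unsigned char buf[64];
	unsigned char *data;	/* start of current packet data */
	size_t len;		/* bytes of packet data */
};

/* Leave 'headroom' bytes free in front of the (still empty) packet. */
static void toy_reserve(struct toy_skb *skb, size_t headroom)
{
	assert(headroom <= sizeof(skb->buf));
	skb->data = skb->buf + headroom;
	skb->len = 0;
}

static size_t toy_headroom(const struct toy_skb *skb)
{
	return (size_t)(skb->data - skb->buf);
}

/* Prepend hdr_len bytes.  Returns the new data pointer, or NULL when the
 * headroom is too small. */
static unsigned char *toy_push(struct toy_skb *skb, size_t hdr_len)
{
	if (toy_headroom(skb) < hdr_len)
		return NULL;
	skb->data -= hdr_len;
	skb->len += hdr_len;
	return skb->data;
}
```

With only 4 bytes reserved, toy_push(&skb, 4) succeeds but
toy_push(&skb, 10) does not, which is the situation a PVC device gets
into when the stack reserved less header space than the longest FR
header needs.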
-- 
Krzysztof Halasa


Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread David Miller
From: Zach Brown <[EMAIL PROTECTED]>
Date: Thu, 27 Jul 2006 12:18:42 -0700

[ I kept this thread around in my inbox because I wanted to give it
  some deep thought, so sorry for replying to old bits... ]

> So as the kernel generates events in the ring it only produces an event
> if the ownership field says that userspace has consumed it and in doing
> so it sets the ownership field to tell userspace that an event is
> waiting.  userspace and the kernel now each follow their index around
> the ring as the ownership field lets them produce or consume the event
> at their index.  Can someone tell me if the cache coherence costs of
> this are extreme?  I'm hoping they're not.

No need for an owner field, we can use something like a VJ
netchannel datastructure for this.  Kernel only writes to
producer index and user only writes to consumer index.
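
A minimal sketch of that idea, assuming a single producer and a single
consumer: each side writes only its own index, so no per-slot ownership
field is needed.  All names here (toy_ring, ring_produce(),
ring_consume()) are hypothetical, and C11 atomics stand in for whatever
barriers a real kernel/user shared mapping would need; this is not the
netchannel structure itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0,
	      "RING_SIZE must be a power of two");

struct toy_event {
	uint32_t id;
};

struct toy_ring {
	_Atomic uint32_t prod;	/* written only by the producer ("kernel") */
	_Atomic uint32_t cons;	/* written only by the consumer ("user") */
	struct toy_event slot[RING_SIZE];
};

/* Producer side: returns false when the ring is full. */
static bool ring_produce(struct toy_ring *r, struct toy_event ev)
{
	uint32_t prod = atomic_load_explicit(&r->prod, memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->cons, memory_order_acquire);

	if (prod - cons == RING_SIZE)
		return false;
	r->slot[prod & (RING_SIZE - 1)] = ev;
	/* Publish the slot contents before the new index becomes visible. */
	atomic_store_explicit(&r->prod, prod + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_consume(struct toy_ring *r, struct toy_event *out)
{
	uint32_t cons = atomic_load_explicit(&r->cons, memory_order_relaxed);
	uint32_t prod = atomic_load_explicit(&r->prod, memory_order_acquire);

	if (prod == cons)
		return false;
	*out = r->slot[cons & (RING_SIZE - 1)];
	/* Hand the slot back only after the copy is done. */
	atomic_store_explicit(&r->cons, cons + 1, memory_order_release);
	return true;
}
```

The indices are free-running and only masked when used as a slot number,
so "full" and "empty" can be told apart without wasting a slot.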

> So, great, glibc can now find pending events very quickly if they're
> waiting in the ring and can fall back to the collection syscall if it
> wants to wait and the ring is empty.  If it consumes events via the
> syscall it increases its ring index by the number the syscall returned.

If we do a ring buffer, I do not think events should be obtainable
via a syscall at all.  Rather, I think this system call should be
purely "sleep until ring is not empty".

This is actually reasonably simple stuff to implement as Evgeniy
has tried to explain.

Events in kevent live on a ready list when they have triggered.
Existence on a list determines the state, and I think this design,
btw, invalidates some of the arguments against using netlink that
Ulrich mentions in his paper.  If netlink socket queuing fails,
then the kevent simply stays on the ready list until it can be
successfully published to the user.

I am not advocating netlink at all for this, as the ring buffer idea
is much better.

The ring buffer size, as Evgeniy also tried to describe, is bounded
purely by the number of registered events.  So the event loop of an
application might look something like this:

	struct ukevent cur_event;
	struct timeval timeo;

	setup_timeout(&timeo);
	for (;;) {
		int err;
		while(!(err = ukevent_dequeue(evt_fd, evt_ring,
					      &cur_event, &timeo))) {
			struct my_event_object *o =
				event_to_object(&cur_event);
			o->dispatch(o, &cur_event);
			setup_timeout(&timeo);
		}
		if (err == -ETIMEDOUT)
			timeout_processing();
		else
			event_error_processing(err);
	}

ukevent_dequeue() is perhaps some GLIBC implemented routine which does
something like:

	int err;

	for (;;) {
		if (!evt_ring_empty(evt_ring)) {
			struct ukevent *p = evt_ring_consume(evt_ring);
			memcpy(event_p, p, sizeof(struct ukevent));
			return 0;
		}
		err = kevent_wait(evt_fd, timeo_p);
		if (err < 0)
			break;
	}
	return err;

It's just some stupid ideas... we could also choose to expose the ring
buffer layout directly to the user event loop and let it perform the
dequeue operation and kevent_wait() calls directly.  I don't see why
not to allow that.


Re: Regarding offloading IPv6 addrconf and ndisc

2006-07-31 Thread Roland Dreier
David> Why is this a relevant analogy?  Well, you have physical
David> hard-disks in your computer today, but at some point that
David> device becomes largely superfluous.  It makes more sense to
David> have just a cpu with a 10-gigabit ethernet interface
David> incorporated onto the cpu die, and the majority if not all
David> of your disk access is remote.

Isn't most of the iSCSI control plane in userspace right now?

 - R.


Re: Regarding offloading IPv6 addrconf and ndisc

2006-07-31 Thread David Miller
From: Andi Kleen <[EMAIL PROTECTED]>
Date: Tue, 1 Aug 2006 02:31:58 +0200

> Playing devil's advocate here: if the packets are processed on
> two different CPUs then this could also happen and break the test
> case.
> 
> So the test is probably a bit fragile.

Good point.

> I generally agree it's better to keep this in kernel though.

To drive this home even more, I do not believe that the people who
advocate pushing NDISC and ARP policy into userspace would be very
happy if something like the RAID transformations were moved into
userspace and they were not able to access their disks if the RAID
transformer process in userspace died.

Why is this a relevant analogy?  Well, you have physical hard-disks in
your computer today, but at some point that device becomes largely
superfluous.  It makes more sense to have just a cpu with a 10-gigabit
ethernet interface incorporated onto the cpu die, and the majority if
not all of your disk access is remote.

At that point, network access equals disk access.  It would be amusing
to need to restart such an NDISC/ARP daemon if it were to live on a
remote volume. :-)

I understand full well that on special purpose network devices this
control vs. data plane separation into userspace might make a lot of
sense.  But for a general purpose operating system, such as Linux, the
greater concern is resiliency to failures and each piece of core
functionality you move to userspace is a new potential point of
failure.


Re: Regarding offloading IPv6 addrconf and ndisc

2006-07-31 Thread Andi Kleen

> If we process these in sequence in software interrupt, everything
> is fine.  Processing of "A" will add the address, and the test
> ping packet "B" will respond properly.
> 
> If you defer "A", everything breaks and the test packet "B" will
> get processed first and not work.

Playing devil's advocate here: if the packets are processed on
two different CPUs then this could also happen and break the test
case.

So the test is probably a bit fragile.

Currently it is unlikely to happen because of interrupt affinity for a
single device, but in the future, with MSI-X support, that may no longer hold.

I generally agree it's better to keep this in kernel though.

-Andi


Re: Regarding offloading IPv6 addrconf and ndisc

2006-07-31 Thread Kazunori Miyazawa
Hello Hugo,

Hugo Santos wrote:
> Hi,
> 
>> On the other hand, if an ND daemon loses synchronization, the result
>> is unpredictable, I guess.
> 
> What do you mean by synchronization in this context? My idea was to
> keep the ND state machine inside the kernel, and instead have the
> daemon be reactive. That means it would send messages on behalf of the
> kernel, and apply information based on received signalling (besides, ND
> is resilient to loss of messages). Taking your example, if the kernel
> is using a neighbor entry and you replace it (either changing its
> state or link-layer address), the kernel will adapt; I believe it is
> predictable. To be honest, I'm only worried about possible lost netlink
> messages; but the daemon may be implemented to handle this, re-sending
> until an ACK is received, thus minimizing any de-synchronization
> possibilities.
> 

The kernel maintains the ND state by itself and the daemon touches
that state. I think the daemon should be aware of the state.
That is what I meant by "synchronization".

Anyway, I do not intend to hold up your work any longer.
I will stop discussing this without having seen the code.

>> BTW, we have the option of implementing the functionality as a
>> module. I think it can achieve some of what you want.
> 
>Well, exporting the functionality to a module would be a start to
>  have one moving it out of the kernel. :-)
> 
>Hugo

--
Kazunori Miyazawa


Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread Zach Brown

> Ok, let's do it in the following way:
> I present new version of kevent with new syscalls and fixed issues mentioned
> before, while people look at it we can end up with mapped buffer design.
> Is it ok?

Yeah, that sounds good.  I'm looking forward to seeing the next set of
patches :).

- z


Re: Off by one buglets

2006-07-31 Thread David Miller
From: Ralf Baechle <[EMAIL PROTECTED]>
Date: Fri, 30 Jun 2006 15:29:01 +0100

> Ages ago, changeset
> 
> http://www.kernel.org/git/?p=linux/kernel/git/tglx/history.git;a=commit;h=22d864d542a0b92116751186f1794c7d0f1ca1b9
> 
> which converted several protocols from using open coded comparisons to
> use the helper function sk_acceptq_is_full() did introduce a bunch of
> off by one errors - sk_acceptq_is_full checks for
> sk_ack_backlog > sk_max_ack_backlog but it replaced >= or == comparisons.
> 
> Below patch is really only meant to illustrate the change, not to be
> applied.

I looked at this again, and the change is perfectly fine.

This patch merely shows that previously the protocols were very
inconsistent about what sk_max_ack_backlog really meant.  All
Arnaldo's changeset did was enforce a consistent meaning of
this limit across the entire tree.
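
As a small illustration of what the two comparisons mean for the limit,
here is a userspace sketch.  struct toy_sock and the function names are
hypothetical; the ">" test is the one the report attributes to
sk_acceptq_is_full(), and ">=" stands for one of the older open-coded
checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the two struct sock fields named in the thread. */
struct toy_sock {
	unsigned int ack_backlog;	/* connections waiting to be accepted */
	unsigned int max_ack_backlog;	/* limit requested via listen() */
};

/* Full only once the backlog exceeds the limit. */
static bool acceptq_is_full(const struct toy_sock *sk)
{
	return sk->ack_backlog > sk->max_ack_backlog;
}

/* Older open-coded variant, for comparison. */
static bool acceptq_is_full_old(const struct toy_sock *sk)
{
	return sk->ack_backlog >= sk->max_ack_backlog;
}

/* Queue connections until the given test reports "full" and return how
 * many were admitted. */
static unsigned int admitted(unsigned int limit,
			     bool (*is_full)(const struct toy_sock *))
{
	struct toy_sock sk = { .ack_backlog = 0, .max_ack_backlog = limit };

	while (!is_full(&sk))
		sk.ack_backlog++;
	return sk.ack_backlog;
}

/* Usage example, expressed as assertions. */
static void example(void)
{
	assert(admitted(5, acceptq_is_full) == 6);
	assert(admitted(5, acceptq_is_full_old) == 5);
}
```

So with a limit of 5, the ">" test lets six connections queue while ">="
lets five; the point is only that the tree now uses one interpretation
everywhere.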


Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread David Miller
From: Brent Cook <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 17:16:48 -0500

> There has to be some thread that is responsible for reading
> events. Perhaps a reasonable thing for a blocked thread that cannot
> process events to do is to yield to one that can?

The reason one decentralizes event processing into threads is so that
once they are tasked to process some event they need not be concerned
with event state.

They are designed to process their event through to the end, then
return to the top level and say "any more work for me?"


Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread Brent Cook
On Monday 31 July 2006 17:00, David Miller wrote:
>
> So we'd have cases like this, assume we start with a full event
> queue:
>
>   thread A                thread B
>
>   dequeue event
>   aha, new connection
>   accept()
>                           register new kevent
>                           queue is now full again
>   add kevent on new
>   connection
>
> At this point thread A doesn't have very many options when the kevent
> add fails.  You cannot force this thread to read more events, since he
> may not be in a state where he is easily able to do so.

There has to be some thread that is responsible for reading events. Perhaps a 
reasonable thing for a blocked thread that cannot process events to do is to 
yield to one that can?



Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread David Miller
From: Evgeniy Polyakov <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 23:41:43 +0400

> Since kevents are never generated by the kernel, but only marked as ready,
> the length of the main queue acts as flow control, so we can create a
> mapped buffer with space equal to the main queue length multiplied by
> the size of the structure copied to userspace, plus 16 bits for the
> start index of the kernel writing side, i.e. it will store the offset
> where the oldest event was placed.
>
> Since queue length is a limiting factor, no new events can be added
> when the queue is full; that means the buffer is full too and userspace
> must read events. When the syscall is called to add a new kevent and the
> offset provided there differs from what the kernel stored, that means
> all events from the kernel's index to the provided index have been read
> and new events can be added.
> Thus we can even allow a read-only mapping. The kernel's index is
> incremented modulo queue length. If a kevent is removed after it was
> marked as ready, its copy stays in the mapped buffer, but a special flag
> can be assigned to show that the kevent is no longer valid.

This sounds reasonable.

However we must be mindful that the thread of control trying to
add a new event might not be in a position to drain the queue
of pending events when the queue is full.  Usually he will be
trying to add an event in response to handling another event.

So we'd have cases like this, assume we start with a full event
queue:

	thread A		thread B

	dequeue event
	aha, new connection
	accept()
				register new kevent
				queue is now full again
	add kevent on new
	connection

At this point thread A doesn't have very many options when the kevent
add fails.  You cannot force this thread to read more events, since he
may not be in a state where he is easily able to do so.


Re: BUG: warning at net/core/dev.c:1171/skb_checksum_help() 2.6.18-rc3

2006-07-31 Thread David Miller
From: Patrick McHardy <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 23:36:29 +0200

> David Miller wrote:
> > Does this matter?
> 
> I don't think it does. It's a huge corner case (unloading of the
> module which issued the QUEUE verdict while queueing the packet),
> and worst case is that we drop some segments or the entire packet.

Ok, that's what I thought.


Re: BUG: warning at net/core/dev.c:1171/skb_checksum_help() 2.6.18-rc3

2006-07-31 Thread Patrick McHardy
David Miller wrote:
> I noticed a subtle semantic change for nf_queue().  Previously, if we
> can't grab the module reference for the matching entry, we'd not free
> the skb, return 0, and the caller tries to iterate to the next hook.
> 
> That behavior is preserved for singleton frames, but that's not what
> happens for GSO frames.  Instead, the GSO frame is split up and we
> always return "1" even if some of the subsegments cause __nf_queue()
> to return 0 due to the case described in the previous paragraph.

I couldn't think of a better way to handle this except to just deliver
everything we can and drop the rest.  Since the caller doesn't know
anything about the individual segments, we can't simply deliver the
remaining ones to the next hook.

> It is, however, careful to free the skb with kfree_skb() so it doesn't
> cause a leak or anything like that.
> 
> Does this matter?

I don't think it does. It's a huge corner case (unloading of the
module which issued the QUEUE verdict while queueing the packet),
and worst case is that we drop some segments or the entire packet.



skge and sky2 backported driver.

2006-07-31 Thread Stephen Hemminger
This is a backport of the current 2.6.18 version of skge and sky2
drivers for use with older kernels.  The drivers depend on the CRC32
module.

It has been compiled and tested on RHEL (2.4) and 2.6.8
but should work on other kernels past that point. It does depend on 
ethtool_ops, if_vlan and mii support.

This version is somewhat different than the current
2.6 version:

 * no suspend/resume or wake on LAN

 * no support for reading original MAC address

 * no support for MSI on sky2

 * receive checksumming default off because earlier kernels often
   do not handle it properly with VLANs or PPP.

 * sky2 does not use NAPI because it required changes
   to netdevice.h and/or tweaking internals of netdevice
   interface to handle dual port status.

 * sky2 defaults to TSO off because until 2.6.13 there
   are issues with TCP congestion control and TSO.

 * sky2 doesn't use VLAN acceleration features; this
   probably doesn't make much difference.

THIS IS NOT SUPPORTED, IT IS PROVIDED AS IS.  In other words, go ahead
and mail bug reports to me <[EMAIL PROTECTED]> but don't expect me
to be able to fix them.

http://developer.osdl.org/shemminger/releases/skge-sky2-backport.tar.bz2


Re: Regarding offloading IPv6 addrconf and ndisc

2006-07-31 Thread David Miller

So all of you userland control-plane fanatics, how will you handle
things like NFS root with these daemon-required variants of NDISC and
ARP?

I know the devil's advocate responses already, so don't bother with
responses saying things like 1) "do it in the initial ramdisk, we only
need the daemon to setup the NDISC entries to talk to the NFS server"
or 2) "IPSEC's control plane is in userspace and therefore we can't do
NFS root over IPSEC, why is that ok and key'd NDISC is not?"

I think we are building systems which gradually are becoming less and
less reliable, with increasing numbers of possible points of failure.

Flexibility is overrated.  There are many crucial optimizations and
simplifications we cannot perform because we've made certain aspects
of network configuration far too flexible.


Re: BUG: warning at net/core/dev.c:1171/skb_checksum_help() 2.6.18-rc3

2006-07-31 Thread David Miller
From: Patrick McHardy <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 20:36:58 +0200

> I'm going to do some more testing now ..

Thanks for all of this work Patrick.

I noticed a subtle semantic change for nf_queue().  Previously, if we
can't grab the module reference for the matching entry, we'd not free
the skb, return 0, and the caller tries to iterate to the next hook.

That behavior is preserved for singleton frames, but that's not what
happens for GSO frames.  Instead, the GSO frame is split up and we
always return "1" even if some of the subsegments cause __nf_queue()
to return 0 due to the case described in the previous paragraph.

It is, however, careful to free the skb with kfree_skb() so it doesn't
cause a leak or anything like that.

Does this matter?



Re: neigh_lookup lockdep bug.

2006-07-31 Thread David Miller
From: Dave Jones <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 16:50:04 -0400

> 2.6.18rc2-git
> Something on my firewall box just triggered this..

Lockdep is perhaps confused.

> [515613.904945] swapper/0 is trying to acquire lock:
> [515613.931489]  (&tbl->lock){-+-+}, at: [] neigh_lookup+0x50/0xaf
> [515613.964369] 
> [515613.964373] but task is already holding lock:
> [515614.006550]  (&skb_queue_lock_key){-+..}, at: [] 
> neigh_proxy_process+0x20/0xc2

The skb_queue_lock in question is &tbl->proxy_queue.lock

> [515614.103459] the existing dependency chain (in reverse order) is:
> [515614.148752] 
> [515614.148755] -> #2 (&skb_queue_lock_key){-+..}:
> [515614.10][] lock_acquire+0x4b/0x6c
> [515614.215554][] _spin_lock_irqsave+0x22/0x32
> [515614.243606][] skb_dequeue+0x12/0x43
> [515614.269657][] skb_queue_purge+0x14/0x1b
> [515614.296565][] neigh_update+0x317/0x353

This is a different queue lock, namely &neigh->arp_queue.lock

Like the ipv6 trace we got yesterday from Matt Domsche, lockdep
is apparently confusing two instances of the skb_queue_lock_key.

> [515614.677724] -> #0 (&tbl->lock){-+-+}:
> [515614.707327][] lock_acquire+0x4b/0x6c
> [515614.729897][] _read_lock_bh+0x1e/0x2d
> [515614.752546][] neigh_lookup+0x50/0xaf
> [515614.774754][] neigh_event_ns+0x2c/0x77
> [515614.797271][] arp_process+0x366/0x4e4
> [515614.819349][] parp_redo+0x8/0xa
> [515614.839660][] neigh_proxy_process+0x66/0xc2
> [515614.862931][] run_timer_softirq+0x108/0x167
> [515614.886048][] __do_softirq+0x78/0xf2
> [515614.907136][] do_softirq+0x5a/0xbe
> [515614.927553] 

And this path takes &tbl->proxy_queue.lock, then &tbl->lock

I don't see the problem.


Re: [PATCH dscape] d80211: Switch d80211.h to IEEE80211_ style names

2006-07-31 Thread Michael Wu
On Monday 31 July 2006 13:31, John W. Linville wrote:
> As usual I'll depend on Jiri to merge d80211 stack patches, then
> send me a pull request.  If I apply your "Switch drivers to d80211"
> series now, that will undoubtedly cause a breakage when Jiri asks me
> to pull this later.
>
Yeah, there needs to be a new (and smaller) set of patches to switch drivers 
to the d80211.h header.

> I presume that at least parts of those patches will still be necessary
> or desirable after the d80211 symbol rename gets merged.  Are you
> preparing a new patch series for when that happens?
>
I'm not quite sure whether switching d80211_mgmt.h will be worthwhile. If it 
isn't, then I can get to preparing a new patch series. At any rate, I think 
the most important thing right now is fixing the conflicts w/ wireless-dev in 
the patch that updates the d80211 drivers to use the new names so we can get 
the rename stuff into wireless-dev.

-Michael Wu




neigh_lookup lockdep bug.

2006-07-31 Thread Dave Jones
2.6.18rc2-git
Something on my firewall box just triggered this..

Dave

[515613.791771] ===
[515613.841467] [ INFO: possible circular locking dependency detected ]
[515613.873284] ---
[515613.904945] swapper/0 is trying to acquire lock:
[515613.931489]  (&tbl->lock){-+-+}, at: [] neigh_lookup+0x50/0xaf
[515613.964369] 
[515613.964373] but task is already holding lock:
[515614.006550]  (&skb_queue_lock_key){-+..}, at: [] 
neigh_proxy_process+0x20/0xc2
[515614.043225] 
[515614.043228] which lock already depends on the new lock.
[515614.043234] 
[515614.103456] 
[515614.103459] the existing dependency chain (in reverse order) is:
[515614.148752] 
[515614.148755] -> #2 (&skb_queue_lock_key){-+..}:
[515614.10][] lock_acquire+0x4b/0x6c
[515614.215554][] _spin_lock_irqsave+0x22/0x32
[515614.243606][] skb_dequeue+0x12/0x43
[515614.269657][] skb_queue_purge+0x14/0x1b
[515614.296565][] neigh_update+0x317/0x353
[515614.323004][] arp_process+0x4aa/0x4e4
[515614.349004][] arp_rcv+0xd4/0xf1
[515614.373209][] netif_receive_skb+0x204/0x271
[515614.400405][] process_backlog+0x99/0xfa
[515614.426351][] net_rx_action+0x9d/0x196
[515614.451856][] __do_softirq+0x78/0xf2
[515614.476660][] do_softirq+0x5a/0xbe
[515614.500737] 
[515614.500741] -> #1 (&n->lock){-+-+}:
[515614.532763][] lock_acquire+0x4b/0x6c
[515614.556814][] _write_lock+0x19/0x28
[515614.580398][] neigh_periodic_timer+0x98/0x13c
[515614.606447][] run_timer_softirq+0x108/0x167
[515614.631798][] __do_softirq+0x78/0xf2
[515614.655122][] do_softirq+0x5a/0xbe
[515614.677721] 
[515614.677724] -> #0 (&tbl->lock){-+-+}:
[515614.707327][] lock_acquire+0x4b/0x6c
[515614.729897][] _read_lock_bh+0x1e/0x2d
[515614.752546][] neigh_lookup+0x50/0xaf
[515614.774754][] neigh_event_ns+0x2c/0x77
[515614.797271][] arp_process+0x366/0x4e4
[515614.819349][] parp_redo+0x8/0xa
[515614.839660][] neigh_proxy_process+0x66/0xc2
[515614.862931][] run_timer_softirq+0x108/0x167
[515614.886048][] __do_softirq+0x78/0xf2
[515614.907136][] do_softirq+0x5a/0xbe
[515614.927553] 
[515614.927557] other info that might help us debug this:
[515614.927563] 
[515614.966774] 1 lock held by swapper/0:
[515614.982693]  #0:  (&skb_queue_lock_key){-+..}, at: [] 
neigh_proxy_process+0x20/0xc2
[515615.013575] 
[515615.013578] stack backtrace:
[515615.037414]  [] show_trace_log_lvl+0x54/0xfd
[515615.057910]  [] show_trace+0xd/0x10
[515615.075934]  [] dump_stack+0x19/0x1b
[515615.094167]  [] print_circular_bug_tail+0x59/0x64
[515615.116172]  [] __lock_acquire+0x808/0x997
[515615.136514]  [] lock_acquire+0x4b/0x6c
[515615.155699]  [] _read_lock_bh+0x1e/0x2d
[515615.175098]  [] neigh_lookup+0x50/0xaf
[515615.197276]  [] neigh_event_ns+0x2c/0x77
[515615.220267]  [] arp_process+0x366/0x4e4
[515615.243248]  [] parp_redo+0x8/0xa
[515615.264645]  [] neigh_proxy_process+0x66/0xc2
[515615.288899]  [] run_timer_softirq+0x108/0x167
[515615.309972]  [] __do_softirq+0x78/0xf2
[515615.328940]  [] do_softirq+0x5a/0xbe
[515615.347150]  [] irq_exit+0x3d/0x3f
[515615.365067]  [] smp_apic_timer_interrupt+0x79/0x7e
[515615.387057]  [] apic_timer_interrupt+0x2a/0x30


-- 
http://www.codemonkey.org.uk


Re: [PATCH dscape] d80211: Switch d80211.h to IEEE80211_ style names

2006-07-31 Thread John W. Linville
On Thu, Jul 27, 2006 at 12:37:14AM -0700, Michael Wu wrote:
> Alright, I've replaced all + lines with spaces with tabs.
> 
> I also fixed one long line. The rest of them are nearly impossible to shorten 
> well. The "(fc & IEEE80211_FCTL_FTYPE) == IEEE80211_FTYPE_DATA" style is 
> really killing us (unless you want to break after ==, which is rather bad). I 
> think we should switch back to a macro for that.
> 
> I would prefer if we could get this merged soon and put in that line 
> shortening macro later (or whatever solution that's best), but it's your 
> call.
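
For illustration, a line-shortening helper of the kind suggested above
could look roughly like this.  FC_IS_TYPE() and is_data_frame() are
hypothetical names, not part of d80211.h; the mask and type values are
the usual IEEE 802.11 frame control ones, and fc is assumed to be in
host byte order already.

```c
#include <assert.h>
#include <stdint.h>

/* Frame control type mask and type values. */
#define IEEE80211_FCTL_FTYPE	0x000c
#define IEEE80211_FTYPE_MGMT	0x0000
#define IEEE80211_FTYPE_CTL	0x0004
#define IEEE80211_FTYPE_DATA	0x0008

/* Hypothetical helper: wraps the long open-coded comparison. */
#define FC_IS_TYPE(fc, ftype)	(((fc) & IEEE80211_FCTL_FTYPE) == (ftype))

static int is_data_frame(uint16_t fc)
{
	return FC_IS_TYPE(fc, IEEE80211_FTYPE_DATA);
}

/* Usage example, expressed as assertions. */
static void example(void)
{
	assert(is_data_frame(0x0008));		/* plain data frame */
	assert(!is_data_frame(0x0080));		/* management (beacon) */
	assert(FC_IS_TYPE(0x00d4, IEEE80211_FTYPE_CTL));	/* control (ACK) */
}
```

A call site then reads "if (FC_IS_TYPE(fc, IEEE80211_FTYPE_DATA))",
which fits within 80 columns more easily than the open-coded form.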

Michael,

As usual I'll depend on Jiri to merge d80211 stack patches, then
send me a pull request.  If I apply your "Switch drivers to d80211"
series now, that will undoubtedly cause a breakage when Jiri asks me
to pull this later.

I presume that at least parts of those patches will still be necessary
or desirable after the d80211 symbol rename gets merged.  Are you
preparing a new patch series for when that happens?

Thanks,

John
-- 
John W. Linville
[EMAIL PROTECTED]


Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-07-31 Thread David Miller
From: Krzysztof Halasa <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 22:04:33 +0200

> This is non-trivial because hard_header and hard_start_xmit
> functions currently can't return new skb address (hard_header()
> can't use skb_realloc_headroom() at all, xmit() can't use it if
> there is a need to requeue the packet).
> 
> Or can you just realloc the data portion of skb without changing skb
> struct address? The skb may be referenced by other things.

Krzysztof, which device driver exactly creates this problem
in the first place?


Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-07-31 Thread David Miller
From: Krzysztof Halasa <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 22:04:33 +0200

> Alexey Kuznetsov <[EMAIL PROTECTED]> writes:
> 
> > All the rest of places just check, that there is enough space
> > for their immediate needs. If dev->hard_header() is NULL, it means that
> > stack does not need any space at all, so that it does not need to worry.
> 
> Why do you think dev->hard_header == NULL means there is no need for
> header space? Isn't it dev->hard_header_len = 0? Why would a device
> set hard_header_len to non-zero if it doesn't need header space?

If you have headers to prepend for your device, why do you set the
header building function to NULL? :-)


Re: SMSC LAN911x and LAN921x vendor driver

2006-07-31 Thread Steve . Glendinning
Hi Francois,

Thanks for your feedback, I have a few questions.

> > +return serviced;
> > +}
> > +
> > +/* Autodetects and initialises external phy for SMSC9115 and SMSC9117 
flavors.
> > + * If something goes wrong, returns -ENODEV to revert back to 
internal phy. */
> > +static int smsc911x_phy_initialise_external(struct smsc911x_data 
*pdata)
> > +{
> > +unsigned int address;
> > +unsigned int hwcfg;
> > +unsigned int phyid1;
> > +unsigned int phyid2;
> > +
> > +hwcfg = smsc911x_reg_read(pdata, HW_CFG);
> > +
> > +/* External phy is requested, supported, and detected */
> > +if (hwcfg & HW_CFG_EXT_PHY_DET_) {
> > +
> > +/* Attempt to switch to external phy for auto-detecting
> > + * its address. Assuming tx and rx are stopped because
> > + * smsc911x_phy_initialise is called before
> > + * smsc911x_rx_initialise and tx_initialise.
> > + */
> > +
> > +/* Disable phy clocks to the MAC */
> > +hwcfg &= (~HW_CFG_PHY_CLK_SEL_);
> > +hwcfg |= HW_CFG_PHY_CLK_SEL_CLK_DIS_;
> > +smsc911x_reg_write(hwcfg, pdata, HW_CFG);
> > +udelay(10); /* Enough time for clocks to stop */
> 
> I assume that writes are never posted, right ?
> 

I don't understand the question, what do you mean?


> > +static void smsc911x_rx_multicast_update(struct smsc911x_data *pdata)
> > +{
> > +unsigned long flags;
> > +unsigned int timeout;
> > +unsigned int mac_cr;
> > +
> > +/* This function is only called for older LAN911x devices
> > + * (revA or revB), where MAC_CR, HASHH and HASHL should not
> > + * be modified during Rx - newer devices immediately update the
> > + * registers */
> > +
> > +local_irq_save(flags);
> > +
> > +/* Stop Rx */
> > +mac_cr = smsc911x_mac_read(pdata, MAC_CR);
> > +mac_cr &= ~(MAC_CR_RXEN_);
> > +smsc911x_mac_write(pdata, MAC_CR, mac_cr);
> > +
> > +/* Poll until Rx has stopped.  If a frame is being received, this will
> > + * block until the end of this frame.  (this may take a long time at
> > + * 10Mbps) */
> > +timeout = 2000;
> > +while ((timeout--)
> > +   && (!(smsc911x_reg_read(pdata, INT_STS) & INT_STS_RXSTOP_INT_))) {
> > +udelay(1);
> 
> 
> In a completely ideal world the driver would probably race outside of an
> irq disabled section until it grabs the napi poll handler, thus preserving
> the nice low latency property of the kernel.
> 
> Nevermind :o}

Agreed, I would like to find a nicer way to do this.  It's a nasty
workaround to a nasty hardware issue :o}

There are two problems.  First, on older hardware revisions the multicast
hash filters (as well as the promisc flag) cannot be modified while rx is
active (bad things might happen).  Second, there is an interrupt which can
be used to indicate RX has stopped, but on early hardware this is not 100%
reliable.

The current solution is the simplest option, and works.  A better way
could be to use the RX_STOP interrupt, but also schedule a task to run
later "just in case"?
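
For illustration, the bounded poll described above can be sketched in plain
userspace C. This is only a sketch under stated assumptions: poll_until(),
struct fake_rx and fake_rx_stopped() are hypothetical names standing in for
the INT_STS read, they are not part of the smsc911x driver, and the
udelay(1) between reads is left out. The point is that the helper reports
success and timeout through distinct return values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for "read INT_STS and test the RXSTOP bit". */
typedef bool (*cond_fn)(void *ctx);

/*
 * Call cond() at most max_polls times.
 * Returns the number of polls left when cond() first returned true,
 * or -1 if the budget ran out (timeout).
 * A driver would additionally delay between reads.
 */
static int poll_until(cond_fn cond, void *ctx, int max_polls)
{
    assert(cond != NULL && max_polls >= 0);
    while (max_polls-- > 0) {
        if (cond(ctx))
            return max_polls;
    }
    return -1;
}

/* Fake hardware: reports "stopped" after a fixed number of reads. */
struct fake_rx {
    int reads_until_stop;
};

static bool fake_rx_stopped(void *ctx)
{
    struct fake_rx *rx = ctx;
    return rx->reads_until_stop-- <= 0;
}
```

With a fake device that needs a handful of reads, poll_until() returns a
non-negative value; with a budget that is too small it returns -1, which is
the point where a deferred "just in case" task could take over.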

Best Regards,
--
Steve Glendinning
SMSC GmbH
m: +44 777 933 9124
e: [EMAIL PROTECTED]



Re: Linux TCP in the presence of delays or drops...

2006-07-31 Thread David Miller
From: Oumer Teyeb <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 19:49:28 +0200

> it would be so great if some of you could spare a few minutes and take a 
> look at the traces I provided. See below for the original posting...

If people are too backlogged and busy to reply to your original
posting, you will only ensure that it will take even longer by
bombarding the list with even more information and questions on
top of your original large query.

Just be patient.


Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables

2006-07-31 Thread David Miller
From: Thomas Graf <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 17:41:42 +0200

> * Herbert Xu <[EMAIL PROTECTED]> 2006-08-01 00:01
> > Actually, if we're adding policy routing, we should seriously consider
> > whether living without a routing cache is still viable or not because
> > the cost of a route lookup has just gone up.
> 
> Absolutely.

This is something I wanted to bring up too.


Re: Please pull 'bcm43xx' branch of wireless-2.6?

2006-07-31 Thread John W. Linville
On Mon, Jul 31, 2006 at 04:01:43PM -0400, Jeff Garzik wrote:
> John W. Linville wrote:
> >Jeff, if a 10ms maximum delay is still acceptable to you, then please
> >pull from the bcm43xx branch of wireless-2.6 into the upstream branch
> >of netdev-2.6.
> 
> Just to be clear, 'upstream' not 'upstream-fixes', correct?

Yes, queued for 2.6.19.

Thanks,

John
-- 
John W. Linville
[EMAIL PROTECTED]


Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework

2006-07-31 Thread Thomas Graf
* Patrick McHardy <[EMAIL PROTECTED]> 2006-07-31 20:01
> Thomas Graf wrote:
> > * Ville Nuorvala <[EMAIL PROTECTED]> 2006-07-31 17:46
> >
> >>Shouldn't all these (struct fib_rule_hdr included) actually be defined
> >>in include/linux/rtnetlink.h?
> > 
> > 
> > We used to stuff everything into rtnetlink.h for no good reason. Having
> > independant include/linux/.h to export the interface to
> > userspace and include/net/.h to export the kernel interface
> > instead of contributing to the ifdef hell seems a lot cleaner to me.
> 
> 
> I agree, but then we should also split up rtnetlink.h. Having one
> special case will just make it harder to find.

Already done in the patchset converting things to the new netlink
interface that I'll start submitting in the next few days.


Re: Please pull 'bcm43xx' branch of wireless-2.6?

2006-07-31 Thread Jeff Garzik

John W. Linville wrote:
> Jeff, if a 10ms maximum delay is still acceptable to you, then please
> pull from the bcm43xx branch of wireless-2.6 into the upstream branch
> of netdev-2.6.

Just to be clear, 'upstream' not 'upstream-fixes', correct?

> P.S. FWIW, I'm still not totally happy w/ the (potential for a)
> long busy wait.  But, this series of patches makes things better by
> 100x over what is currently in the tree.  So, it seems worthwhile.
> I'll keep further reductions as an item on my TODO list, FWIW... :-)

Agreed.  And there are some existing busy-waits that (1) _obviously_ 
need to be converted to mdelay(), and (2) eventually need to be 
converted to msleep().


Overall, the Linux kernel community consensus is that long synchronous 
delays spinning the CPU should be avoided.


Jeff





Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread Evgeniy Polyakov
On Mon, Jul 31, 2006 at 02:33:22PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
> Ok, let's do it in the following way:
> I present new version of kevent with new syscalls and fixed issues mentioned
> before, while people look at it we can end up with mapped buffer design.
> Is it ok?

Since kevents are never generated by the kernel, but only marked as ready,
the length of the main queue acts as flow control. So we can create a
mapped buffer whose size equals the main queue length multiplied by the
size of the structure copied to userspace, plus 16 bits for the start
index of the kernel writing side, i.e. it will store the offset where the
oldest event was placed.
Since the queue length is a limiting factor, no new events can be added
when the queue is full; that means the buffer is full too and userspace
must read events. When the syscall is called to add a new kevent and the
offset provided there differs from what the kernel stored, that means all
events from the kernel's index up to the provided index have been read and
new events can be added. Thus we can even allow a read-only mapping. The
kernel's index is incremented modulo the queue length. If a kevent was
removed after it was marked as ready, its copy stays in the mapped buffer,
but a special flag can be assigned to show that the kevent is no longer
valid.
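
A rough userspace sketch of this buffer, to make the index handling
concrete. Everything named here is an assumption made for illustration
(QLEN, EV_INVALID, struct uevent, struct ring and the ring_*() helpers);
it is not the kevent code, and the "pending" counter would be kernel
private rather than part of the mapping.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QLEN       4     /* hypothetical main queue length */
#define EV_INVALID 0x1u  /* hypothetical "kevent was removed" flag */

/* Hypothetical structure copied to userspace for each ready event. */
struct uevent {
    uint32_t id;
    uint32_t flags;
};

struct ring {
    uint16_t kidx;           /* offset of the oldest unread event */
    uint16_t pending;        /* ready events not yet acknowledged */
    struct uevent ev[QLEN];  /* one slot per main queue entry */
};

static void ring_init(struct ring *r)
{
    assert(r != NULL);
    memset(r, 0, sizeof(*r));
}

/* Kernel side: mark an event ready. Fails when the queue (and therefore
 * the buffer) is full, which is the flow control described above. */
static int ring_put(struct ring *r, uint32_t id)
{
    if (r->pending == QLEN)
        return -1;
    struct uevent *e = &r->ev[(r->kidx + r->pending) % QLEN];
    e->id = id;
    e->flags = 0;
    r->pending++;
    return 0;
}

/* Syscall side: userspace passes the offset it has read up to. If it
 * differs from kidx, the slots in between are free again. Returns the
 * number of slots released, or -1 for an offset that cannot be valid.
 * uidx == kidx means "nothing read", so in this simplified sketch a
 * completely full buffer cannot be acknowledged in a single step. */
static int ring_ack(struct ring *r, uint16_t uidx)
{
    if (uidx >= QLEN)
        return -1;
    uint16_t n = (uint16_t)((uidx + QLEN - r->kidx) % QLEN);
    if (n > r->pending)
        return -1;
    r->kidx = uidx;
    r->pending = (uint16_t)(r->pending - n);
    return n;
}

/* A kevent removed after being marked ready keeps its slot but is flagged. */
static void ring_invalidate(struct ring *r, uint32_t id)
{
    for (uint16_t i = 0; i < r->pending; i++) {
        struct uevent *e = &r->ev[(r->kidx + i) % QLEN];
        if (e->id == id)
            e->flags |= EV_INVALID;
    }
}
```

The ambiguity noted in ring_ack() (an offset equal to the kernel index
could mean "nothing read" or "everything read") is a property of this
sketch; how the real design resolves it is not stated above.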


-- 
Evgeniy Polyakov


Please pull 'bcm43xx' branch of wireless-2.6?

2006-07-31 Thread John W. Linville
On Thu, Jul 27, 2006 at 08:37:53PM -0400, John W. Linville wrote:
> As most of us are painfully aware, there is a blockage in getting
> bcm43xx patches upstream.(*)

> (*) http://marc.theaimsgroup.com/?l=linux-netdev&m=115137403631920&w=2

After re-reading that thread, I realized that Jeff had indicated that
the original maximum delay of 10ms would be acceptable to him.

http://marc.theaimsgroup.com/?l=linux-netdev&m=115138026614994&w=2

The current bcm43xx sources had already reduced the maximum delay to
100ms, and the d80211 (aka wireless-dev) driver had already dropped
it to 10ms.  So, I applied a simple patch to drop the delay to 10ms
on this branch as well.

Jeff, if a 10ms maximum delay is still acceptable to you, then please
pull from the bcm43xx branch of wireless-2.6 into the upstream branch
of netdev-2.6.

Thanks,

John

P.S. FWIW, I'm still not totally happy w/ the (potential for a)
long busy wait.  But, this series of patches makes things better by
100x over what is currently in the tree.  So, it seems worthwhile.
I'll keep further reductions as an item on my TODO list, FWIW... :-)

---

The following changes since commit 8f0f850e240df5bea027caeb1723142c50e37e57:
  Daniel Drake:
softmac: Add MAINTAINERS entry

are found in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git bcm43xx

John W. Linville:
  bcm43xx: fix-up build breakage from merging patches out of order
  bcm43xx: reduce mac_suspend delay loop count

Larry Finger:
  bcm43xx: improved statistics
  bcm43xx: add missing mac_suspended initialization

Michael Buesch:
  bcm43xx: suspend MAC while executing long pwork
  bcm43xx: lower mac_suspend udelay
  bcm43xx: fix mac_suspend refcount
  bcm43xx: init routine rewrite

 drivers/net/wireless/bcm43xx/bcm43xx.h |   52 +-
 drivers/net/wireless/bcm43xx/bcm43xx_debugfs.c |   46 ++
 drivers/net/wireless/bcm43xx/bcm43xx_debugfs.h |1 
 drivers/net/wireless/bcm43xx/bcm43xx_main.c|  687 ++--
 drivers/net/wireless/bcm43xx/bcm43xx_main.h|3 
 drivers/net/wireless/bcm43xx/bcm43xx_sysfs.c   |   70 ++
 drivers/net/wireless/bcm43xx/bcm43xx_wx.c  |   28 +
 drivers/net/wireless/bcm43xx/bcm43xx_xmit.c|5 
 8 files changed, 565 insertions(+), 327 deletions(-)

diff --git a/drivers/net/wireless/bcm43xx/bcm43xx.h b/drivers/net/wireless/bcm43xx/bcm43xx.h
index ee6571e..c6ee1e9 100644
--- a/drivers/net/wireless/bcm43xx/bcm43xx.h
+++ b/drivers/net/wireless/bcm43xx/bcm43xx.h
@@ -504,6 +504,12 @@ struct bcm43xx_phyinfo {
 * This lock is only used by bcm43xx_phy_{un}lock()
 */
spinlock_t lock;
+
+   /* Firmware. */
+   const struct firmware *ucode;
+   const struct firmware *pcm;
+   const struct firmware *initvals0;
+   const struct firmware *initvals1;
 };
 
 
@@ -593,12 +599,14 @@ struct bcm43xx_coreinfo {
u8 available:1,
   enabled:1,
   initialized:1;
-   /** core_id ID number */
-   u16 id;
/** core_rev revision number */
u8 rev;
/** Index number for _switch_core() */
u8 index;
+   /** core_id ID number */
+   u16 id;
+   /** Core-specific data. */
+   void *priv;
 };
 
 /* Additional information for each 80211 core. */
@@ -647,7 +655,10 @@ enum {
BCM43xx_STAT_RESTARTING,/* controller_restart() called. */
 };
 #define bcm43xx_status(bcm)atomic_read(&(bcm)->init_status)
-#define bcm43xx_set_status(bcm, stat)  atomic_set(&(bcm)->init_status, (stat))
+#define bcm43xx_set_status(bcm, stat)  do {\
+   atomic_set(&(bcm)->init_status, (stat));\
+   smp_wmb();  \
+   } while (0)
 
 /**** THEORY OF LOCKING ***
  *
@@ -721,10 +732,6 @@ #endif
struct bcm43xx_coreinfo core_80211[ BCM43xx_MAX_80211_CORES ];
/* Additional information, specific to the 80211 cores. */
struct bcm43xx_coreinfo_80211 core_80211_ext[ BCM43xx_MAX_80211_CORES ];
-   /* Index of the current 80211 core. If current_core is not
-* an 80211 core, this is -1.
-*/
-   int current_80211_core_idx;
/* Number of available 80211 cores. */
int nr_80211_available;
 
@@ -737,6 +744,8 @@ #endif
u32 irq_savedstate;
/* Link Quality calculation context. */
struct bcm43xx_noise_calculation noisecalc;
+   /* if > 0 MAC is suspended. if == 0 MAC is enabled. */
+   int mac_suspended;
 
/* Threshold values. */
//TODO: The RTS thr has to be _used_. Currently, it is only set via WX.
@@ -759,12 +768,6 @@ #endif
struct bcm43xx_key key[54];
u8 default_key_idx;
 
-   /* Firmware. */
-   const struct firmware *ucode;
-   const struct firmware *pcm;
-   const struct firmware *initvals0;
-   const struct firmware

[PATCH 2/2] forcedeth: mac address corrected

2006-07-31 Thread Ayaz Abdulla
This patch corrects the MAC address byte order and sets a flag to indicate
that it has already been corrected, in case nv_probe is called again (for
example, when you use kexec to restart the kernel).


Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]>

--- orig-2.6/drivers/net/forcedeth.c2006-07-06 15:06:27.0 -0400
+++ new-2.6/drivers/net/forcedeth.c 2006-07-06 15:06:58.0 -0400
@@ -109,6 +109,7 @@
  * 0.54: 21 Mar 2006: Fix spin locks for multi irqs and cleanup.
  * 0.55: 22 Mar 2006: Add flow control (pause frame).
  * 0.56: 22 Mar 2006: Additional ethtool config and moduleparam support.
+ * 0.57: 14 May 2006: Mac address set in probe/remove and order corrections.
  *
  * Known bugs:
  * We suspect that on some hardware no TX done interrupts are generated.
@@ -120,7 +121,7 @@
  * DEV_NEED_TIMERIRQ will not harm you on sane hardware, only generating a few
  * superfluous timer interrupts from the nic.
  */
-#define FORCEDETH_VERSION  "0.56"
+#define FORCEDETH_VERSION  "0.57"
 #define DRV_NAME   "forcedeth"
 
 #include 
@@ -262,7 +263,8 @@
NvRegRingSizes = 0x108,
 #define NVREG_RINGSZ_TXSHIFT 0
 #define NVREG_RINGSZ_RXSHIFT 16
-   NvRegUnknownTransmitterReg = 0x10c,
+   NvRegTransmitPoll = 0x10c,
+#define NVREG_TRANSMITPOLL_MAC_ADDR_REV0x8000
NvRegLinkSpeed = 0x110,
 #define NVREG_LINKSPEED_FORCE 0x1
 #define NVREG_LINKSPEED_10 1000
@@ -1178,7 +1180,7 @@
KERN_INFO "nv_stop_tx: TransmitterStatus remained busy");
 
udelay(NV_TXSTOP_DELAY2);
-   writel(0, base + NvRegUnknownTransmitterReg);
+   writel(readl(base + NvRegTransmitPoll) & NVREG_TRANSMITPOLL_MAC_ADDR_REV, base + NvRegTransmitPoll);
 }
 
 static void nv_txrx_reset(struct net_device *dev)
@@ -3917,7 +3919,7 @@
oom = nv_init_ring(dev);
 
writel(0, base + NvRegLinkSpeed);
-   writel(0, base + NvRegUnknownTransmitterReg);
+   writel(readl(base + NvRegTransmitPoll) & NVREG_TRANSMITPOLL_MAC_ADDR_REV, base + NvRegTransmitPoll);
nv_txrx_reset(dev);
writel(0, base + NvRegUnknownSetupReg6);
 
@@ -4082,7 +4084,7 @@
unsigned long addr;
u8 __iomem *base;
int err, i;
-   u32 powerstate;
+   u32 powerstate, txreg;
 
dev = alloc_etherdev(sizeof(struct fe_priv));
err = -ENOMEM;
@@ -4269,12 +4271,30 @@
np->orig_mac[0] = readl(base + NvRegMacAddrA);
np->orig_mac[1] = readl(base + NvRegMacAddrB);
 
-   dev->dev_addr[0] = (np->orig_mac[1] >>  8) & 0xff;
-   dev->dev_addr[1] = (np->orig_mac[1] >>  0) & 0xff;
-   dev->dev_addr[2] = (np->orig_mac[0] >> 24) & 0xff;
-   dev->dev_addr[3] = (np->orig_mac[0] >> 16) & 0xff;
-   dev->dev_addr[4] = (np->orig_mac[0] >>  8) & 0xff;
-   dev->dev_addr[5] = (np->orig_mac[0] >>  0) & 0xff;
+   /* check the workaround bit for correct mac address order */
+   txreg = readl(base + NvRegTransmitPoll);
+   if (txreg & NVREG_TRANSMITPOLL_MAC_ADDR_REV) {
+   /* mac address is already in correct order */
+   dev->dev_addr[0] = (np->orig_mac[0] >>  0) & 0xff;
+   dev->dev_addr[1] = (np->orig_mac[0] >>  8) & 0xff;
+   dev->dev_addr[2] = (np->orig_mac[0] >> 16) & 0xff;
+   dev->dev_addr[3] = (np->orig_mac[0] >> 24) & 0xff;
+   dev->dev_addr[4] = (np->orig_mac[1] >>  0) & 0xff;
+   dev->dev_addr[5] = (np->orig_mac[1] >>  8) & 0xff;
+   } else {
+   /* need to reverse mac address to correct order */
+   dev->dev_addr[0] = (np->orig_mac[1] >>  8) & 0xff;
+   dev->dev_addr[1] = (np->orig_mac[1] >>  0) & 0xff;
+   dev->dev_addr[2] = (np->orig_mac[0] >> 24) & 0xff;
+   dev->dev_addr[3] = (np->orig_mac[0] >> 16) & 0xff;
+   dev->dev_addr[4] = (np->orig_mac[0] >>  8) & 0xff;
+   dev->dev_addr[5] = (np->orig_mac[0] >>  0) & 0xff;
+   /* set permanent address to be correct as well */
+   np->orig_mac[0] = (dev->dev_addr[0] << 0) + (dev->dev_addr[1] << 8) +
+   (dev->dev_addr[2] << 16) + (dev->dev_addr[3] << 24);
+   np->orig_mac[1] = (dev->dev_addr[4] << 0) + (dev->dev_addr[5] << 8);
+   writel(txreg|NVREG_TRANSMITPOLL_MAC_ADDR_REV, base + NvRegTransmitPoll);
+   }
memcpy(dev->perm_addr, dev->dev_addr, dev->addr_len);
 
if (!is_valid_ether_addr(dev->perm_addr)) {
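
The two byte orders handled in the hunk above can be tried out in
isolation with a small userspace sketch. The helper names
(mac_from_regs(), mac_from_regs_reversed(), mac_to_regs()) are made up for
this illustration and do not exist in forcedeth.c; the two 32-bit
arguments stand for the values read from NvRegMacAddrA and NvRegMacAddrB.
Only the shifts are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Registers already hold the address in the corrected order. */
static void mac_from_regs(uint32_t a, uint32_t b, uint8_t mac[6])
{
    mac[0] = (uint8_t)((a >>  0) & 0xff);
    mac[1] = (uint8_t)((a >>  8) & 0xff);
    mac[2] = (uint8_t)((a >> 16) & 0xff);
    mac[3] = (uint8_t)((a >> 24) & 0xff);
    mac[4] = (uint8_t)((b >>  0) & 0xff);
    mac[5] = (uint8_t)((b >>  8) & 0xff);
}

/* Registers hold the address in the reversed (uncorrected) order. */
static void mac_from_regs_reversed(uint32_t a, uint32_t b, uint8_t mac[6])
{
    mac[0] = (uint8_t)((b >>  8) & 0xff);
    mac[1] = (uint8_t)((b >>  0) & 0xff);
    mac[2] = (uint8_t)((a >> 24) & 0xff);
    mac[3] = (uint8_t)((a >> 16) & 0xff);
    mac[4] = (uint8_t)((a >>  8) & 0xff);
    mac[5] = (uint8_t)((a >>  0) & 0xff);
}

/* Pack an address into the two register values in the corrected order.
 * The casts keep the shifts in unsigned 32-bit arithmetic. */
static void mac_to_regs(const uint8_t mac[6], uint32_t *a, uint32_t *b)
{
    assert(a != NULL && b != NULL);
    *a = (uint32_t)mac[0] | ((uint32_t)mac[1] << 8) |
         ((uint32_t)mac[2] << 16) | ((uint32_t)mac[3] << 24);
    *b = (uint32_t)mac[4] | ((uint32_t)mac[5] << 8);
}
```

Decoding a reversed register pair, repacking it with mac_to_regs() and
decoding again with mac_from_regs() yields the same six bytes, which is
the round trip the flag bit is meant to make safe across a second probe.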


[PATCH 1/2] forcedeth: move mac address setup/teardown

2006-07-31 Thread Ayaz Abdulla
This patch moves the mac address setup/teardown to the 
nv_probe/nv_remove functions. This fixes WOL wakeup since on nv_close we 
would reverse the mac address. Also, bonding driver will reset address 
after nv_close is called.


Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]>

--- orig-2.6/drivers/net/forcedeth.c2006-07-06 15:05:31.0 -0400
+++ new-2.6/drivers/net/forcedeth.c 2006-07-06 15:05:54.0 -0400
@@ -3895,10 +3895,9 @@
 
dprintk(KERN_DEBUG "nv_open: begin\n");
 
-   /* 1) erase previous misconfiguration */
+   /* erase previous misconfiguration */
if (np->driver_data & DEV_HAS_POWER_CNTRL)
nv_mac_reset(dev);
-   /* 4.1-1: stop adapter: ignored, 4.3 seems to be overkill */
writel(NVREG_MCASTADDRA_FORCE, base + NvRegMulticastAddrA);
writel(0, base + NvRegMulticastAddrB);
writel(0, base + NvRegMulticastMaskA);
@@ -3913,7 +3912,7 @@
if (np->pause_flags & NV_PAUSEFRAME_TX_CAPABLE)
writel(NVREG_TX_PAUSEFRAME_DISABLE,  base + NvRegTxPauseFrame);
 
-   /* 2) initialize descriptor rings */
+   /* initialize descriptor rings */
set_bufsize(dev);
oom = nv_init_ring(dev);
 
@@ -3924,15 +3923,11 @@
 
np->in_shutdown = 0;
 
-   /* 3) set mac address */
-   nv_copy_mac_to_hw(dev);
-
-   /* 4) give hw rings */
+   /* give hw rings */
setup_hw_rings(dev, NV_SETUP_RX_RING | NV_SETUP_TX_RING);
writel( ((np->rx_ring_size-1) << NVREG_RINGSZ_RXSHIFT) + ((np->tx_ring_size-1) << NVREG_RINGSZ_TXSHIFT),
base + NvRegRingSizes);
 
-   /* 5) continue setup */
writel(np->linkspeed, base + NvRegLinkSpeed);
if (np->desc_ver == DESC_VER_1)
writel(NVREG_TX_WM_DESC1_DEFAULT, base + NvRegTxWatermark);
@@ -3950,7 +3945,6 @@
writel(NVREG_IRQSTAT_MASK, base + NvRegIrqStatus);
writel(NVREG_MIISTAT_MASK2, base + NvRegMIIStatus);
 
-   /* 6) continue setup */
writel(NVREG_MISC1_FORCE | NVREG_MISC1_HD, base + NvRegMisc1);
writel(readl(base + NvRegTransmitterStatus), base + NvRegTransmitterStatus);
writel(NVREG_PFF_ALWAYS, base + NvRegPacketFilterFlags);
@@ -4076,12 +4070,6 @@
if (np->wolenabled)
nv_start_rx(dev);
 
-   /* special op: write back the misordered MAC address - otherwise
-* the next nv_probe would see a wrong address.
-*/
-   writel(np->orig_mac[0], base + NvRegMacAddrA);
-   writel(np->orig_mac[1], base + NvRegMacAddrB);
-
/* FIXME: power down nic */
 
return 0;
@@ -4309,6 +4297,9 @@
dev->dev_addr[0], dev->dev_addr[1], dev->dev_addr[2],
dev->dev_addr[3], dev->dev_addr[4], dev->dev_addr[5]);
 
+   /* set mac address */
+   nv_copy_mac_to_hw(dev);
+
/* disable WOL */
writel(0, base + NvRegWakeUpFlags);
np->wolenabled = 0;
@@ -4421,9 +4412,17 @@
 static void __devexit nv_remove(struct pci_dev *pci_dev)
 {
struct net_device *dev = pci_get_drvdata(pci_dev);
+   struct fe_priv *np = netdev_priv(dev);
+   u8 __iomem *base = get_hwbase(dev);
 
unregister_netdev(dev);
 
+   /* special op: write back the misordered MAC address - otherwise
+* the next nv_probe would see a wrong address.
+*/
+   writel(np->orig_mac[0], base + NvRegMacAddrA);
+   writel(np->orig_mac[1], base + NvRegMacAddrB);
+
/* free all structures */
free_rings(dev);
iounmap(get_hwbase(dev));


Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework

2006-07-31 Thread Patrick McHardy
Thomas Graf wrote:
> * Ville Nuorvala <[EMAIL PROTECTED]> 2006-07-31 17:46
>
>>Shouldn't all these (struct fib_rule_hdr included) actually be defined
>>in include/linux/rtnetlink.h?
> 
> 
> We used to stuff everything into rtnetlink.h for no good reason. Having
> independant include/linux/.h to export the interface to
> userspace and include/net/.h to export the kernel interface
> instead of contributing to the ifdef hell seems a lot cleaner to me.


I agree, but then we should also split up rtnetlink.h. Having one
special case will just make it harder to find.



Re: Linux TCP in the presence of delays or drops...

2006-07-31 Thread Oumer Teyeb

Hi,

it would be so great if some of you could spare a few minutes and take a
look at the traces I provided. See below for the original posting... I
just had a couple of things to add which I noticed in Linux TCP
behaviour and which I have not seen documented anywhere else (or which I
might have misread..:-)... and below I have given yet another trace that
illustrates one of the Linux TCP behaviours which I am having trouble
understanding.


-If multiple timeouts occur for one packet, then even if we are using the
timestamp option or FRTO, Linux TCP is not able to detect spurious
retransmissions. Linux TCP is able to detect spurious retransmissions
only for a single timeout for one packet, or for fast retransmissions
that are caused by duplicate ACK reception. I have some traces that show
this behaviour, let me know if you are interested.


-In the cases where TCP timestamps or FRTO are not able to detect spurious
retransmissions, the performance degrades even more than when the TCP
timestamp or FRTO options are not used.


I also have one additional trace that shows the problem of an
unexplained pause in the TCP sender during retransmission, which I found
really hard to explain. It is similar to case 1), but this time I am
doing an upgrade instead, from a 384kbps connection to a 1Mbps
connection. The traces can be found at...

http://kom.aau.dk/~oumer/drop_0_delay_UPGRADE_SERVER.dat
http://kom.aau.dk/~oumer/drop_0_delay_UPGRADE_CLIENT.dat
and the tcptrace time sequence curve can be found in
http://kom.aau.dk/~oumer/drop_0_delay_UPGRADE.ps

as you can see from the server side trace... (all the packets shown here 
are retransmissions because I flushed the sender's buffer at time 
instant 17:26:24.657)

17:26:26.261972  2267693336:2267694796(1460) ack 3498775069 win 5840 (DF)
17:26:26.319180  . ack 2267694796 win 61320 (DF) [tos 0x8]
17:26:26.321961  2267694796:2267696256(1460) ack 3498775069 win 5840 (DF)
17:26:26.379160  . ack 2267696256 win 61320 (DF) [tos 0x8]
17:26:26.381940 . 2267696256:2267697716(1460) ack 3498775069 win 5840 (DF)
17:26:26.439138  . ack 2267697716 win 61320 (DF) [tos 0x8]
17:26:26.441925  2267697716:2267699176(1460) ack 3498775069 win 5840 (DF)
17:26:26.499144   ack 2267699176 win 61320 (DF) [tos 0x8]
17:26:28.234327  2267699176:2267700636(1460) ack 3498775069 win 5840 (DF)

Even though the server got an ACK with ack number 2267699176 at time
instant 17:26:26.49..., it waited until 17:26:28.234 to resend the
packet, which is around 1.73 seconds. I have checked with other traces
where I introduced delay, and for this link the first timeout occurs
after 1.73 seconds, which seems to be the RTO at that time, so for no
apparent reason TCP is waiting for a timeout... Case 1 is quite similar,
but there the retransmissions were triggered by a timeout to begin with;
here the retransmissions are triggered by duplicate ACKs... In case 1,
described below, this abnormal behaviour occurred after only a couple of
packets were retransmitted... here it took quite a few retransmissions
before the same problem happened... any insight into this is greatly
appreciated!!
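
The size of that gap can be checked directly from the two timestamps in
the trace above (17:26:26.499144 for the last ACK and 17:26:28.234327 for
the next retransmission). The helpers below are only a sketch with made-up
names (ts_to_usec(), ts_gap_usec()); they assume tcpdump-style
"HH:MM:SS.uuuuuu" timestamps with exactly six fractional digits and both
timestamps on the same day.

```c
#include <assert.h>
#include <stdio.h>

/* Convert "HH:MM:SS.uuuuuu" to microseconds since midnight.
 * Returns -1 on malformed input. Assumes six fractional digits. */
static long long ts_to_usec(const char *ts)
{
    int h, m, s, us;

    if (sscanf(ts, "%d:%d:%d.%6d", &h, &m, &s, &us) != 4)
        return -1;
    return (((long long)h * 60 + m) * 60 + s) * 1000000LL + us;
}

/* Gap in microseconds between two timestamps taken on the same day. */
static long long ts_gap_usec(const char *earlier, const char *later)
{
    long long a = ts_to_usec(earlier);
    long long b = ts_to_usec(later);

    assert(a >= 0 && b >= 0);
    return b - a;
}
```

For the two timestamps quoted above this gives 1735183 microseconds, i.e.
the roughly 1.73 seconds mentioned in the text.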


Thanks in advance,
Oumer

Oumer Teyeb wrote:


Hi all,

I have some questions regarding Linux TCP in the presence of delays or
packet drops. It is a somewhat long mail, but there are only two or
three questions; I just wanted to provide detailed information so that
the problem is clear. Thanks for your patience!!


Best regards,
Oumer

Note that for the traces referred here, SACK,timestamps, and FRTO are 
all disabled...


1) packet drops

I have a trace where the tcp sender window is flushed and  then the 
connection speed is  changed from 1Mbps to 384kbps...

The trace files from both the client and the server side can be found at
http://kom.aau.dk/~oumer/drop_0_delay_SERVER.dat
http://kom.aau.dk/~oumer/drop_0_delay_CLIENT.dat
and the tcptrace time sequence curve can be found in
http://kom.aau.dk/~oumer/drop_0_delay.ps

As can be seen from the plot and the trace files, at around
17:19:35.705733 the window was flushed (both the sender's and the
receiver's), and hence packets with seq numbers from 1840001135 up to
1840058075 were dropped (39 packets)... The ACK for 1840001135 was also
dropped (this can be seen from the traces, as it appears in the client
trace but not in the server trace)...
Since there were still packets to be sent, the sender keeps sending a few
more packets, and a few of them are received (from the client side
trace..)

17:19:35.938017 1840059535:1840060995(1460) ack 3059152863 win 5840 
(DF)...
17:19:35.938028  ack 1840001135 win 62780 (DF) [tos 0x8]...first ACK 
that is going to be received by the sender

17:19:35.969316  1840060995:1840062455(1460) ack 3059152863 win 5840 (DF)
17:19:35.969325  1840001135 win 62780 (DF) [tos 0x8]first 
duplicate ACK

17:19:36.000519  1840062455:1840063915(1460) ack 3059152863 win 5840 (DF)
17:19:3

[RFC] irqbalance: Mark in-kernel irqbalance as obsolete, set to N by default

2006-07-31 Thread Auke Kok


We've recently seen a number of user bug reports against e1000 that the 
in-kernel irqbalance code is detrimental to network latency. The algorithm 
keeps swapping irq's for NICs from cpu to cpu causing extremely high network 
latency (>1000ms). Another NIC driver (cxgb) already has severe warnings in 
their documentation file against using CONFIG_IRQBALANCE, but this is a 
general problem for all NIC drivers and other subsystems. This is especially 
so with cpufreq scaling where the system is slowed down and the migrations 
take much longer.


I suggest that the in-kernel irqbalance is phased out, by marking it OBSOLETE 
first and (perhaps) removing the code later. The userspace irqbalance daemon 
written by Arjan van de Ven does a wonderful job and should be used instead.


Signed-off-by: Auke Kok <[EMAIL PROTECTED]>

---

 Kconfig |   15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)
---
diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig
index daa75ce..5a40cfe 100644
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -690,12 +690,19 @@ config EFI
 	kernel should continue to boot on existing non-EFI platforms.
 
 config IRQBALANCE
- 	bool "Enable kernel irq balancing"
+	bool "Enable kernel irq balancing (obsolete)"
 	depends on SMP && X86_IO_APIC
-	default y
+	default n
 	help
- 	  The default yes will allow the kernel to do irq load balancing.
-	  Saying no will keep the kernel from doing irq load balancing.
+	  The kernel irq balance will migrate interrupts between cpu's
+	  constantly, which may help reduce load in some cases. It is not
+	  beneficial for latency however, and a user-space daemon is available
+	  that does a much better job.
+
+	  The default no will keep the kernel from doing irq load balancing.
+	  Say yes will allow the kernel to do irq load balancing.
+
+	  If unsure, say N.
 
 # turning this on wastes a bunch of space.
 # Summit needs it only when NUMA is on


Re: Linville's L2 rant... -- Re: PATCH Fix bonding active-backup behavior for VLAN interfaces

2006-07-31 Thread Christophe Devriese
On Monday 31 July 2006 14:30, you wrote:
> (This is not directed at Christophe, or anyone in particular...)
>
> 
>
> Am I the only one that thinks that our handling of LAN L2 stuff
> is at best a little "too" flexible (and at worst a collection of
> nasty hacks)?
>
> I mean, do we really need both the ability to bond multiple vlan
> interfaces AND the ability to have vlan interfaces on top of a bond?
> How many people really appreciate the subtle(?) differences?
>
> Then throw bridging into the mix!  If I'm using VLANs and bonds in
> a bridged environment, do I bridge the bonds, or bond the bridges?

In all honesty, you cannot bond bridges :-p

> Do the VLANs come before the bonds?  after the bridges?  or somewhere
> in-between?  Do all these combinations even work together?  Who has
> the definitive answer (besides the code itself)?
>
> I have no doubt that there are plenty of opportunities for cleverness
> here (and no doubt dragons too).  I just doubt that most of them
> are worth the complexities introduced by our current collection of
> "transparently" stackable pseudo-drivers and strategically placed hacks
> (e.g. skb_bond).  All that, and it still isn't clear to me how we
> can cleanly accommodate 802.1s (which adds VLAN awareness to bridging).
>
> Do we hold the view that our L2 code is on par with the rest of
> our code?  Is there an appetite for a clean-up?  Or is it just me?

A vlan capable bridge with trunk ports and access ports would be nice :-p

I think the current code is nice. You need it to properly support 
virtualization and I find it very useful where I work to have this option.

Regards,

Christophe


Re: Runtime power management for network interfaces

2006-07-31 Thread Auke Kok

Randy.Dunlap wrote:
> On Tue, 25 Jul 2006 09:20:06 -0700 Auke Kok wrote:
>
>> Alan Stern wrote:
>>> During a Power Management session at the Ottawa Linux Symposium, it was
>>> generally agreed that network interface drivers ought to automatically
>>> suspend their devices (if possible) whenever:
>>>
>>> (1) The interface is ifconfig'ed down, or
>>>
>>> (2) No link is available.
>>>
>>> Presumably (1) should be easy enough to implement.  (2) might or might not
>>> be feasible, depending on how much WOL support is available.  (It might
>>> not be feasible at all for wireless networking.)  Still, there can be no
>>> question that it would be a Good Thing for laptops to power-down their
>>> ethernet controllers when the network cable is unplugged.
>>>
>>> Has any progress been made in this direction?  If not, a natural approach
>>> would be to start with a reference implementation in one driver which
>>> could then be copied to other drivers.
>>
>> Intel's newer e1000's (ich7 onboard e1000 and newer versions for instance)
>> already support this feature partially - the MAC stays on but the PHY can be
>> powered off when no link is present.
>>
>> In order to enable this feature you will need to turn it on explicitly at load
>> time:
>>
>> modprobe e1000 SmartPowerDownEnable=1
>
> Please add that to Documentation/networking/e1000.txt.


I'm long overdue with documentation updates ATM, I'll see if I can fix that :)

Cheers,

Auke


Re: Runtime power management for network interfaces

2006-07-31 Thread Stephen Hemminger
On Tue, 25 Jul 2006 11:59:52 -0400 (EDT)
Alan Stern <[EMAIL PROTECTED]> wrote:

> During a Power Management session at the Ottawa Linux Symposium, it was
> generally agreed that network interface drivers ought to automatically
> suspend their devices (if possible) whenever:
> 
> (1) The interface is ifconfig'ed down, or
> 
> (2) No link is available.

This is hard because most of the power may be consumed by the PHY interface
and it needs to be alive to see link.

> 
> Presumably (1) should be easy enough to implement.  (2) might or might not
> be feasible, depending on how much WOL support is available.  (It might
> not be feasible at all for wireless networking.)  Still, there can be no
> question that it would be a Good Thing for laptops to power-down their
> ethernet controllers when the network cable is unplugged.
> 
> Has any progress been made in this direction?  If not, a natural approach 
> would be to start with a reference implementation in one driver which 
> could then be copied to other drivers.
> 

The problem is not generic, it really is specific to each device.
We have all the necessary infrastructure to do the right thing in the network
device driver, but in many cases we don't have the code or the technical
information to do proper power management.

-- 
Stephen Hemminger <[EMAIL PROTECTED]>
"And in the Packet there writ down that doome"


Re: [RFC] gre: transparent ethernet bridging

2006-07-31 Thread Stephen Hemminger
On Mon, 31 Jul 2006 20:06:41 +1000
Philip Craig <[EMAIL PROTECTED]> wrote:

> This patch implements transparent ethernet bridging for gre tunnels.
> There are a few outstanding issues.

Why not use existing bridge code?

> There is no way for userspace to select the type of gre tunnel. The
> #if 0 near the top of the patch forces all gre tunnels to be bridges.
> The problem is that userspace uses an IPPROTO_ to select the type of
> tunnel, but both types of gre tunnel are IPPROTO_GRE. I can't see
> anything else in struct ip_tunnel_parm that could be used to select
> this. One approach that I've seen mentioned in the archives is to add
> a netlink interface to replace the tunnel ioctls.
> 
> Network loops are bad. See the comments at the top of ip_gre.c for
> a description of how gre tunnels handle this normally. But for gre
> bridges, we don't want to copy the ttl (it breaks routing protocols),
> and we don't want to force DF (we want to bridge 1500 byte packets).
> I couldn't think of any solution for this.
> 
> Some routers set LLC_SAP_BSPAN in the gre protocol field, and then
> give the bpdu packet without any other ethernet/llc header. This patch
> currently tries to fake the ethernet/llc header before passing the
> packet up, but it is buggy (mac addresses are wrong at least). Maybe a
> better approach is to call directly into the bridging code. I didn't try
> that at first because it isn't modular, and may break other things that
> want to see the packet.

Existing bridge code already has spanning tree.

> --- linux-2.6.x/net/ipv4/ip_gre.c 18 Jun 2006 23:30:56 -  1.1.1.33
> +++ linux-2.6.x/net/ipv4/ip_gre.c 31 Jul 2006 09:57:41 -
> @@ -30,6 +30,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include 
>  #include 
> @@ -41,6 +43,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #ifdef CONFIG_IPV6
>  #include 
> @@ -119,6 +123,7 @@
> 
>  static int ipgre_tunnel_init(struct net_device *dev);
>  static void ipgre_tunnel_setup(struct net_device *dev);
> +static void ipgre_ether_tunnel_setup(struct net_device *dev);
> 
>  /* Fallback tunnel: no source, no destination, no key, no options */
> 
> @@ -274,7 +279,11 @@ static struct ip_tunnel * ipgre_tunnel_l
>   goto failed;
>   }
> 
> +#if 0
>   dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup);
> +#else
> + dev = alloc_netdev(sizeof(*t), name, ipgre_ether_tunnel_setup);
> +#endif

"Do, or do not there is no try"


> +__be16 ipgre_type_trans(struct sk_buff *skb, int offset)
> +{
> + u8 *h = skb->data;
> + __be16 flags = *(__be16*)h;
> + __be16 proto = *(__be16*)(h + 2);
> +
> + /* WCCP version 1 and 2 protocol decoding.
> +  * - Change protocol to IP
> +  * - When dealing with WCCPv2, Skip extra 4 bytes in GRE header
> +  */
> + if (flags == 0 &&
> + proto == __constant_htons(ETH_P_WCCP)) {
> + proto = __constant_htons(ETH_P_IP);
> + if ((*(h + offset) & 0xF0) != 0x40)
> + offset += 4;
> + }

Don't use __constant_htons() except in initializers and switch cases
(where gcc is too stupid to optimize the macro).
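
A rough user-space sketch of that distinction, not taken from the patch.
MY_CONSTANT_HTONS is a hypothetical stand-in for the kernel's
__constant_htons(), the MY_ETH_P_* values are written out by hand, and the
byte-order test assumes a GCC/Clang style compiler. The macro form is only
needed where C demands an integer constant expression, such as a case
label; in an ordinary comparison plain htons() is enough.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* htons() */

/* Hypothetical stand-in for __constant_htons(): pure arithmetic, so the
 * result is usable where an integer constant expression is required.
 * Assumes the compiler defines __BYTE_ORDER__ (GCC and Clang do). */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define MY_CONSTANT_HTONS(x) ((uint16_t)(x))
#else
#define MY_CONSTANT_HTONS(x) \
	((uint16_t)((((x) & 0x00ffu) << 8) | (((x) & 0xff00u) >> 8)))
#endif

/* Illustrative protocol numbers, named MY_* to avoid clashing with headers. */
#define MY_ETH_P_IP   0x0800u
#define MY_ETH_P_WCCP 0x883Eu

/* Ordinary comparison: plain htons() is fine, the compiler folds it. */
static int is_wccp(uint16_t proto_be)
{
	return proto_be == htons(MY_ETH_P_WCCP);
}

/* Case labels must be constant expressions, hence the macro form. */
static const char *proto_name(uint16_t proto_be)
{
	switch (proto_be) {
	case MY_CONSTANT_HTONS(MY_ETH_P_IP):
		return "ip";
	case MY_CONSTANT_HTONS(MY_ETH_P_WCCP):
		return "wccp";
	default:
		return "other";
	}
}
```

Both helpers take the protocol field in network byte order, as it would be
read straight from the GRE header.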

-- 
Stephen Hemminger <[EMAIL PROTECTED]>
"And in the Packet there writ down that doome"


Re: IBM (Lenovo) T60: e1000 driver high latency

2006-07-31 Thread Thomas Glanzmann
Hello Auke,

> > CONFIG_IRQBALANCE=y

thanks for the feedback. The behaviour improved. In my first tests it
wasn't so good. But now it seems perfect:

(thinkpad) [~] ping 131.188.30.102
PING 131.188.30.102 (131.188.30.102) 56(84) bytes of data.
64 bytes from 131.188.30.102: icmp_seq=1 ttl=64 time=419 ms
64 bytes from 131.188.30.102: icmp_seq=2 ttl=64 time=0.264 ms
64 bytes from 131.188.30.102: icmp_seq=3 ttl=64 time=0.701 ms
64 bytes from 131.188.30.102: icmp_seq=4 ttl=64 time=0.630 ms
64 bytes from 131.188.30.102: icmp_seq=5 ttl=64 time=0.710 ms
64 bytes from 131.188.30.102: icmp_seq=6 ttl=64 time=0.638 ms
64 bytes from 131.188.30.102: icmp_seq=7 ttl=64 time=0.588 ms
64 bytes from 131.188.30.102: icmp_seq=8 ttl=64 time=0.517 ms
64 bytes from 131.188.30.102: icmp_seq=9 ttl=64 time=0.445 ms
64 bytes from 131.188.30.102: icmp_seq=10 ttl=64 time=0.374 ms

--- 131.188.30.102 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8996ms
rtt min/avg/max/mdev = 0.264/42.447/419.606/125.719 ms
(thinkpad) [~] ping 131.188.30.102
PING 131.188.30.102 (131.188.30.102) 56(84) bytes of data.
64 bytes from 131.188.30.102: icmp_seq=1 ttl=64 time=0.547 ms
64 bytes from 131.188.30.102: icmp_seq=2 ttl=64 time=0.502 ms
64 bytes from 131.188.30.102: icmp_seq=3 ttl=64 time=0.402 ms
64 bytes from 131.188.30.102: icmp_seq=4 ttl=64 time=0.329 ms

--- 131.188.30.102 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2998ms
rtt min/avg/max/mdev = 0.329/0.445/0.547/0.085 ms
(thinkpad) [~] ping 131.188.30.102
PING 131.188.30.102 (131.188.30.102) 56(84) bytes of data.
64 bytes from 131.188.30.102: icmp_seq=1 ttl=64 time=0.301 ms
64 bytes from 131.188.30.102: icmp_seq=2 ttl=64 time=0.753 ms
64 bytes from 131.188.30.102: icmp_seq=3 ttl=64 time=0.681 ms
64 bytes from 131.188.30.102: icmp_seq=4 ttl=64 time=0.609 ms
64 bytes from 131.188.30.102: icmp_seq=5 ttl=64 time=0.538 ms
64 bytes from 131.188.30.102: icmp_seq=6 ttl=64 time=0.466 ms
64 bytes from 131.188.30.102: icmp_seq=7 ttl=64 time=0.374 ms
64 bytes from 131.188.30.102: icmp_seq=8 ttl=64 time=0.308 ms

--- 131.188.30.102 ping statistics ---
8 packets transmitted, 8 received, 0% packet loss, time 6993ms
rtt min/avg/max/mdev = 0.301/0.503/0.753/0.161 ms
(thinkpad) [~] ping www.heise.de
PING www.heise.de (193.99.144.85) 56(84) bytes of data.
64 bytes from www.heise.de (193.99.144.85): icmp_seq=1 ttl=246 time=1019 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=2 ttl=246 time=15.8 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=3 ttl=246 time=1000 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=4 ttl=246 time=360 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=5 ttl=246 time=39.1 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=6 ttl=246 time=360 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=7 ttl=246 time=1000 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=8 ttl=246 time=360 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=9 ttl=246 time=360 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=10 ttl=246 time=360 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=11 ttl=246 time=1000 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=12 ttl=246 time=319 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=13 ttl=246 time=20.5 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=14 ttl=246 time=350 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=15 ttl=246 time=792 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=16 ttl=246 time=350 ms
64 bytes from www.heise.de (193.99.144.85): icmp_seq=17 ttl=246 time=1000 ms

--- www.heise.de ping statistics ---
18 packets transmitted, 17 received, 5% packet loss, time 17013ms
rtt min/avg/max/mdev = 15.835/512.385/1019.753/360.201 ms, pipe 2
(thinkpad) [~] ping www.google.de
PING www.l.google.com (66.249.85.104) 56(84) bytes of data.
64 bytes from 66.249.85.104: icmp_seq=1 ttl=246 time=732 ms
64 bytes from 66.249.85.104: icmp_seq=2 ttl=246 time=999 ms
64 bytes from 66.249.85.104: icmp_seq=3 ttl=246 time=1000 ms
64 bytes from 66.249.85.104: icmp_seq=4 ttl=246 time=400 ms

--- www.l.google.com ping statistics ---
5 packets transmitted, 4 received, 20% packet loss, time 4388ms
rtt min/avg/max/mdev = 400.939/783.424/1000.436/246.402 ms, pipe 2
(thinkpad) [~] uname -a
Linux thinkpad 2.6.17.7 #3 SMP Mon Jul 31 17:44:21 CEST 2006 i686 GNU/Linux
(thinkpad) [~] date
Mon Jul 31 17:47:39 CEST 2006
(thinkpad) [~] ping www.google.de
PING www.l.google.com (66.249.85.99) 56(84) bytes of data.
64 bytes from 66.249.85.99: icmp_seq=1 ttl=246 time=13.3 ms
64 bytes from 66.249.85.99: icmp_seq=2 ttl=246 time=13.4 ms
64 bytes from 66.249.85.99: icmp_seq=3 ttl=246 time=13.4 ms
64 bytes from 66.249.85.99: icmp_seq=4 ttl=246 time=13.3 ms
64 bytes from 66.249.85.99: icmp_seq=5 ttl=246 time=13.2 ms
64 bytes from 66.249.85.99: icmp_seq=6 ttl=246 time=13.6 ms
64 bytes from 66.249.85.99: icmp_seq=7 ttl=246 time=13.6 ms
64 bytes from 6

Re: eth2.100: received packet with own address as source address

2006-07-31 Thread Andy Gospodarek
On Mon, Jul 31, 2006 at 12:27:19AM -0400, David Coulson wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> I have a machine running 2.6.18-rc3 with a bridge config that looks like
> this:
> 
> cr1:~# brctl show
> bridge name     bridge id               STP enabled     interfaces
> vlan100         36b0.0007e90f40c1       yes             eth0.100
>                                                         eth2.100
> vlan101         5dc0.0007e90f40c1       yes             eth0.101
>                                                         eth2.101
> vlan102         5dc0.00e08163c33f       yes             eth3
> vlan200         5dc0.0007e90f40c1       yes             eth0.200
>                                                         eth2.200
> vlan201         5dc0.0007e90f40c1       yes             eth0.201
>                                                         eth2.201
> vlan300         5dc0.0007e90f40c1       yes             eth0.300
>                                                         eth2.300
> vlan301         5dc0.0007e90f40c1       yes             eth0.301
>                                                         eth2.301
> vlan302         5dc0.0007e90f40c1       yes             eth0.302
>                                                         eth2.302
> vlan303         5dc0.0007e90f40c1       yes             eth0.303
>                                                         eth2.303
> 
> All bridges, except for vlan102, are running STP and appear to have
> elected themselves the root bridge.
> 
> I see this on the console:
> 
> printk: 18 messages suppressed.
> eth2.100: received packet with  own address as source address
> printk: 20 messages suppressed.
> eth2.200: received packet with  own address as source address
> 

This usually means that you have a loop somewhere. I can't say for
certain that this is the case with your current setup, but it would
be useful to capture traffic on that interface and look at it in
more detail.


> This repeats continuously, only indicating an issue on the two specific
> ports mentioned above. This confuses me:
> 
> 1) All VLANs (except for 102) are in an identical configuration
> 2) No other VLANs exhibit the same kernel message
> 3) I have another machine, running 2.6.18-rc2 with the same config and
> no kernel message
> 
> Is this message specifically related to BPDU frames, or is it pertaining
> to any Ethernet frame on the port?
> 
> Here is the STP config for one of the bridges. What's the next step to
> troubleshoot this?

Capture some of the suspicious traffic.

> 
> vlan200
>  bridge id              5dc0.0007e90f40c1
>  designated root        5dc0.0007e90f40c1
>  root port                 0                    path cost                  0
>  max age                  20.00                 bridge max age            20.00
>  hello time                2.00                 bridge hello time          2.00
>  forward delay             5.00                 bridge forward delay       5.00
>  ageing time             300.01
>  hello timer               1.67                 tcn timer                  0.00
>  topology change timer     0.00                 gc timer                   0.03
>  flags
> 
> 
> eth0.200 (1)
>  port id                8001                    state                forwarding
>  designated root        5dc0.0007e90f40c1       path cost                 19
>  designated bridge      5dc0.0007e90f40c1       message age timer          0.00
>  designated port        8001                    forward delay timer        0.00
>  designated cost           0                    hold timer                 0.67
>  flags
> 
> eth2.200 (2)
>  port id                8002                    state                  blocking
>  designated root        5dc0.0007e90f40c1       path cost                  4
>  designated bridge      5dc0.0007e90f40c1       message age timer         19.67
>  designated port        8001                    forward delay timer        0.00
>  designated cost           0                    hold timer                 0.00
>  flags
> 
> 
> David
> 
> - --
> David J. Coulson
> email: [EMAIL PROTECTED]
> web: http://www.davidcoulson.net/
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.3 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> 
> iD8DBQFEzYanTIgPQWnLowkRApT1AJ9yXl/O+rzacF+mpM7hhNtsEh/ufACfQCHk
> mBGBxl5Iscj7vbFlM0IzY/Y=
> =ZIs7
> -END PGP SIGNATURE-


Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables

2006-07-31 Thread Thomas Graf
* Herbert Xu <[EMAIL PROTECTED]> 2006-08-01 00:01
> Without a route cache, I think our only choice is to search through
> all tables.  The same thing applies to PMTU updates as well.

I think PMTU etc. should be moved out of the route into
some form of flow cache. It's currently using rt6_lookup()
which even goes through the rules.

Doing a few thousand trie lookups after Patrick's changes
in the worst case for every redirect might be acceptable
but doing so for every PMTU update could become an issue.

> Actually, if we're adding policy routing, we should seriously consider
> whether living without a routing cache is still viable or not because
> the cost of a route lookup has just gone up.

Absolutely.


Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-07-31 Thread Alexey Kuznetsov
Hello!

> It does seem weird that IP output won't pay attention to

Not so weird, actually.

The logic was:

Only initial skb allocation tries to reserve all the space
to avoid copies in the future.

All other places just check that there is enough space
for their immediate needs. If dev->hard_header() is NULL, it means that
the stack does not need any space at all, so it does not need to worry.

Right logic for reallocation would be:

if (skb_headroom(skb) < space_which_I_need_now) {
skb2 = skb_realloc_headroom(skb, space_for_future);
}

That logic is not followed exactly, only because of laziness:
every so often a device turns up which forgets to check for space,
so reallocation gets added in absolutely inappropriate places.
E.g. ip_forward() does not need to reallocate the skb when
skb_headroom() < dev->hard_header_len. It does, and that is not good.

Good example is ipip tunnel. It sets:

dev->hard_header_len = sizeof(iphdr) + LL_MAX_HEADER

because it does not know, what device will be used.
It is lots of space and most likely it will not use it.
So, initial allocation reserves lots of space, but all the rest
of stack should not reallocate, tunnel will take care of this itself.
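
The reallocation rule described above can be modelled in user space.
This is only a toy sketch under stated assumptions: struct toy_buf and the
toy_* helpers are hypothetical names and have nothing to do with the real
sk_buff API. It shows the one point that matters: test against the space
needed right now, but when a copy is unavoidable, reserve the larger
"future" amount so later layers do not have to copy again.

```c
#include <stdlib.h>
#include <string.h>

/* Toy packet buffer: [head, data) is headroom, payload starts at data. */
struct toy_buf {
	unsigned char *head;
	unsigned char *data;
	size_t len;
};

static size_t toy_headroom(const struct toy_buf *b)
{
	return (size_t)(b->data - b->head);
}

/* Allocate a buffer with the given headroom and a copy of the payload. */
static int toy_init(struct toy_buf *b, size_t headroom,
		    const void *payload, size_t len)
{
	size_t total = headroom + len;

	b->head = malloc(total ? total : 1);
	if (!b->head)
		return -1;
	b->data = b->head + headroom;
	b->len = len;
	if (len)
		memcpy(b->data, payload, len);
	return 0;
}

/*
 * Reallocate only if the immediate need is not met; when we do copy,
 * reserve want_future bytes of headroom instead of just need_now.
 * Returns 1 if a copy was made, 0 if not, -1 on allocation failure.
 */
static int toy_ensure_headroom(struct toy_buf *b, size_t need_now,
			       size_t want_future)
{
	struct toy_buf nb;

	if (toy_headroom(b) >= need_now)
		return 0;
	if (want_future < need_now)
		want_future = need_now;
	if (toy_init(&nb, want_future, b->data, b->len) < 0)
		return -1;
	free(b->head);
	*b = nb;
	return 1;
}
```

A caller that only needs 2 bytes and already has 4 gets no copy at all,
even if the "future" figure is much larger; that is the case the forwarding
path is in when the tunnel device advertises a generous hard_header_len.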

Alexey




Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables

2006-07-31 Thread Thomas Graf
* Ville Nuorvala <[EMAIL PROTECTED]> 2006-07-31 16:55
> > When locating routes for redirects only the main table is
> > searched for now. Since policy rules will not be reversible
> > it is unclear whether it makes sense to change this.
> 
> This is a good point. You are absolutely correct about the policy rules.
> 
> IIRC, I initially looked through all the tables, but skipped this
> behavior when I rewrote the code for 2.6.11. Currently I'm once again
> in favor of looping through them all. This is IMO at least closer to the
> spirit of RFC 2461 section 8.3. where a host SHOULD update its
> destination cache upon receiving a redirect. If we don't look through
> all tables, we can't ensure this happens.

I agree, it will depend on which approach is taken regarding a
flow cache or route cache.

> > +#define RT6_TABLE_UNSPEC   RT_TABLE_UNSPEC
> > +#define RT6_TABLE_MAIN     RT_TABLE_MAIN
> > +#define RT6_TABLE_LOCAL    RT6_TABLE_MAIN
> > +#define RT6_TABLE_DFLT     RT6_TABLE_MAIN
> > +#define RT6_TABLE_INFO     RT6_TABLE_MAIN
> 
> IMO it's a bit inconsistent to define a separate table entry for Route
> Information generated routes, but not Prefix Information based ones.
> What do you say about adding a RT6_TABLE_PRFX?

Sounds good.

> > @@ -1435,12 +1523,15 @@ static struct rt6_info *rt6_add_route_in
> >  struct rt6_info *rt6_get_dflt_router(struct in6_addr *addr, struct 
> > net_device *dev)
> >  {  
> > struct rt6_info *rt;
> > -   struct fib6_node *fn;
> > +   struct fib6_table *table;
> >  
> > -   fn = &ip6_routing_table;
> > +   /* TODO: It might be better to search all tables */
> > +   table = fib6_get_table(RT6_TABLE_DFLT);
> 
> As long as the table for default routes is RT6_TABLE_DFLT and can't be
> configured by the user, I think the correct behavior is just to search
> RT6_TABLE_DFLT.

I agree, I intended to remove that comment but missed it.


Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework

2006-07-31 Thread Thomas Graf
* Ville Nuorvala <[EMAIL PROTECTED]> 2006-07-31 17:46
> > Derived from net/ipv6/fib_rules.c
> 
> do you mean net/ipv4/fib_rules.c or net/ipv6/fib6_rules.c? :-)

Hehe, I meant net/ipv4/fib_rules.c :-)

> > +struct fib_rule_hdr
> > +{
> > +   __u8    family;
> > +   __u8    dst_len;
> > +   __u8    src_len;
> > +   __u8    tos;
> > +
> > +   __u8    table;
> > +   __u8    res1;   /* reserved */
> > +   __u8    res2;   /* reserved */
> > +   __u8    action;
> > +
> > +   __u32   flags;
> > +};
> 
> I'm wondering if this is guaranteed to be equivalent to struct rtmsg?
> 
> struct rtmsg
> {
>   unsigned char   rtm_family;
>   unsigned char   rtm_dst_len;
>   unsigned char   rtm_src_len;
>   unsigned char   rtm_tos;
> 
>   unsigned char   rtm_table;      /* Routing table id */
>   unsigned char   rtm_protocol;   /* Routing protocol; see below */
>   unsigned char   rtm_scope;      /* See below */
>   unsigned char   rtm_type;       /* See below */
> 
>   unsigned        rtm_flags;
> };
> 
> Won't we otherwise be breaking the existing userland interface?

It is equivalent but you're right, it would break userland
interfaces otherwise. I've defined this new header to add
implicit names and stop the confusion with unused fields.
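
One way to keep that equivalence from silently drifting would be a
compile-time check. The following is only a sketch under assumptions: both
structs are copied by hand from the mail above, uint8_t/uint32_t stand in
for __u8/__u32, rtmsg_copy is a renamed copy so it cannot clash with the
real header, and SAME_OFFSET and fib_rule_hdr_size() are hypothetical
helpers that are not part of the patch.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hand copy of the proposed header. */
struct fib_rule_hdr {
	uint8_t  family;
	uint8_t  dst_len;
	uint8_t  src_len;
	uint8_t  tos;
	uint8_t  table;
	uint8_t  res1;    /* reserved */
	uint8_t  res2;    /* reserved */
	uint8_t  action;
	uint32_t flags;
};

/* Hand copy of the existing header, renamed for this sketch. */
struct rtmsg_copy {
	unsigned char rtm_family;
	unsigned char rtm_dst_len;
	unsigned char rtm_src_len;
	unsigned char rtm_tos;
	unsigned char rtm_table;
	unsigned char rtm_protocol;
	unsigned char rtm_scope;
	unsigned char rtm_type;
	unsigned      rtm_flags;
};

/* Fails the build if a field pair ends up at different offsets. */
#define SAME_OFFSET(new_field, old_field) \
	static_assert(offsetof(struct fib_rule_hdr, new_field) == \
		      offsetof(struct rtmsg_copy, old_field), \
		      "fib_rule_hdr/rtmsg field offset mismatch")

SAME_OFFSET(family,  rtm_family);
SAME_OFFSET(dst_len, rtm_dst_len);
SAME_OFFSET(src_len, rtm_src_len);
SAME_OFFSET(tos,     rtm_tos);
SAME_OFFSET(table,   rtm_table);
SAME_OFFSET(res1,    rtm_protocol);
SAME_OFFSET(res2,    rtm_scope);
SAME_OFFSET(action,  rtm_type);
SAME_OFFSET(flags,   rtm_flags);

static_assert(sizeof(struct fib_rule_hdr) == sizeof(struct rtmsg_copy),
	      "fib_rule_hdr/rtmsg size mismatch");

/* Runtime view of the same fact, handy in a quick test. */
static size_t fib_rule_hdr_size(void)
{
	return sizeof(struct fib_rule_hdr);
}
```

If either struct gained or reordered a field, the file would stop
compiling instead of changing the wire format seen by userspace.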

> > +enum
> > +{
> > +   FRA_UNSPEC,
> > +   FRA_DST,        /* destination address */
> > +   FRA_SRC,        /* source address */
> > +   FRA_IFNAME,     /* interface name */
> > +   FRA_UNUSED1,
> > +   FRA_UNUSED2,
> > +   FRA_PRIORITY,   /* priority/preference */
> > +   FRA_UNUSED3,
> > +   FRA_UNUSED4,
> > +   FRA_UNUSED5,
> > +   FRA_FWMARK,     /* netfilter mark (IPv4) */
> > +   FRA_FLOW,       /* flow/class id */
> > +   __FRA_MAX
> > +};
> > +
> > +#define FRA_MAX (__FRA_MAX - 1)
> > +
> > +enum
> > +{
> > +   FR_ACT_UNSPEC,
> > +   FR_ACT_TO_TBL,          /* Pass to fixed table */
> > +   FR_ACT_RES1,
> > +   FR_ACT_RES2,
> > +   FR_ACT_RES3,
> > +   FR_ACT_RES4,
> > +   FR_ACT_BLACKHOLE,       /* Drop without notification */
> > +   FR_ACT_UNREACHABLE,     /* Drop with ENETUNREACH */
> > +   FR_ACT_PROHIBIT,        /* Drop with EACCES */
> > +   __FR_ACT_MAX,
> > +};
> > +
> > +#define FR_ACT_MAX (__FR_ACT_MAX - 1)
> > +
> > +#endif
> 
> Shouldn't all these (struct fib_rule_hdr included) actually be defined
> in include/linux/rtnetlink.h?

We used to stuff everything into rtnetlink.h for no good reason. Having
independent include/linux/.h to export the interface to
userspace and include/net/.h to export the kernel interface
instead of contributing to the ifdef hell seems a lot cleaner to me.


Re: IBM (Lenovo) T60: e1000 driver high latency

2006-07-31 Thread Auke Kok

Thomas Glanzmann wrote:
> Hello,
>
>    [ resend because .config and the used kernel version were missing ]
>
> Linux Kernel Version: Linus Vanilla Tree; .config attached.
>
> I recently acquired a Lenovo (IBM) T60 with an e1000 network card. I
> experience high latency with this network card: pings take up to 1 second
> where the ping should be around 25 ms. I googled a bit and found the
> following:
>
> - Enable NAPI, which didn't work for me.
>
> 64 bytes from 192.168.0.223: icmp_seq=30 ttl=64 time=1004 ms
> 64 bytes from 192.168.0.223: icmp_seq=31 ttl=64 time=0.444 ms
> 64 bytes from 192.168.0.223: icmp_seq=32 ttl=64 time=1006 ms
> 64 bytes from 192.168.0.223: icmp_seq=33 ttl=64 time=0.739 ms


Someone reported this problem on the e1000 bug tracker at e1000.sf.net.

He also reported that the behaviour goes away completely if he disables the 
in-kernel irq balancer:


: If I disable in kernel config Irq Balancing pings are
: much better but not the best  :-)
:
: 64 bytes from 192.168.3.74: icmp_seq=29 ttl=64 time=12.7 ms
: 64 bytes from 192.168.3.74: icmp_seq=30 ttl=64 time=10.0 ms
: 64 bytes from 192.168.3.74: icmp_seq=31 ttl=64 time=7.3 ms
: 64 bytes from 192.168.3.74: icmp_seq=32 ttl=64 time=4.5 ms

That's a large difference from >> 1000ms, and I can only suspect that
the kernel irqbalance is wreaking havoc in your system, trying to swap the
entire context between the cores (the T60 is a Core Duo) every second or so.


I've never believed much in the kernel irq balancer; the userspace daemon
written by Arjan van de Ven just does a much better job, so can you try to
disable the kernel irqbalancer?


> CONFIG_IRQBALANCE=y

turn that off ;)


Cheers,

Auke


Re: [RESEND 4/5] [IPV6]: Policy Routing Rules

2006-07-31 Thread Ville Nuorvala
Thomas Graf wrote:

> Adds support for policy routing rules including a new
> local table for routes with a local destination.

Looks good!

> Signed-off-by: Thomas Graf <[EMAIL PROTECTED]>

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>


Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework

2006-07-31 Thread Ville Nuorvala
Thomas Graf wrote:

Hi Thomas,


> Derived from net/ipv6/fib_rules.c

do you mean net/ipv4/fib_rules.c or net/ipv6/fib6_rules.c? :-)

A couple of comments below.

> Signed-off-by: Thomas Graf <[EMAIL PROTECTED]>
> 
> Index: net-2.6.git/include/linux/fib_rules.h
> ===
> --- /dev/null
> +++ net-2.6.git/include/linux/fib_rules.h
> @@ -0,0 +1,60 @@
> +#ifndef __LINUX_FIB_RULES_H
> +#define __LINUX_FIB_RULES_H
> +
> +#include 
> +#include 
> +
> +/* rule is permanent, and cannot be deleted */
> +#define FIB_RULE_PERMANENT   1
> +
> +struct fib_rule_hdr
> +{
> + __u8    family;
> + __u8    dst_len;
> + __u8    src_len;
> + __u8    tos;
> +
> + __u8    table;
> + __u8    res1;   /* reserved */
> + __u8    res2;   /* reserved */
> + __u8    action;
> +
> + __u32   flags;
> +};

I'm wondering if this is guaranteed to be equivalent to struct rtmsg?

struct rtmsg
{
        unsigned char   rtm_family;
        unsigned char   rtm_dst_len;
        unsigned char   rtm_src_len;
        unsigned char   rtm_tos;

        unsigned char   rtm_table;      /* Routing table id */
        unsigned char   rtm_protocol;   /* Routing protocol; see below */
        unsigned char   rtm_scope;      /* See below */
        unsigned char   rtm_type;       /* See below */

        unsigned        rtm_flags;
};

Won't we otherwise be breaking the existing userland interface?

> +enum
> +{
> + FRA_UNSPEC,
> + FRA_DST,        /* destination address */
> + FRA_SRC,        /* source address */
> + FRA_IFNAME,     /* interface name */
> + FRA_UNUSED1,
> + FRA_UNUSED2,
> + FRA_PRIORITY,   /* priority/preference */
> + FRA_UNUSED3,
> + FRA_UNUSED4,
> + FRA_UNUSED5,
> + FRA_FWMARK,     /* netfilter mark (IPv4) */
> + FRA_FLOW,       /* flow/class id */
> + __FRA_MAX
> +};
> +
> +#define FRA_MAX (__FRA_MAX - 1)
> +
> +enum
> +{
> + FR_ACT_UNSPEC,
> + FR_ACT_TO_TBL,          /* Pass to fixed table */
> + FR_ACT_RES1,
> + FR_ACT_RES2,
> + FR_ACT_RES3,
> + FR_ACT_RES4,
> + FR_ACT_BLACKHOLE,       /* Drop without notification */
> + FR_ACT_UNREACHABLE,     /* Drop with ENETUNREACH */
> + FR_ACT_PROHIBIT,        /* Drop with EACCES */
> + __FR_ACT_MAX,
> +};
> +
> +#define FR_ACT_MAX (__FR_ACT_MAX - 1)
> +
> +#endif

Shouldn't all these (struct fib_rule_hdr included) actually be defined
in include/linux/rtnetlink.h?

Otherwise, looks good.

Regards,
Ville


Re: Debugging kernel lockups during network activity

2006-07-31 Thread Jarek Poplawski

On 28-07-2006 16:39, Jarek Poplawski wrote:
...
> It has some great patch to queue scheduler by Hubert Xu. I think it is


I'm immensely sorry for changing the name of Mr Herbert Xu.

And I thought it was an easy name - just like some famous conductor
(but not so famous). You won't believe it, but when I wrote this I
checked the mails so as not to misspell his surname!


Jarek P.


Re: [PATCH 2/7] NetLabel: core network changes

2006-07-31 Thread Paul Moore
On Monday 31 July 2006 8:43 am, Venkat Yekkirala wrote:
> > The NetLabel patch allows administrators to assign a specific CIPSO
> > DOI/configuration to each LSM "domain".  Blindly using the
> > CIPSO tag that the
> > remote host sends could violate the administrator's NetLabel
> > configuration.
> >
> > The current patch reads the CIPSO tag off the child socket,
> > translating the
> > tag according to the CIPSO DOI configuration to arrive at the
> > correct/desired
> > LSM  security attributes.  These LSM security attributes and
> > the "domain" are
> > then used to set the NetLabel on the socket.  In the case
> > where everyone is
> > well behaved this should have no effect on the socket IP
> > options and the
> > packets sent across the wire.  However, in the case of a
> > not-nice remote host
> > the outgoing CIPSO tag may change to match the administrators desired
> > settings.
>
> I wonder if waiting till accept isn't too late though. Perhaps this
> should be done when the openreq is created so the syn-ack and such
> will go out with the right tag?

Stephen Smalley and I had several long discussions about this and my opinion, 
which seemed to be at least acceptable to Stephen, was that it was okay since 
there was no actual data being sent only TCP control messages.  However, like 
I said earlier, the exact details of this are going to change as I am going 
to port the code to use the new accept() LSM hooks so this is really not 
much of a concern anymore ...

-- 
paul moore
linux security @ hp


Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables

2006-07-31 Thread Herbert Xu
On Tue, Aug 01, 2006 at 12:01:03AM +1000, Herbert Xu wrote:
> Ville Nuorvala <[EMAIL PROTECTED]> wrote:
> > 
> >> When locating routes for redirects only the main table is
> >> searched for now. Since policy rules will not be reversible
> >> it is unclear whether it makes sense to change this.
> > 
> > This is a good point. You are absolutely correct about the policy rules.
> > 
> > IIRC, I initially looked through all the tables, but skipped this
> > behavior when I rewrote the code for 2.6.11. Currently I'm once again
> > in favor of looping through them all. This is IMO at least closer to the
> > spirit of RFC 2461 section 8.3. where a host SHOULD update its
> > destination cache upon receiving a redirect. If we don't look through
> > all tables, we can't ensure this happens.
> 
> Without a route cache, I think our only choice is to search through
> all tables.  The same thing applies to PMTU updates as well.

Actually, if we're adding policy routing, we should seriously consider
whether living without a routing cache is still viable or not because
the cost of a route lookup has just gone up.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables

2006-07-31 Thread Herbert Xu
Ville Nuorvala <[EMAIL PROTECTED]> wrote:
> 
>> When locating routes for redirects only the main table is
>> searched for now. Since policy rules will not be reversible
>> it is unclear whether it makes sense to change this.
> 
> This is a good point. You are absolutely correct about the policy rules.
> 
> IIRC, I initially looked through all the tables, but skipped this
> behavior when I rewrote the code for 2.6.11. Currently I'm once again
> in favor of looping through them all. This is IMO at least closer to the
> spirit of RFC 2461 section 8.3. where a host SHOULD update its
> destination cache upon receiving a redirect. If we don't look through
> all tables, we can't ensure this happens.

Without a route cache, I think our only choice is to search through
all tables.  The same thing applies to PMTU updates as well.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables

2006-07-31 Thread Ville Nuorvala
Thomas Graf wrote:

> Adds the framework to support multiple IPv6 routing tables.
> Currently all automatically generated routes are put into the
> same table. This could be changed at a later point after
> considering the produced locking overhead.

Hi Thomas, some minor comments below.

> When locating routes for redirects only the main table is
> searched for now. Since policy rules will not be reversible
> it is unclear whether it makes sense to change this.

This is a good point. You are absolutely correct about the policy rules.

IIRC, I initially looked through all the tables, but skipped this
behavior when I rewrote the code for 2.6.11. Currently I'm once again
in favor of looping through them all. This is IMO at least closer to the
spirit of RFC 2461 section 8.3. where a host SHOULD update its
destination cache upon receiving a redirect. If we don't look through
all tables, we can't ensure this happens.

> Index: net-2.6.git/include/net/ip6_fib.h
> ===
> --- net-2.6.git.orig/include/net/ip6_fib.h
> +++ net-2.6.git/include/net/ip6_fib.h



> @@ -143,12 +146,41 @@ struct rt6_statistics {
>  
>  typedef void (*f_pnode)(struct fib6_node *fn, void *);
>  
> -extern struct fib6_node  ip6_routing_table;
> +struct fib6_table {
> + struct hlist_node   tb6_hlist;
> + u32                 tb6_id;
> + rwlock_t            tb6_lock;
> + struct fib6_node    tb6_root;
> +};
> +
> +#define RT6_TABLE_UNSPEC RT_TABLE_UNSPEC
> +#define RT6_TABLE_MAIN   RT_TABLE_MAIN
> +#define RT6_TABLE_LOCAL  RT6_TABLE_MAIN
> +#define RT6_TABLE_DFLT   RT6_TABLE_MAIN
> +#define RT6_TABLE_INFO   RT6_TABLE_MAIN

IMO it's a bit inconsistent to define a separate table entry for Route
Information generated routes, but not Prefix Information based ones.
What do you say about adding a RT6_TABLE_PRFX?

> Index: net-2.6.git/net/ipv6/route.c
> ===
> --- net-2.6.git.orig/net/ipv6/route.c
> +++ net-2.6.git/net/ipv6/route.c



> @@ -1435,12 +1523,15 @@ static struct rt6_info *rt6_add_route_in
>  struct rt6_info *rt6_get_dflt_router(struct in6_addr *addr, struct 
> net_device *dev)
>  {
>   struct rt6_info *rt;
> - struct fib6_node *fn;
> + struct fib6_table *table;
>  
> - fn = &ip6_routing_table;
> + /* TODO: It might be better to search all tables */
> + table = fib6_get_table(RT6_TABLE_DFLT);

As long as the table for default routes is RT6_TABLE_DFLT and can't be
configured by the user, I think the correct behavior is just to search
RT6_TABLE_DFLT.

Otherwise it looks very good!

Regards,
Ville


RE: [PATCH 2/7] NetLabel: core network changes

2006-07-31 Thread Venkat Yekkirala
> The NetLabel patch allows administrators to assign a specific CIPSO
> DOI/configuration to each LSM "domain".  Blindly using the CIPSO tag that
> the remote host sends could violate the administrator's NetLabel
> configuration.
>
> The current patch reads the CIPSO tag off the child socket, translating the
> tag according to the CIPSO DOI configuration to arrive at the
> correct/desired LSM security attributes.  These LSM security attributes and
> the "domain" are then used to set the NetLabel on the socket.  In the case
> where everyone is well behaved this should have no effect on the socket IP
> options and the packets sent across the wire.  However, in the case of a
> not-nice remote host the outgoing CIPSO tag may change to match the
> administrator's desired settings.

I wonder if waiting till accept isn't too late though. Perhaps this
should be done when the openreq is created so the syn-ack and such
will go out with the right tag?


Linville's L2 rant... -- Re: PATCH Fix bonding active-backup behavior for VLAN interfaces

2006-07-31 Thread John W. Linville
On Mon, Jul 31, 2006 at 10:15:40AM +0200, Christophe Devriese wrote:

> If you bond 2 vlan subinterfaces, the patch is not necessary at all. In that 
> case also the source device will be changed from eth0. to bond. So 
> that's correct behavior no ?
> 
> In the second case, you create vlan subifs on a bonding device, vlan 
> subinterfaces will be created on the slave interfaces. In that case the vlan 

(This is not directed at Christophe, or anyone in particular...)



Am I the only one that thinks that our handling of LAN L2 stuff
is at best a little "too" flexible (and at worst a collection of
nasty hacks)?

I mean, do we really need both the ability to bond multiple vlan
interfaces AND the ability to have vlan interfaces on top of a bond?
How many people really appreciate the subtle(?) differences?

Then throw bridging into the mix!  If I'm using VLANs and bonds in
a bridged environment, do I bridge the bonds, or bond the bridges?
Do the VLANs come before the bonds?  after the bridges?  or somewhere
in-between?  Do all these combinations even work together?  Who has
the definitive answer (besides the code itself)?

I have no doubt that there are plenty of opportunities for cleverness
here (and no doubt dragons too).  I just doubt that most of them
are worth the complexities introduced by our current collection of
"transparently" stackable pseudo-drivers and strategically placed hacks
(e.g. skb_bond).  All that, and it still isn't clear to me how we
can cleanly accommodate 802.1s (which adds VLAN awareness to bridging).

Do we hold the view that our L2 code is on par with the rest of
our code?  Is there an appetite for a clean-up?  Or is it just me?



If you made it this far, thanks for listening...I feel better now. :-)

John
-- 
John W. Linville
[EMAIL PROTECTED]


Re: [PATCH] Fix UDP filter condition when do checksum

2006-07-31 Thread David Miller
From: Wei Yongjun <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 06:33:42 -0400

> In udp_queue_rcv_skb(), the checksum condition is wrong. When a UDP filter
> is set, the checksum is done, but if no UDP filter is set, the checksum
> is not done. So I think this is a BUG.

It is not a bug, we defer the checksum, when we can, to sys_recvmsg()
where we can combine the copy into userspace and the checksum
calculation into one operation.

We cannot do this deferral when there is a filter attached, and that
is why the check is the way it is.


Re: [PATCH] Fix UDP filter condition when do checksum

2006-07-31 Thread Herbert Xu
Wei Yongjun <[EMAIL PROTECTED]> wrote:
> In udp_queue_rcv_skb(), the checksum condition is wrong. When a UDP filter
> is set, the checksum is done, but if no UDP filter is set, the checksum
> is not done. So I think this is a BUG. Following is my patch:
> 
> --- a/net/ipv4/udp.c2006-07-31 09:33:45.392479344 -0400
> +++ b/net/ipv4/udp.c2006-07-31 17:10:41.271632200 -0400
> @@ -1018,7 +1018,7 @@ static int udp_queue_rcv_skb(struct sock
>/* FALLTHROUGH -- it's a UDP Packet */
>}
> 
> -   if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
> +   if (!sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
>if (__udp_checksum_complete(skb)) {
>UDP_INC_STATS_BH(UDP_MIB_INERRORS);
>kfree_skb(skb);

For the record, this isn't correct since the only reason we're computing
a checksum here rather than at recv(2) time is if we have a filter attached.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: BUG: warning at net/core/dev.c:1171/skb_checksum_help() 2.6.18-rc3

2006-07-31 Thread Herbert Xu
On Mon, Jul 31, 2006 at 12:39:51PM +0200, Patrick McHardy wrote:
> 
> These are the patches (some variations tested, but not all) on
> top of Herbert's CHECKSUM_PARTIAL patch. The first one fixes up
> the CHECKSUM_PARTIAL patch for 2.6.18-rc3, the second one fixes
> checksumming in all of netfilter besides ip_queue, the third one
> fixes ip_queue.

Thank you very much for working on this Patrick!

> It's actually not that much, if Herbert is fine with putting the
> CHECKSUM_PARTIAL patch in 2.6.18 I'll do some more testing and
> then I think these can go in as well.

You guys know I'm a coward when it comes to pushing things into rc :)

So I'd rather see a patch to disable the warnings for 2.6.18 so that
the proper fix can be tested more thoroughly.  We should remember that
2.6.18 minus the warning is still going to be heaps better in this
regard than 2.6.17, where all TSO packets were essentially
discarded due to the incorrect checksum (when the NAT module is loaded).

> [NET]: Fix up CHECKSUM_PARTIAL patch for 2.6.18-rc3
> 
> Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

Please merge this with my earlier patch.  I'm not that fussed about
having my changeset go in :)

> diff --git a/net/ipv4/netfilter/ip_nat_core.c 
> b/net/ipv4/netfilter/ip_nat_core.c
> index 1741d55..731efbb 100644
> --- a/net/ipv4/netfilter/ip_nat_core.c
> +++ b/net/ipv4/netfilter/ip_nat_core.c
> @@ -443,7 +443,9 @@ int ip_nat_icmp_reply_translation(struct
>  
>   /* We're actually going to mangle it beyond trivial checksum
>  adjustment, so make sure the current checksum is correct. */
> - if ((*pskb)->ip_summed != CHECKSUM_UNNECESSARY) {
> +
> + if ((*pskb)->ip_summed != CHECKSUM_UNNECESSARY &&
> + (*pskb)->ip_summed != CHECKSUM_PARTIAL) {
>   hdrlen = (*pskb)->nh.iph->ihl * 4;
>   if ((u16)csum_fold(skb_checksum(*pskb, hdrlen,
>   (*pskb)->len - hdrlen, 0)))

Call me picky, but I'd prefer it to actually look like

	switch ((*pskb)->ip_summed) {
	case CHECKSUM_COMPLETE:
		if (!(u16)csum_fold((*pskb)->csum))
			break;
		/* fall through */
	case CHECKSUM_NONE:
		hdrlen = (*pskb)->nh.iph->ihl * 4;
		if ((u16)csum_fold(skb_checksum(*pskb, hdrlen,
						(*pskb)->len - hdrlen, 0)))
			return 0;
	}

just because we probably won't revisit this code path for another
million years to add this optimisation :)
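For readers unfamiliar with the folding step the switch relies on, here is a minimal user-space sketch of the Internet checksum. The names sum16() and fold_csum() are made up for illustration and are not the kernel's skb_checksum()/csum_fold(); the point is only that a sum which already covers a correct checksum field folds to zero, which is what the CHECKSUM_COMPLETE case tests before falling back to a full recomputation.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Add up a buffer as 16-bit big-endian words (an odd trailing byte is
 * padded with zero).  A 32-bit accumulator cannot overflow for buffers
 * up to 64 KiB, which covers any IPv4 packet.
 */
uint32_t sum16(const uint8_t *buf, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	return sum;
}

/*
 * Fold the carries back into 16 bits and complement the result.
 * If the sum covered the checksum field itself, 0 means "verifies".
 */
uint16_t fold_csum(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Computing fold_csum(sum16(hdr, len, 0)) with the checksum field zeroed yields the value to store; running the same call over the finished header yields 0 when nothing was corrupted.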

> diff --git a/net/ipv4/netfilter/ip_nat_helper.c 
> b/net/ipv4/netfilter/ip_nat_helper.c
> index cbcaa45..dd0ddd4 100644
> --- a/net/ipv4/netfilter/ip_nat_helper.c
> +++ b/net/ipv4/netfilter/ip_nat_helper.c
> @@ -165,7 +165,7 @@ ip_nat_mangle_tcp_packet(struct sk_buff 
>  {
>   struct iphdr *iph;
>   struct tcphdr *tcph;
> - int datalen;
> + int oldlen, datalen;
>  
>   if (!skb_make_writable(pskb, (*pskb)->len))
>   return 0;
> @@ -180,13 +180,22 @@ ip_nat_mangle_tcp_packet(struct sk_buff 
>   iph = (*pskb)->nh.iph;
>   tcph = (void *)iph + iph->ihl*4;
>  
> + oldlen = (*pskb)->len - iph->ihl*4;
>   mangle_contents(*pskb, iph->ihl*4 + tcph->doff*4,
>   match_offset, match_len, rep_buffer, rep_len);
>  
>   datalen = (*pskb)->len - iph->ihl*4;
> - tcph->check = 0;
> - tcph->check = tcp_v4_check(tcph, datalen, iph->saddr, iph->daddr,
> -csum_partial((char *)tcph, datalen, 0));
> + if ((*pskb)->ip_summed != CHECKSUM_PARTIAL) {
> + tcph->check = 0;
> + tcph->check = tcp_v4_check(tcph, datalen,
> +iph->saddr, iph->daddr,
> +csum_partial((char *)tcph,
> + datalen, 0));
> + } else
> + tcph->check = nf_proto_csum_update(*pskb,
> +htons(oldlen) ^ 0x,
> +htons(datalen),
> +tcph->check, 1);

OK, this is so incredibly clever that I probably won't understand it
until tomorrow :)
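The trick appears to be the standard incremental checksum update from RFC 1624: instead of re-summing the whole segment, add the complement of the old 16-bit word and the new word to the complemented checksum. The sketch below shows only that general technique in user space; csum_replace16() is a hypothetical name, and this is not the nf_proto_csum_update() helper from the patch, whose exact semantics are not shown in this thread.

```c
#include <stdint.h>

/* Fold carries of a 32-bit ones'-complement accumulator into 16 bits. */
static uint16_t fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * RFC 1624, equation 3:  HC' = ~(~HC + ~m + m')
 * check    : current checksum field (HC)
 * old_word : 16-bit value being replaced (m), e.g. the old length
 * new_word : its replacement (m'), e.g. the new length
 */
uint16_t csum_replace16(uint16_t check, uint16_t old_word, uint16_t new_word)
{
	uint32_t sum = (uint16_t)~check;

	sum += (uint16_t)~old_word;
	sum += new_word;
	return (uint16_t)~fold32(sum);
}
```

Because ones'-complement addition is commutative and byte-order independent, the same update works on network-order values as long as all three arguments use the same byte order.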

> @@ -238,22 +248,30 @@ ip_nat_mangle_udp_packet(struct sk_buff 
>  
>   iph = (*pskb)->nh.iph;
>   udph = (void *)iph + iph->ihl*4;
> +
> + oldlen = (*pskb)->len - iph->ihl*4;
>   mangle_contents(*pskb, iph->ihl*4 + sizeof(*udph),
>   match_offset, match_len, rep_buffer, rep_len);
>  
>   /* update the length of the UDP packet */
> - udph->len = htons((*pskb)->len - iph->ihl*4);
> + datalen = (*pskb)->len - iph->ihl*4;
> + udph->len = htons(datalen);
>  
>   /* fix udp checksum if udp checksum was previously calculated */
> - if (udph->check) {
> - int datalen = (*pskb)->len - iph->ihl * 4;
> + if (!

Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread Herbert Xu
On Mon, Jul 31, 2006 at 03:57:16AM -0700, David Miller wrote:
> 
> So I would say for up to 4 or 5 events, system call overhead alone
> touches as many cache lines as the events themselves.

Absolutely.

The other thing to consider is that events don't come from the hardware.
Events are written by the kernel.  So if user-space is just reading
the events that we've written, then there are no cache misses at all.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency

2006-07-31 Thread Ville Nuorvala
David Miller wrote:
> From: Thomas Graf <[EMAIL PROTECTED]>
> Date: Thu, 27 Jul 2006 00:00:01 +0200
> 
>> (Ab)using rt6_lock wouldn't work anymore if rt6_lock is
>> converted into a per table lock.
>>
>> Signed-off-by: Thomas Graf <[EMAIL PROTECTED]>
> 
> This one looks great.
> 
> Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

Ditto.
Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>




Re: [RFC] Multiple IPV6 Routing Tables & Policy Routing

2006-07-31 Thread Ville Nuorvala
Thomas Graf wrote:
> Hello,

Hi Thomas!

> Thought it might be time to go through a round of comments
> on this work. Even though I've almost rewritten all the code
> the patches are based on the work found on www.mobile-ipv6.org.
> I have no idea which code was written by whom so just email me
> to get the credits right.

The policy routing stuff (multiple tables and source address based
routing) was almost entirely written by me. Therefore you can apply my
name as you see fit ;-)

Tushar Gohad at MontaVista, Benjamin Thery at Bull and of course USAGI
have also worked on the code.

> Main differences to the version found on mobile-ipv6.org is
> that I removed table refcnt and defined that tables cannot
> disappear once created to simplify things and avoid too many
> atomic operations when looking up routes.

Yes, that sounds good. As the ipv6 module doesn't really seem to become
unloadable anytime soon, there isn't really any good reason to refcount
the tables.

> I've replaced the
> table array with a hash table to prepare it for > 255 tables
> and made things aware of the new default router selection
> code and experimental route info stuff added recently.

Good! I never had the time to merge our changes with 2.6.17.
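To make the "hash table, tables never disappear" design concrete, here is a small user-space sketch. All names (rt_table, table_get, table_new, TABLE_HASHSZ) are hypothetical and chosen for illustration; the patch's real lookup function is fib6_get_table(), which is not reproduced here, and locking is deliberately left out.

```c
#include <stdint.h>
#include <stdlib.h>

#define TABLE_HASHSZ 256u	/* illustrative bucket count */

struct rt_table {
	uint32_t id;
	struct rt_table *next;	/* hash chain */
};

static struct rt_table *table_hash[TABLE_HASHSZ];

/* Look up a table by id; no refcount is taken because tables are never freed. */
struct rt_table *table_get(uint32_t id)
{
	struct rt_table *t;

	for (t = table_hash[id % TABLE_HASHSZ]; t; t = t->next)
		if (t->id == id)
			return t;
	return NULL;
}

/* Return the existing table for id, or create it on first use. */
struct rt_table *table_new(uint32_t id)
{
	struct rt_table *t = table_get(id);

	if (t)
		return t;
	t = calloc(1, sizeof(*t));
	if (!t)
		return NULL;
	t->id = id;
	t->next = table_hash[id % TABLE_HASHSZ];
	table_hash[id % TABLE_HASHSZ] = t;
	return t;
}
```

Since an entry is only ever added and never removed, a lookup can hand out the pointer without any atomic reference counting, which is the simplification described above; ids above 255 simply share buckets.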

> It's not final but somewhat working, I'm eager to see comments
> or patches.

I'll try to comment on them the best I can.

> I apologize if I've tramped onto anybody's foot
> by taking this up and submitting it, this isn't meant as an
> attempt to steal credits but rather to pick up good code and
> finally get it upstream after a very long while.

No offense taken! It's great that someone wants to push these things
upstream as I personally have neither had the time nor the energy to do
so lately.

Regards,
Ville


Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread David Miller
From: Evgeniy Polyakov <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 14:50:37 +0400

> In syscall time kevents copy 40bytes for each event + 12 bytes of header 
> (number of events, timeout and command number). That's likely two cache
> lines if only one event is reported.

Do you know how many cachelines are dirtied by system call
entry and exit on typical system?

On sparc64 it is a minimum of 3 64-byte cachelines just to save and
restore the system call time cpu register state.  If application is
deep in a call chain, register windows might spill and each such
register window will dirty 2 more cachelines as they are dumped to the
stack.

I am not even talking about the other basic necessities of doing
a system call such as touching various task_struct and thread_info
state to check for pending signals etc.

System call overhead is non-trivial especially when you are using
it to move only a few small objects into and out of the kernel.

So I would say for up to 4 or 5 events, system call overhead alone
touches as many cache lines as the events themselves.
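The arithmetic behind that estimate can be sketched as follows, assuming the figures quoted in this thread (12-byte header, 40 bytes per event, 64-byte cache lines). The helper names are made up for illustration, and the result depends on how the copy happens to be aligned.

```c
#include <stddef.h>

#define CACHELINE	64u	/* bytes per cache line, as on sparc64 above */
#define KEVENT_HDR	12u	/* header: number of events, timeout, command */
#define KEVENT_SIZE	40u	/* bytes copied per event */

/* Bytes copied at syscall time for n events. */
size_t kevent_copy_bytes(size_t n)
{
	return KEVENT_HDR + n * KEVENT_SIZE;
}

/*
 * Number of cache lines touched by len bytes that start at byte
 * offset off (only off modulo the line size matters).
 */
size_t lines_touched(size_t off, size_t len)
{
	if (len == 0)
		return 0;
	return (off % CACHELINE + len + CACHELINE - 1) / CACHELINE;
}
```

One event is 52 bytes, which fits in a single line only when it starts within the first 12 bytes of the line and otherwise straddles two. Four events (172 bytes) touch three lines when aligned, the same as the three-line minimum cited for the sparc64 register save and restore.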


Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread Evgeniy Polyakov
On Mon, Jul 31, 2006 at 08:35:55PM +1000, Herbert Xu ([EMAIL PROTECTED]) wrote:
> Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> >
> >> - if there is space, report it in the ring buffer.  Yes, the buffer
> >>   can be optional, then all events are reported by the system call.
> > 
> > That requires a copy, which can neglect syscall overhead.
> > Do we really want it to be done?
> 
> Please note that we're talking about events here, not actual data.  So
> only the event is being copied, which is presumably rather small compared
> to the data.

At syscall time kevent copies 40 bytes for each event plus 12 bytes of header
(number of events, timeout and command number). That's likely two cache
lines if only one event is reported.

-- 
Evgeniy Polyakov


Re: [PATCH] SNMPv2 udpInDatagrams counter error

2006-07-31 Thread Herbert Xu
Wei Yongjun <[EMAIL PROTECTED]> wrote:
>
> I also send the same mail several ago, and get no response. You
> patch is fine. But I think following code has no effect:
> 
> if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
> 
> It just let UDP datagrams with checksum error be added into  UDP receive
> queue, and then discard it. I think this can be used to capture a UDP

We normally postpone the checksum computation until the user does a
recv(2).  However, if there is a socket filter attached then we need
to verify the checksum right now because the socket filter will be
applied at the very next step.

So yes, it does let UDP datagrams with checksum errors onto the UDP rcv
queue if there are no socket filters attached, but this is intentional since
we want to postpone the cost of checksum computation until the point when
we have to copy the data to user-space, where it becomes much cheaper.
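The decision described above can be modelled in a few lines of user-space C. This is only a sketch: the struct, enum and function names are hypothetical stand-ins, not the kernel's struct sk_buff or udp_queue_rcv_skb(), and the has_filter flag stands for the sk->sk_filter test.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the relevant skb->ip_summed states. */
enum csum_state { CSUM_NONE, CSUM_UNNECESSARY };

/* Simplified model of a received datagram. */
struct dgram {
	enum csum_state ip_summed;
	bool csum_ok;	/* what a full checksum pass would report */
};

enum rx_action { RX_DROP, RX_QUEUE_VERIFIED, RX_QUEUE_DEFERRED };

/* Decide what happens when the datagram is about to be queued. */
enum rx_action udp_rx_decide(const struct dgram *d, bool has_filter)
{
	if (d->ip_summed == CSUM_UNNECESSARY)
		return RX_QUEUE_VERIFIED;	/* already vouched for */
	if (has_filter)				/* filter runs next: verify now */
		return d->csum_ok ? RX_QUEUE_VERIFIED : RX_DROP;
	return RX_QUEUE_DEFERRED;		/* verify during copy to user */
}
```

In this model a corrupted datagram on an unfiltered socket is queued as RX_QUEUE_DEFERRED and is only discovered to be bad at receive time, which is the intentional behaviour discussed in this thread.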

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: BUG: warning at net/core/dev.c:1171/skb_checksum_help() 2.6.18-rc3

2006-07-31 Thread Patrick McHardy
Patrick McHardy wrote:
> David Miller wrote:
> 
>>I would like to see this fixed for 2.6.18, no later.
>>
>>Either that or disable the bug trap, but taking this route
>>is severely discouraged. :)
> 
> 
> I'm actually updating my patch for this on top of Herbert's
> CHECKSUM_PARTIAL patch right now. Unfortunately I targeted 2.6.19,
> so the fixes are on top of a few cleanups (which uncovered a few
> unrelated bugs as well). I'll post it when I'm done so we can
> decide how to proceed.

These are the patches (some variations tested, but not all) on
top of Herbert's CHECKSUM_PARTIAL patch. The first one fixes up
the CHECKSUM_PARTIAL patch for 2.6.18-rc3, the second one fixes
checksumming in all of netfilter besides ip_queue, the third one
fixes ip_queue.

It's actually not that much, if Herbert is fine with putting the
CHECKSUM_PARTIAL patch in 2.6.18 I'll do some more testing and
then I think these can go in as well.

[NET]: Fix up CHECKSUM_PARTIAL patch for 2.6.18-rc3

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit 17a40f32fc339e9f6feeb042db58d30c8caf2fad
tree 479e926c12606667a91d483223b4416da56227d5
parent 296b866d72ee7a8a577908323f2a7e8e92f4001f
author Patrick McHardy <[EMAIL PROTECTED]> Mon, 31 Jul 2006 09:23:27 +0200
committer Patrick McHardy <[EMAIL PROTECTED]> Mon, 31 Jul 2006 09:23:27 +0200

 include/linux/netdevice.h |    4 ++--
 net/core/dev.c            |    8 ++++----
 net/ipv4/tcp.c            |    4 ++--
 net/ipv4/tcp_ipv4.c       |    2 +-
 net/ipv6/tcp_ipv6.c       |    2 +-
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 75f02d8..b5b9a33 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -973,7 +973,7 @@ extern void		dev_mcast_init(void);
 extern int		netdev_max_backlog;
 extern int		weight_p;
 extern int		netdev_set_master(struct net_device *dev, struct net_device *master);
-extern int skb_checksum_help(struct sk_buff *skb, int inward);
+extern int skb_checksum_help(struct sk_buff *skb);
 extern struct sk_buff *skb_gso_segment(struct sk_buff *skb, int features);
 #ifdef CONFIG_BUG
 extern void netdev_rx_csum_fault(struct net_device *dev);
@@ -1009,7 +1009,7 @@ static inline int netif_needs_gso(struct
 {
 	return skb_is_gso(skb) &&
 	   (!skb_gso_ok(skb, dev->features) ||
-		unlikely(skb->ip_summed != CHECKSUM_HW));
+		unlikely(skb->ip_summed != CHECKSUM_PARTIAL));
 }
 
 #endif /* __KERNEL__ */
diff --git a/net/core/dev.c b/net/core/dev.c
index 90fb267..528c5f3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1157,12 +1157,12 @@ EXPORT_SYMBOL(netif_device_attach);
  * Invalidate hardware checksum when packet is to be mangled, and
  * complete checksum manually on outgoing path.
  */
-int skb_checksum_help(struct sk_buff *skb, int inward)
+int skb_checksum_help(struct sk_buff *skb)
 {
 	unsigned int csum;
 	int ret = 0, offset = skb->h.raw - skb->data;
 
-	if (inward)
+	if (skb->ip_summed == CHECKSUM_COMPLETE)
 		goto out_set_summed;
 
 	if (unlikely(skb_shinfo(skb)->gso_size)) {
@@ -1219,7 +1219,7 @@ struct sk_buff *skb_gso_segment(struct s
 	skb->mac_len = skb->nh.raw - skb->data;
 	__skb_pull(skb, skb->mac_len);
 
-	if (unlikely(skb->ip_summed != CHECKSUM_HW)) {
+	if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) {
 		static int warned;
 
 		WARN_ON(!warned);
@@ -1233,7 +1233,7 @@ struct sk_buff *skb_gso_segment(struct s
 	rcu_read_lock();
 	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & 15], list) {
 		if (ptype->type == type && !ptype->dev && ptype->gso_segment) {
-			if (unlikely(skb->ip_summed != CHECKSUM_HW)) {
+			if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) {
 err = ptype->gso_send_check(skb);
 segs = ERR_PTR(err);
 if (err || skb_gso_ok(skb, features))
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 40ada0b..c452373 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2204,7 +2204,7 @@ struct sk_buff *tcp_tso_segment(struct s
 		th->fin = th->psh = 0;
 
 		th->check = ~csum_fold(th->check + delta);
-		if (skb->ip_summed != CHECKSUM_HW)
+		if (skb->ip_summed != CHECKSUM_PARTIAL)
 			th->check = csum_fold(csum_partial(skb->h.raw, thlen,
 			   skb->csum));
 
@@ -2218,7 +2218,7 @@ struct sk_buff *tcp_tso_segment(struct s
 
 	delta = htonl(oldlen + (skb->tail - skb->h.raw) + skb->data_len);
 	th->check = ~csum_fold(th->check + delta);
-	if (skb->ip_summed != CHECKSUM_HW)
+	if (skb->ip_summed != CHECKSUM_PARTIAL)
 		th->check = csum_fold(csum_partial(skb->h.raw, thlen,
 		   skb->csum));
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 114830f..be056d1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -510,7 +510,7 @@ int tcp_v4_gso_send_check(struct sk_buff
 	th->check = 0;
 	th->check = ~tcp_v4_check(th, skb->len, iph->saddr, iph->daddr, 0);
 	skb->csum = offsetof(struct tcphdr, check);
-	skb->ip_summed = CHECKSUM_HW;
+	skb->ip_summed = CHECKSUM_PARTIAL;
 	return 0;
 }
 
diff --git a/net/ipv6/tcp

Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread Herbert Xu
Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
>
>> - if there is space, report it in the ring buffer.  Yes, the buffer
>>   can be optional, then all events are reported by the system call.
> 
> That requires a copy, which can neglect syscall overhead.
> Do we really want it to be done?

Please note that we're talking about events here, not actual data.  So
only the event is being copied, which is presumably rather small compared
to the data.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC 1/4] kevent: core files.

2006-07-31 Thread Evgeniy Polyakov
On Sat, Jul 29, 2006 at 09:18:47AM -0700, Ulrich Drepper ([EMAIL PROTECTED]) wrote:
> Evgeniy Polyakov wrote:
> > Btw, why do we want mapped ring of ready events?
> > If user requested some event, he definitely wants to get them back when
> > they are ready, and not to check and then get them?
> > Could you please explain more on this issue?
> 
> It of course makes no sense to enter the kernel to actually get the
> event.  This should be done by storing the event in the ring buffer.
> I.e., there are two ways to get an event:
> 
> - with a syscall.  This can report as many events at once as the caller
>   provides space for.  And no event which is reported in the ring buffer
>   should be reported this way
> 
> - if there is space, report it in the ring buffer.  Yes, the buffer
>   can be optional, then all events are reported by the system call.

That requires a copy, which can neglect syscall overhead.
Do we really want it to be done?
 
> So the use case would be like this:
> 
> 
> wait_and_get_event:
> 
>   is buffer empty ?
> 
> yes -> make syscall
> 
> no -> get event from buffer
> 
> 
> To avoid races, the syscall needs to take a parameter indicating the
> last event checked out from the buffer.  If in the meantime the kernel
> put another event in the buffer the syscall immediately returns.
> Similar to what we do in the futex syscall.

And how is "misordering" between the queue and the buffer going to be managed?
I.e. when the buffer is full, events are placed into the queue so that a
syscall can fetch them; if the syscall then takes events from the queue
but not from the buffer, we can end up taking events from the buffer while
older ones are still sitting in the queue.
And how will waiting be done without syscalls? Will glibc take care of
it?
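The scheme under discussion can be sketched in user space as below. This is a single-threaded model with hypothetical names (ring_put, ring_get, wait_would_block); it has no memory barriers and is not the proposed kevent interface. ring_put() returning false marks the point where an event would have to go to the syscall queue instead, which is exactly where the ordering question above arises.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u	/* illustrative; must be a power of two */

struct event {
	uint32_t id;
};

struct ring {
	struct event slots[RING_SIZE];
	uint32_t head;	/* next slot the producer (the kernel) writes */
	uint32_t tail;	/* next slot the consumer (user space) reads */
};

/* Producer side: false means the ring is full. */
bool ring_put(struct ring *r, struct event ev)
{
	if (r->head - r->tail == RING_SIZE)
		return false;
	r->slots[r->head % RING_SIZE] = ev;
	r->head++;
	return true;
}

/* Consumer fast path: fetch an event without entering the kernel. */
bool ring_get(struct ring *r, struct event *out)
{
	if (r->head == r->tail)
		return false;
	*out = r->slots[r->tail % RING_SIZE];
	r->tail++;
	return true;
}

/*
 * Race check modelled on the futex-style argument: the caller passes
 * the head index it saw when it found the ring empty, and the wait
 * only blocks if nothing was added since then.
 */
bool wait_would_block(const struct ring *r, uint32_t last_seen)
{
	return r->head == last_seen;
}
```

A consumer would call ring_get() first and fall back to the blocking wait only when it returns false, passing the head value it observed so that an event added in between makes the wait return immediately.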

> The question is how to best represent the ring buffer.  Zach and some
> others had some ready responses in Ottawa.  The important thing is to
> avoid cache line ping pong when possible.
> 
> Is the ring buffer absolutely necessary?  Probably not.  But it has the
> potential to help quite a bit.  Don't look at the problem to solve in
> the context of heavy I/O operations when another syscall here and there
> doesn't matter.  With this single event mechanism for every possible
> event the kernel can generate programming can look quite different.
> E.g., every read() call can implicitly be changed into an async read
> call followed by a user-level reschedule.  This rescheduling allows
> another thread of execution to run while the read request is processed.
>  I.e., it's basically a setjmp() followed by a goto into the inner loop
> to get the next event.  And now suddenly the event notification
> mechanism really should be as fast as possible.  If we submit basically
> every request asynchronously and are not creating dedicated threads for
> specific tasks anymore we
> 
> a) have a lot more event notifications
> 
> b) the probability of an event being reported when we want to receive
>    the next one is higher (i.e., the case where no syscall vs syscall
>    makes a difference)
> 
> Yes, all this will require changes in the way programs are written but we
> shouldn't limit the way we can write programs unnecessarily.  I think
> that given increasing discrepancies in relative speed/latency of the
> peripherals and the CPU this is one possible solution to keep the CPUs
> busy without resorting to a gazillion separate threads in each program.

Ok, let's do it the following way:
I will present a new version of kevent with the new syscalls and with the
issues mentioned before fixed; while people look at it we can settle on
the mapped buffer design.
Is that ok?

> -- 
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
> 



-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] SNMPv2 udpInDatagrams counter error

2006-07-31 Thread Wei Yongjun
This change does not affect tcpdump; it only prevents a UDP socket filter
from receiving UDP datagrams with checksum errors. It is not a great idea,
but I think it is the best way to resolve this problem. If you want to
capture bad UDP packets, you can use tcpdump.

On Monday 31 July 2006 04:57, Gerrit Renker wrote:
> Hi,
>
> |  if (!sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
> |
> |  IPv6 doesn't do this, so I think delete condition 'sk->sk_filter' is
> | better. Do you think so?
>
> I think the sk->sk_filter is there for a good reason. If you delete it,
> that routine is forced to always compute UDP checksums, even if the only
> receiving application is a tcpdump process. I may be wrong here, but I
> think that deleting the sk_filter statement is not a good idea.
>
> The other alternatives discussed (afaik) so far were:
>
> 1) Move the increment of UDP_MIB_INDATAGRAMS from udp_queue_rcv_skb() to
> udp_recvmsg() (first patch uploaded to
> http://bugzilla.kernel.org/show_bug.cgi?id=6660). This was discussed: not a
> good idea, since in-kernel applications may use the data_ready handler
> rather than udp_recvmsg().
>
> 2) Decrement UDP_MIB_INDATAGRAMS in udp_recvmsg() when the checksum turns
> out to be wrong (second patch uploaded to above address). This would be a
> fix to the problem you are stating, it also solves the problem of missing
> out the data_ready handlers in (1), and was suggested earlier on this
> mailing list.



Re: [3/4] kevent: AIO, aio_sendfile() implementation.

2006-07-31 Thread Suparna Bhattacharya
On Thu, Jul 27, 2006 at 11:44:23AM -0700, Ulrich Drepper wrote:
> Badari Pulavarty wrote:
> > Before we spend too much time cleaning up and merging into mainline -
> > I would like an agreement that what we add is good enough for glibc
> > POSIX AIO.
> 
> I haven't seen a description of the interface so far.  Would be good if

Did Sébastien's mail with the description help ? 

> it existed.  But I briefly mentioned one quirk in the interface about
> which Suparna wasn't sure whether it's implemented/implementable in the
> current interface.
> 
> If a lio_listio call is made the individual requests are handled just as
> if they'd been issued separately.  I.e., the notification specified in the
> individual aiocb is performed when the specific request is done.  Then,
> once all requests are done, another notification is made, this time
> controlled by the sigevent parameter of lio_listio.

Looking at the code in lio kernel patch, this should be already covered:

if (iocb->ki_signo)
__aio_send_signal(iocb);

+   if (iocb->ki_lio)
+   lio_check(iocb->ki_lio);

That is, it first checks the notification in the individual iocb, and then
the one for the LIO.

> 
> 
> Another feature which I always wanted: the current lio_listio call
> returns in blocking mode only if all requests are done.  In non-blocking
> mode it returns immediately and the program needs to poll the aiocbs.
> What is needed is something in the middle.  For instance, if multiple
> read requests are issued the program might be able to start working as
> soon as one request is satisfied.  I.e., a call similar to lio_listio
> would be nice which also takes another parameter specifying how many of
> the NENT aiocbs have to finish before the call returns.

I imagine the kernel could enable this by incorporating this additional
parameter for IOCB_CMD_GROUP in the ABI (in the default case this should be the
same as the total number of iocbs submitted to lio_listio). Now should the
at least NENT check apply only to LIO_WAIT or also to the LIO_NOWAIT
notification case ? 

BTW, the native io_getevents does support a min_nr wakeup already, except that
it applies to any iocb on the io_context, and not just a given lio_listio call.
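
The difference between a context-wide minimum and a per-list minimum can be illustrated with a small model. This is a sketch only: `events_until_wakeup()`, `ANY_GROUP` and the group ids are hypothetical and do not correspond to any existing kernel or libaio interface.

```c
#include <assert.h>
#include <stddef.h>

#define ANY_GROUP (-1)  /* hypothetical: count completions of every group */

/*
 * event_group[i] is the (non-negative) id of the list whose request
 * completed as event i. Returns how many events a waiter has to see before
 * min_nr matching completions have arrived, or -1 if that never happens
 * within n events.
 */
int events_until_wakeup(const int *event_group, int n, int min_nr, int group)
{
    int i, counted = 0;

    assert(event_group != NULL && n >= 0 && min_nr > 0);

    for (i = 0; i < n; i++) {
        if (group == ANY_GROUP || event_group[i] == group) {
            if (++counted >= min_nr)
                return i + 1;
        }
    }
    return -1;
}
```

For the event sequence 7, 3, 7, 3, 3 and min_nr = 2, a context-wide waiter wakes after two events, while a waiter interested only in list 3 wakes after four.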

Regards
Suparna


-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



[RFC] gre: transparent ethernet bridging

2006-07-31 Thread Philip Craig
This patch implements transparent ethernet bridging for gre tunnels.
There are a few outstanding issues.

There is no way for userspace to select the type of gre tunnel. The
#if 0 near the top of the patch forces all gre tunnels to be bridges.
The problem is that userspace uses an IPPROTO_ to select the type of
tunnel, but both types of gre tunnel are IPPROTO_GRE. I can't see
anything else in struct ip_tunnel_parm that could be used to select
this. One approach that I've seen mentioned in the archives is to add
a netlink interface to replace the tunnel ioctls.

Network loops are bad. See the comments at the top of ip_gre.c for
a description of how gre tunnels handle this normally. But for gre
bridges, we don't want to copy the ttl (it breaks routing protocols),
and we don't want to force DF (we want to bridge 1500 byte packets).
I couldn't think of any solution for this.

Some routers set LLC_SAP_BSPAN in the gre protocol field, and then
give the bpdu packet without any other ethernet/llc header. This patch
currently tries to fake the ethernet/llc header before passing the
packet up, but it is buggy (mac addresses are wrong at least). Maybe a
better approach is to call directly into the bridging code. I didn't try
that at first because it isn't modular, and may break other things that
want to see the packet.


--- linux-2.6.x/net/ipv4/ip_gre.c   18 Jun 2006 23:30:56 -  1.1.1.33
+++ linux-2.6.x/net/ipv4/ip_gre.c   31 Jul 2006 09:57:41 -
@@ -30,6 +30,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 #include 
 #include 
@@ -41,6 +43,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 #ifdef CONFIG_IPV6
 #include 
@@ -119,6 +123,7 @@

 static int ipgre_tunnel_init(struct net_device *dev);
 static void ipgre_tunnel_setup(struct net_device *dev);
+static void ipgre_ether_tunnel_setup(struct net_device *dev);

 /* Fallback tunnel: no source, no destination, no key, no options */

@@ -274,7 +279,11 @@ static struct ip_tunnel * ipgre_tunnel_l
goto failed;
}

+#if 0
dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup);
+#else
+   dev = alloc_netdev(sizeof(*t), name, ipgre_ether_tunnel_setup);
+#endif
if (!dev)
  return NULL;

@@ -550,6 +559,68 @@ ipgre_ecn_encapsulate(u8 tos, struct iph
return INET_ECN_encapsulate(tos, inner);
 }

+__be16 ipgre_type_trans(struct sk_buff *skb, int offset)
+{
+   u8 *h = skb->data;
+   __be16 flags = *(__be16*)h;
+   __be16 proto = *(__be16*)(h + 2);
+
+   /* WCCP version 1 and 2 protocol decoding.
+* - Change protocol to IP
+* - When dealing with WCCPv2, Skip extra 4 bytes in GRE header
+*/
+   if (flags == 0 &&
+   proto == __constant_htons(ETH_P_WCCP)) {
+   proto = __constant_htons(ETH_P_IP);
+   if ((*(h + offset) & 0xF0) != 0x40)
+   offset += 4;
+   }
+
+   skb->mac.raw = skb->nh.raw;
+   skb->nh.raw = __pskb_pull(skb, offset);
+   skb_postpull_rcsum(skb, skb->h.raw, offset);
+#ifdef CONFIG_NET_IPGRE_BROADCAST
+   if (MULTICAST(iph->daddr)) {
+   /* Looped back packet, drop it! */
+   if (((struct rtable*)skb->dst)->fl.iif == 0)
+   return 0;
+   /* tunnel->stat.multicast++; */
+   skb->pkt_type = PACKET_BROADCAST;
+   }
+#endif
+
+   return proto;
+}
+
+extern const u8 br_group_address[ETH_ALEN];
+
+__be16 ipgre_ether_type_trans(struct sk_buff *skb, struct net_device *dev,
+ int offset)
+{
+   u8 *h = skb->data;
+   __be16 proto = *(__be16*)(h + 2);
+
+   if (proto == htons(ETH_P_BRIDGE)) {
+   if (!pskb_may_pull(skb, offset + ETH_HLEN))
+   return 0;
+   skb_pull_rcsum(skb, offset);
+   return eth_type_trans(skb, dev);
+   } else if (proto == htons(LLC_SAP_BSPAN)) {
+   skb_pull_rcsum(skb, offset);
+
+   llc_pdu_header_init(skb, LLC_PDU_TYPE_U, LLC_SAP_BSPAN,
+   LLC_SAP_BSPAN, LLC_PDU_CMD);
+   llc_pdu_init_as_ui_cmd(skb);
+
+   llc_mac_hdr_init(skb, dev->dev_addr, dev->dev_addr);
+   skb_pull(skb, ETH_HLEN);
+
+   return htons(ETH_P_802_2);
+   }
+
+   return 0;
+}
+
 static int ipgre_rcv(struct sk_buff *skb)
 {
struct iphdr *iph;
@@ -603,32 +674,8 @@ static int ipgre_rcv(struct sk_buff *skb
if ((tunnel = ipgre_tunnel_lookup(iph->saddr, iph->daddr, key)) != 
NULL) {
secpath_reset(skb);

-   skb->protocol = *(u16*)(h + 2);
-   /* WCCP version 1 and 2 protocol decoding.
-* - Change protocol to IP
-* - When dealing with WCCPv2, Skip extra 4 bytes in GRE header
-*/
-   if (flags == 0 &&
-   skb->protocol == __constant_htons(ETH_P_W

Re: [patch 1/1]SNMPv2 "ipv6IfStatsOutFragCreates" counter error

2006-07-31 Thread YOSHIFUJI Hideaki / 吉藤英明
Hello.

The patch seems sane to me.

In article <[EMAIL PROTECTED]> (at Tue, 01 Aug 2006 05:45:39 -0400), weidong 
<[EMAIL PROTECTED]> says:

> signed-off-by: Wei Dong <[EMAIL PROTECTED]>
Acked-by: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>

--yoshfuji


Re: [patch 0/1]SNMPv2 "ipv6IfStatsInHdrErrors" counter error

2006-07-31 Thread YOSHIFUJI Hideaki / 吉藤英明
Hello.

Next time, please put your "Signed-off-by" line before the patch.
Thank you.

In article <[EMAIL PROTECTED]> (at Tue, 01 Aug 2006 05:45:33 -0400), weidong 
<[EMAIL PROTECTED]> says:

> signed-off-by:Wei Dong <[EMAIL PROTECTED]>
Acked-by: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>

--yoshfuji


IBM (Lenovo) T60: e1000 driver high latency

2006-07-31 Thread Thomas Glanzmann
Hello,

   [ resend because .config and the used kernel version was missing ]

Linux Kernel Version: Linus Vanilla Tree; .config attached.

I recently acquired a Lenovo (IBM) T60 with an e1000 network card. I
experience high latency with this network card: pings take up to 1 second
where the ping should be around 25 ms. I googled a bit and found the
following:

- Enable NAPI, which didn't work for me.

64 bytes from 192.168.0.223: icmp_seq=30 ttl=64 time=1004 ms
64 bytes from 192.168.0.223: icmp_seq=31 ttl=64 time=0.444 ms
64 bytes from 192.168.0.223: icmp_seq=32 ttl=64 time=1006 ms
64 bytes from 192.168.0.223: icmp_seq=33 ttl=64 time=0.739 ms
64 bytes from 192.168.0.223: icmp_seq=34 ttl=64 time=1006 ms
64 bytes from 192.168.0.223: icmp_seq=35 ttl=64 time=0.603 ms
64 bytes from 192.168.0.223: icmp_seq=36 ttl=64 time=1001 ms
64 bytes from 192.168.0.223: icmp_seq=37 ttl=64 time=0.736 ms

02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Lenovo Unknown device 2001
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- #
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.18-rc3
# Mon Jul 31 17:53:27 2006
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_TASKSTATS is not set
CONFIG_SYSCTL=y
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_CPUSETS is not set
# CONFIG_RELAY is not set
CONFIG_INITRAMFS_SOURCE=""
CONFIG_UID16=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
# CONFIG_EMBEDDED is not set
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_RT_MUTEXES=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_LBD=y
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"

#
# Processor type and features
#
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
CONFIG_MPENTIUMM=y
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_NR_CPUS=8
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
CONFIG_X86_MCE_P4THERMAL=y
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
CONFIG_MICROCODE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y

#
# Firmware Drivers

[PATCH] Fix UDP filter condition when do checksum

2006-07-31 Thread Wei Yongjun
In udp_queue_rcv_skb(), the checksum condition is wrong. When a UDP filter is
set, the checksum is verified, but if no UDP filter is set, the checksum is
not verified. So I think this is a bug. Following is my patch:

--- a/net/ipv4/udp.c2006-07-31 09:33:45.392479344 -0400
+++ b/net/ipv4/udp.c2006-07-31 17:10:41.271632200 -0400
@@ -1018,7 +1018,7 @@ static int udp_queue_rcv_skb(struct sock
/* FALLTHROUGH -- it's a UDP Packet */
}
 
-   if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
+   if (!sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
if (__udp_checksum_complete(skb)) {
UDP_INC_STATS_BH(UDP_MIB_INERRORS);
kfree_skb(skb);


Signed-off-by: Wei Yongjun <[EMAIL PROTECTED]>


[patch 0/1]SNMPv2 "ipv6IfStatsInHdrErrors" counter error

2006-07-31 Thread weidong
Hi, All
  When I tested the "ipv6IfStatsInHdrErrors" statistic on Linux kernel
2.6.17.7, I found that this counter is not incremented correctly. The
reference is RFC2465:
  ipv6IfStatsInHdrErrors OBJECT-TYPE
  SYNTAX Counter32
  MAX-ACCESS read-only
  STATUS current
  DESCRIPTION
 "The number of input datagrams discarded due to
 errors in their IPv6 headers, including version
 number mismatch, other format errors, hop count
 exceeded, errors discovered in processing their
 IPv6 options, etc."
  ::= { ipv6IfStatsEntry 2 }

When I send a packet with TTL (hop limit) 0 or 1 to a router that would have
to forward it, the router just sends an ICMPv6 message (ICMPV6_TIME_EXCEED,
ICMPV6_EXC_HOPLIMIT) back to the sender, but does not increment this counter
(in the function ip6_forward).
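
The behaviour the report asks for can be sketched in plain C. This is an illustrative model only: `struct ip6_stats` and `forward_packet()` are hypothetical names, and the real ip6_forward() does considerably more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-interface counters. */
struct ip6_stats {
    unsigned long in_hdr_errors;  /* models IPSTATS_MIB_INHDRERRORS */
    unsigned long time_exceeded;  /* ICMPv6 Time Exceeded messages sent */
    unsigned long forwarded;      /* packets passed on to the next hop */
};

/* Returns 0 if the packet is forwarded, -1 if it is dropped. */
int forward_packet(struct ip6_stats *st, unsigned int hop_limit)
{
    assert(st != NULL);

    if (hop_limit <= 1) {
        st->time_exceeded++;   /* tell the sender */
        st->in_hdr_errors++;   /* the increment the report says is missing */
        return -1;
    }
    st->forwarded++;
    return 0;
}
```

In this model every packet dropped for an exhausted hop limit is visible in both the ICMPv6 count and the header-error count.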

The following is the patch for this issue. 

diff -ruN old/net/ipv6/ip6_output.c new/net/ipv6/ip6_output.c
--- old/net/ipv6/ip6_output.c   2006-07-25 11:36:01.0 +0800
+++ new/net/ipv6/ip6_output.c   2006-07-31 16:16:13.0 +0800
@@ -356,6 +356,7 @@
skb->dev = dst->dev;
icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT,
0, skb->dev);
+   IP6_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS);
 
kfree_skb(skb);
return -ETIMEDOUT;

Signed-off-by: Wei Dong <[EMAIL PROTECTED]>

Regards
Wei Dong



[patch 1/1]SNMPv2 "ipv6IfStatsOutFragCreates" counter error

2006-07-31 Thread weidong
Hi, All
  When I tested the "ipv6IfStatsOutFragCreates" statistic on Linux kernel
2.6.17.7, I found that it is not incremented correctly. The reference is
RFC 2465:

  ipv6IfStatsOutFragCreates OBJECT-TYPE
  SYNTAX  Counter32
  MAX-ACCESS  read-only
  STATUS  current
  DESCRIPTION
 "The number of output datagram fragments that have
 been generated as a result of fragmentation at
 this output interface."
  ::= { ipv6IfStatsEntry 15 }

I think there are two issues in the Linux kernel.
1st:
RFC2465 specifies that the counter is "The number of output datagram
fragments...". I think it is better to increment this counter only after a
fragment has been output successfully. It should not be incremented when a
fragment is created but fails to be output.

2nd:
Suppose we send a big ICMP/ICMPv6 echo request to a host and receive an
ICMP/ICMPv6 echo reply consisting of several fragments. In the Linux kernel
the first fragmentation step occurs in the ICMP layer (maybe "transport
layer" is the better term), but this is not the "real" fragmentation: it
just does some "pre-fragmentation", i.e. allocates space for the data,
forms a frag_list, etc. The "real" fragmentation happens in the IP layer,
which sets the offset, the MF flag and so on. So I think that in the "fast
path" of ip_fragment/ip6_fragment, when we send a fragment that was
"pre-fragmented" by the upper layer, we should also increment
"ipv6IfStatsOutFragCreates".
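
The first point, counting a fragment only once it has been sent successfully, can be sketched as follows. The names `struct frag_stats`, `send_fragments()` and the two callbacks are hypothetical and exist only for this illustration; they are not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical output callback: returns 0 on success, nonzero on failure. */
typedef int (*output_fn)(int frag_index);

struct frag_stats {
    unsigned long frag_creates;  /* models IPSTATS_MIB_FRAGCREATES */
    unsigned long frag_oks;      /* models IPSTATS_MIB_FRAGOKS */
    unsigned long frag_fails;    /* models IPSTATS_MIB_FRAGFAILS */
};

/* Send nfrags fragments in order, stopping at the first output failure. */
int send_fragments(struct frag_stats *st, int nfrags, output_fn output)
{
    int i, err;

    assert(st != NULL && output != NULL && nfrags > 0);

    for (i = 0; i < nfrags; i++) {
        err = output(i);
        if (err) {
            st->frag_fails++;
            return err;
        }
        st->frag_creates++;  /* counted only after a successful send */
    }
    st->frag_oks++;
    return 0;
}

/* Example callbacks: one always succeeds, one fails on the third fragment. */
int output_ok(int frag_index) { (void)frag_index; return 0; }
int output_fail_third(int frag_index) { return frag_index == 2 ? -1 : 0; }
```

If the third of four fragments fails, only the two fragments sent before it are counted.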

The following is the patch for the issues mentioned above:

diff -ruN old/net/ipv4/ip_output.c new/net/ipv4/ip_output.c
--- old/net/ipv4/ip_output.c2006-07-25 11:36:01.0 +0800
+++ new/net/ipv4/ip_output.c2006-07-31 16:24:57.0 +0800
@@ -527,6 +527,8 @@
 
err = output(skb);
 
+   if (!err)
+   IP_INC_STATS(IPSTATS_MIB_FRAGCREATES);
if (err || !frag)
break;
 
@@ -650,9 +652,6 @@
/*
 *  Put this fragment into the sending queue.
 */
-
-   IP_INC_STATS(IPSTATS_MIB_FRAGCREATES);
-
iph->tot_len = htons(len + hlen);
 
ip_send_check(iph);
@@ -660,6 +659,8 @@
err = output(skb2);
if (err)
goto fail;
+
+   IP_INC_STATS(IPSTATS_MIB_FRAGCREATES);
}
kfree_skb(skb);
IP_INC_STATS(IPSTATS_MIB_FRAGOKS);
diff -ruN old/net/ipv6/ip6_output.c new/net/ipv6/ip6_output.c
--- old/net/ipv6/ip6_output.c   2006-07-25 11:36:01.0 +0800
+++ new/net/ipv6/ip6_output.c   2006-07-31 16:24:21.0 +0800
@@ -593,6 +593,9 @@
}

err = output(skb);
+   if(!err)
+   IP6_INC_STATS(IPSTATS_MIB_FRAGCREATES);
+   
if (err || !frag)
break;
 
@@ -704,12 +707,11 @@
/*
 *  Put this fragment into the sending queue.
 */
-
-   IP6_INC_STATS(IPSTATS_MIB_FRAGCREATES);
-
err = output(frag);
if (err)
goto fail;
+
+   IP6_INC_STATS(IPSTATS_MIB_FRAGCREATES);
}
kfree_skb(skb);
IP6_INC_STATS(IPSTATS_MIB_FRAGOKS);

Signed-off-by: Wei Dong <[EMAIL PROTECTED]>
Regards
Wei Dong



Re: [IPV6]: Audit all ip6_dst_lookup/ip6_dst_store calls

2006-07-31 Thread Herbert Xu
On Sun, Jul 30, 2006 at 10:32:10PM -0500, Matt Domsch wrote:
> 
> I applied this on 2.6.18-rc3, and it panics immediately as the first
> IPv6 TCP (ssh) session is initiated to the system.

Executive summary:

1) We resolved one lockdep warning only to stumble onto another lockdep
validator bug.  

2) There is something broken in the x86_64 unwind code which is causing
it to panic just about every time somebody calls dump_stack().

Andi, this is the second time I've seen a report where an otherwise
harmless dump_stack call (the other one was caused by a WARN_ON) gets
turned into a panic by the stack unwind code on x86_64.  This particular
report is with 2.6.18-rc3 so it looks like whatever bug is causing it
hasn't been fixed yet.

Could you please have a look at it? Thanks.

> =
> [ INFO: possible recursive locking detected ]
> -
> swapper/0 is trying to acquire lock:
>  (slock-AF_INET6){-+..}, at: [] sk_clone+0xd2/0x3a8
> 
> but task is already holding lock:
>  (slock-AF_INET6){-+..}, at: [] tcp_v6_rcv+0x30e/0x76e 
> [ipv6]
> 
> other info that might help us debug this:
> 1 lock held by swapper/0:
>  #0:  (slock-AF_INET6){-+..}, at: [] tcp_v6_rcv+0x30e/0x76e 
> [ipv6]
> 
> stack backtrace:
> 
> Call Trace:
>  [] show_trace+0xae/0x30e
>  [] dump_stack+0x15/0x17
>  [] __lock_acquire+0x12e/0xa18
>  [] lock_acquire+0x4b/0x69
>  [] _spin_lock+0x25/0x31
>  [] sk_clone+0xd2/0x3a8
>  [] inet_csk_clone+0x11/0x6f
>  [] tcp_create_openreq_child+0x24/0x49c
>  [] :ipv6:tcp_v6_syn_recv_sock+0x2c5/0x6be
>  [] tcp_check_req+0x1d1/0x326
>  [] :ipv6:tcp_v6_do_rcv+0x15d/0x372
>  [] :ipv6:tcp_v6_rcv+0x71f/0x76e
>  [] :ipv6:ip6_input+0x223/0x315
>  [] :ipv6:ipv6_rcv+0x254/0x2af
>  [] netif_receive_skb+0x260/0x2dd
>  [] :e1000:e1000_clean_rx_irq+0x423/0x4c2
>  [] :e1000:e1000_clean+0x88/0x17d
>  [] net_rx_action+0xac/0x1d1
>  [] __do_softirq+0x68/0xf5
>  [] call_softirq+0x1e/0x28

Now let's move onto the lockdep validator bug :)

Ingo/Arjen, I thought we've already fixed this before but somehow I
can't find anything in the email archives so perhaps I'm mixing it up
with another recursive lock false-positive.

The problem here is really quite simple: when we accept a TCP connection
there are two sockets involved.  The listening socket which existed before
the connection came in and the socket we construct for the newly arrived
connection.

The code works something like this:

* Take slock on listening socket.
* Construct child socket.
* Take slock on child socket.

As we never do the locking in the opposite direction (child followed by
listening socket) this is safe.

So perhaps we need to add some extra annotation in sk_clone?
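
To illustrate why a class-based checker complains here and what a nesting annotation changes, here is a deliberately tiny model. It is a sketch under assumptions: `struct task`, `acquire()` and the subclass numbers are hypothetical, and the real lockdep validator is far more involved.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HELD 8

/* Hypothetical, much simplified record of one held lock. */
struct held_lock {
    int lock_class;  /* e.g. one class shared by all AF_INET6 socket locks */
    int subclass;    /* nesting annotation, 0 when none is given */
};

struct task {
    struct held_lock held[MAX_HELD];
    int depth;
};

/*
 * Record an acquisition. Returns 1 if this model would report "possible
 * recursive locking" (same class and subclass already held), 0 otherwise.
 */
int acquire(struct task *t, int lock_class, int subclass)
{
    int i, warn = 0;

    assert(t != NULL && t->depth < MAX_HELD);

    for (i = 0; i < t->depth; i++) {
        if (t->held[i].lock_class == lock_class &&
            t->held[i].subclass == subclass)
            warn = 1;
    }
    t->held[t->depth].lock_class = lock_class;
    t->held[t->depth].subclass = subclass;
    t->depth++;
    return warn;
}
```

Taking the listening socket's lock and then the child's lock with the same class and subclass trips the check; giving the child acquisition a different subclass does not, which is the kind of annotation being asked about.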

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] SNMPv2 udpInDatagrams counter error

2006-07-31 Thread Wei Yongjun
Yes, you are right.
I also sent the same mail some time ago and got no response. Your
patch is fine. But I think the following code has no effect:

if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {

It just lets UDP datagrams with a checksum error be added to the UDP receive
queue, where they are then discarded. I think this could be used to capture
UDP datagrams with a filter. But if a filter is used to capture UDP
datagrams, the code should look like this:

if (!sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {

IPv6 doesn't do this, so I think deleting the 'sk->sk_filter' condition is
better. Do you think so?

In particular, did you try sending UDP datagrams with a checksum error to
echo-udp (port 7)? Maybe your patch will cause neither udpInDatagrams nor
udpInErrors to be incremented, because in my test those datagrams were
delivered to echo-udp and I got an echo reply.

- Original Message - 
From: "Gerrit Renker" <[EMAIL PROTECTED]>
To: 
Sent: Monday, July 31, 2006 4:19 PM
Subject: Re: [PATCH] SNMPv2 udpInDatagrams counter error


> This has been raised earlier, cf.
> http://bugzilla.kernel.org/show_bug.cgi?id=6660
> 
> Wei Yongjun wrote:
> |  When I send a UDP datagram with a checksum error to a target, I found that
> |  under IPv6 the counter udpInErrors is incremented, but under IPv4 the
> |  counter udpInDatagrams is incremented. I looked into the source code and
> |  found that under IPv4, UDP datagrams with a checksum error are delivered to
> |  the UDP receive queue, but under IPv6 they are not. IPv4 delivers the
> |  datagram into the UDP receive queue, increments udpInDatagrams, and then
> |  discards it before it is delivered to the UDP user. The RFC says
> |  udpInDatagrams is the total number of UDP datagrams delivered to UDP users,
> |  so udpInDatagrams should not be incremented when a UDP datagram with a
> |  checksum error is received.
> |  
> |  Refer to RFC2013:
> |udpInDatagrams OBJECT-TYPE
> |SYNTAX  Counter32
> |MAX-ACCESS  read-only
> |STATUS  current
> |DESCRIPTION
> |"The total number of UDP datagrams delivered to UDP
> |  users."
> |::= { udp 1 }
> |  
> |  Following is my patch:
> |  
> |  --- a/net/ipv4/udp.c 2006-07-31 09:33:45.392479344 -0400
> |  +++ b/net/ipv4/udp.c 2006-07-31 09:34:26.430240656 -0400
> |  @@ -1018,7 +1018,7 @@ static int udp_queue_rcv_skb(struct sock
> |   /* FALLTHROUGH -- it's a UDP Packet */
> |   }
> |   
> |  - if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
> |  + if (skb->ip_summed != CHECKSUM_UNNECESSARY) {
> |   if (__udp_checksum_complete(skb)) {
> |   UDP_INC_STATS_BH(UDP_MIB_INERRORS);
> |   kfree_skb(skb);
> |  
> |  Signed-off-by: Wei Yongjun <[EMAIL PROTECTED]>


Re: [PATCH] SNMPv2 udpInDatagrams counter error

2006-07-31 Thread Gerrit Renker
Hi,
  
|  if (!sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
|  
|  IPv6 doesn't do this, so I think delete condition 'sk->sk_filter' is better. 
|  Do you think so?
I think the sk->sk_filter is there for a good reason. If you delete it, that
routine is forced to always compute UDP checksums, even if the only receiving
application is a tcpdump process. I may be wrong here, but I think that
deleting the sk_filter statement is not a good idea.
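
The three conditions under discussion differ only in when the checksum is verified at queue time. A small model makes the comparison explicit. This is a sketch: `enum variant` and `verifies_at_queue_time()` are hypothetical names, `has_filter` stands for sk->sk_filter being set, and `needs_check` stands for skb->ip_summed != CHECKSUM_UNNECESSARY.

```c
#include <assert.h>

enum variant {
    ORIGINAL,      /* sk->sk_filter && needs_check  */
    NEGATED,       /* !sk->sk_filter && needs_check */
    UNCONDITIONAL  /* needs_check                   */
};

/*
 * Returns 1 if the checksum would be verified when the datagram is queued,
 * 0 if verification is left to a later stage.
 */
int verifies_at_queue_time(enum variant v, int has_filter, int needs_check)
{
    assert(v == ORIGINAL || v == NEGATED || v == UNCONDITIONAL);

    switch (v) {
    case ORIGINAL:
        return has_filter && needs_check;
    case NEGATED:
        return !has_filter && needs_check;
    default:
        return needs_check != 0;
    }
}
```

Only the unconditional variant verifies every datagram that still needs a check at queue time, which is the extra cost described above; the other two defer it for either unfiltered or filtered sockets.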

The other alternatives discussed (afaik) so far were:

1) Move the increment of UDP_MIB_INDATAGRAMS from udp_queue_rcv_skb() to
   udp_recvmsg() (first patch uploaded to
   http://bugzilla.kernel.org/show_bug.cgi?id=6660). This was discussed: not
   a good idea, since in-kernel applications may use the data_ready handler
   rather than udp_recvmsg().

2) Decrement UDP_MIB_INDATAGRAMS in udp_recvmsg() when the checksum turns out
   to be wrong (second patch uploaded to above address). This would be a fix
   to the problem you are stating, it also solves the problem of missing out
   the data_ready handlers in (1), and was suggested earlier on this mailing
   list.
  
-- Gerrit


Re: [PATCH] SNMPv2 udpInDatagrams counter error

2006-07-31 Thread Gerrit Renker
This has been raised earlier, cf. 
http://bugzilla.kernel.org/show_bug.cgi?id=6660

Wei Yongjun wrote:
|  When I send a UDP datagram with a checksum error to a target, I found that
|  under IPv6 the counter udpInErrors is incremented, but under IPv4 the
|  counter udpInDatagrams is incremented. I looked into the source code and
|  found that under IPv4, UDP datagrams with a checksum error are delivered to
|  the UDP receive queue, but under IPv6 they are not. IPv4 delivers the
|  datagram into the UDP receive queue, increments udpInDatagrams, and then
|  discards it before it is delivered to the UDP user. The RFC says
|  udpInDatagrams is the total number of UDP datagrams delivered to UDP users,
|  so udpInDatagrams should not be incremented when a UDP datagram with a
|  checksum error is received.
|  
|  Refer to RFC2013:
|udpInDatagrams OBJECT-TYPE
|SYNTAX  Counter32
|MAX-ACCESS  read-only
|STATUS  current
|DESCRIPTION
|"The total number of UDP datagrams delivered to UDP
|  users."
|::= { udp 1 }
|  
|  Following is my patch:
|  
|  --- a/net/ipv4/udp.c 2006-07-31 09:33:45.392479344 -0400
|  +++ b/net/ipv4/udp.c 2006-07-31 09:34:26.430240656 -0400
|  @@ -1018,7 +1018,7 @@ static int udp_queue_rcv_skb(struct sock
|   /* FALLTHROUGH -- it's a UDP Packet */
|   }
|   
|  -if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
|  +if (skb->ip_summed != CHECKSUM_UNNECESSARY) {
|   if (__udp_checksum_complete(skb)) {
|   UDP_INC_STATS_BH(UDP_MIB_INERRORS);
|   kfree_skb(skb);
|  
|  Signed-off-by: Wei Yongjun <[EMAIL PROTECTED]>


Re: PATCH Fix bonding active-backup behavior for VLAN interfaces

2006-07-31 Thread Christophe Devriese
On Monday 31 July 2006 05:50, you wrote:
> From: Ben Greear <[EMAIL PROTECTED]>
> Date: Fri, 28 Jul 2006 14:55:17 -0700
>
> > The skb_bond method assigns skb->dev when it does the 'keep',
> > but the VLAN code immediately over-writes the skb->dev when
> > searching for the vlan device.
> >
> > What is the purpose of assigning skb->dev to the master device?
>
> This makes me consider this patch highly dubious, at best.
>
> The whole intention of bonding on input is to make all packets
> incoming on the individual bond slaves to look like they come in via
> the master device.
>
> Therefore, even when the bond slaves are VLAN devices, in the end the
> skb->dev should be the bond master device _not_ the VLAN device.
>
> I'm not applying this patch, it doesn't look correct at all.

That code is not introduced by this patch, but is already in the kernel. This 
patch is about having the same behavior for the vlan accelerated input path 
and the normal input path.

If you bond 2 vlan subinterfaces, the patch is not necessary at all. In that 
case also the source device will be changed from eth0. to bond. So 
that's correct behavior no ?

In the second case, where you create vlan subifs on a bonding device, vlan
subinterfaces will be created on the slave interfaces. In that case the vlan
code will reassign skb->dev, but skb_bond needs to know the actual input
device in order to make an informed drop decision before that happens
(active-backup mode needs to drop packets from the backup slave interface;
if you don't do that you get big problems with broadcasts).

The same struct vlan_group is assigned to all slave devices and so the only 
vlan subinterfaces that exist in this case are the bond. 
subinterfaces, and the vlan path for both slaves will assign the 
bond. interface to skb->dev, thereby erasing the information about 
where the packet came from.
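
The ordering problem can be shown with a small model: the drop decision has to be made while the real input device is still known. This is only a sketch under assumptions; `struct device`, `struct packet`, `should_drop()` and `vlan_accel_rx()` are hypothetical names, and the device names in the usage are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified device and packet types. */
struct device {
    const char *name;
    int is_backup_slave;  /* inactive slave in active-backup mode */
};

struct packet {
    struct device *dev;   /* models skb->dev */
};

/* Drop decision in the spirit of skb_bond(): needs the real input device. */
int should_drop(const struct device *input_dev)
{
    assert(input_dev != NULL);
    return input_dev->is_backup_slave;
}

/*
 * Accelerated VLAN receive in this model: decide while the slave is still
 * known, then let the VLAN lookup overwrite pkt->dev. Returns 1 if dropped.
 */
int vlan_accel_rx(struct packet *pkt, struct device *vlan_dev)
{
    assert(pkt != NULL && vlan_dev != NULL);

    if (should_drop(pkt->dev))
        return 1;
    pkt->dev = vlan_dev;  /* origin information is gone from here on */
    return 0;
}
```

A packet arriving on the active slave ends up with the VLAN device as its device; a packet arriving on the backup slave is dropped before that information is overwritten.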

I have tested the patch, and it works correctly in both cases on my test
system (where I join vlan subifs on a bonding device into a bridge and have
xen guests' vifX.Y interfaces connected to those bridges, which is a
configuration we imho really want to support). Without this patch, as
explained earlier in this thread, this config does not work.

Regards,

Christophe


[PATCH] SNMPv2 udpInDatagrams counter error

2006-07-31 Thread Wei Yongjun
When I send a UDP datagram with a checksum error to a target, I found that
under IPv6 the counter udpInErrors is incremented, but under IPv4 the
counter udpInDatagrams is incremented. I looked into the source code and
found that under IPv4, UDP datagrams with a checksum error are delivered to
the UDP receive queue, but under IPv6 they are not. IPv4 delivers the
datagram into the UDP receive queue, increments udpInDatagrams, and then
discards it before it is delivered to the UDP user. The RFC says
udpInDatagrams is the total number of UDP datagrams delivered to UDP users,
so udpInDatagrams should not be incremented when a UDP datagram with a
checksum error is received.

Refer to RFC2013:
  udpInDatagrams OBJECT-TYPE
  SYNTAX  Counter32
  MAX-ACCESS  read-only
  STATUS  current
  DESCRIPTION
  "The total number of UDP datagrams delivered to UDP
users."
  ::= { udp 1 }

Following is my patch:

--- a/net/ipv4/udp.c2006-07-31 09:33:45.392479344 -0400
+++ b/net/ipv4/udp.c2006-07-31 09:34:26.430240656 -0400
@@ -1018,7 +1018,7 @@ static int udp_queue_rcv_skb(struct sock
/* FALLTHROUGH -- it's a UDP Packet */
}
 
-   if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
+   if (skb->ip_summed != CHECKSUM_UNNECESSARY) {
if (__udp_checksum_complete(skb)) {
UDP_INC_STATS_BH(UDP_MIB_INERRORS);
kfree_skb(skb);

Signed-off-by: Wei Yongjun <[EMAIL PROTECTED]>

