Re: pktgen question
jamal wrote: On Sun, 2007-23-09 at 12:55 -0500, Steve Wise wrote: It's a hack that breaks cxgb3, because cxgb3 uses the skb->cb area for each skb passed down. So cxgb3 is at fault then? I.e., a driver cannot use the skb->cb field if the users count is 1? Or maybe a driver can _never_ use the cb field? Any layer can use the cb structure whichever way they wish. There are violations, e.g. the vlan code also uses the cb field to pass vlan details for hardware acceleration. How does pktgen affect it, though? clone() will just copy the cb field, and pktgen doesn't touch it. In retrospect, pktgen may have to use clone - cc'ing Robert Olsson. Pktgen abuses the ref count, as far as I can tell. It works with most ethernet drivers, but that multi-pkt mode will also fail with things like vlans, because skb->dev is changed as the skb is transmitted (to the lower-level device). I'd say just don't use multi-pkt with pktgen on devices that can't handle it. Thanks, Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHES] TX batching
jamal wrote: On Sun, 2007-23-09 at 12:36 -0700, Kok, Auke wrote: please be reminded that we're going to strip down e1000, and most of the features should go into e1000e, which has far fewer hardware workarounds. I'm still reluctant to put new stuff into e1000 - I really want to chop it down first ;) Sure - the question then is, will you take those changes if I use e1000e? There are a few cleanups that have nothing to do with batching; take a look at the modified e1000 in the git tree. That's bad to begin with :) - please send those separately so I can fasttrack them into e1000e and e1000 where applicable. But yes, I'm much more inclined to merge new features into e1000e than e1000. I intend to put multiqueue support into e1000e, as *all* of the hardware that it will support has multiple queues. Putting in any other performance feature like TX batching would absolutely be interesting. Auke
[PATCH] fix mace_handle_misc_intrs compilation
Fix compilation after incomplete struct net_device changes. Signed-off-by: Olaf Hering [EMAIL PROTECTED] --- drivers/net/mace.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) --- a/drivers/net/mace.c +++ b/drivers/net/mace.c @@ -633,7 +633,7 @@ static void mace_set_multicast(struct ne spin_unlock_irqrestore(&mp->lock, flags); } -static void mace_handle_misc_intrs(struct mace_data *mp, int intr) +static void mace_handle_misc_intrs(struct mace_data *mp, int intr, struct net_device *dev) { volatile struct mace __iomem *mb = mp->mace; static int mace_babbles, mace_jabbers; @@ -669,7 +669,7 @@ static irqreturn_t mace_interrupt(int ir spin_lock_irqsave(&mp->lock, flags); intr = in_8(&mb->ir); /* read interrupt register */ in_8(&mb->xmtrc); /* get retries */ -mace_handle_misc_intrs(mp, intr); +mace_handle_misc_intrs(mp, intr, dev); i = mp->tx_empty; while (in_8(&mb->pr) & XMTSV) { @@ -682,7 +682,7 @@ static irqreturn_t mace_interrupt(int ir */ intr = in_8(&mb->ir); if (intr != 0) - mace_handle_misc_intrs(mp, intr); + mace_handle_misc_intrs(mp, intr, dev); if (mp->tx_bad_runt) { fs = in_8(&mb->xmtfs); mp->tx_bad_runt = 0; @@ -817,7 +817,7 @@ static void mace_tx_timeout(unsigned lon goto out; /* update various counters */ -mace_handle_misc_intrs(mp, in_8(&mb->ir)); +mace_handle_misc_intrs(mp, in_8(&mb->ir), dev); cp = mp->tx_cmds + NCMDS_TX * mp->tx_empty;
Re: [PATCH] fix mace_handle_misc_intrs compilation
On Mon, 24 Sep 2007 10:15:01 +0200 Olaf Hering [EMAIL PROTECTED] wrote: Fix compilation after incomplete struct net_device changes. Yup, thanks; Kamalesh Babulal has already sent in an identical patch.
Re: change the way e1000 is handling short VLAN frames
Thanks for your replies. We contacted Arris (the manufacturer of our CMTS) about this issue, with links to the relevant parts of the specification about the minimum size of VLAN-tagged frames, and it seems they'll acknowledge the problem and fix it in the next firmware. In the meantime I tried the different suggestions posted here (the patch from Chris, replacing ETH_ZLEN with VLAN_ETH_ZLEN in e1000_main.c, and disabling HW VLAN tagging in netdev->features) and combinations of them, but none of it worked, and 64B VLAN frames are still generated by our Intel NIC. I'll now consider this issue closed on the e1000 and Intel NIC side, because there was no problem with them in the first place, and I'll instead concentrate on resolving it on the CMTS side, which doesn't act according to the specs. Cheers, Emil.
[PATCH] Clean up redundant PHY write line for ULi526x Ethernet driver
From: Roy Zang [EMAIL PROTECTED] Clean up redundant PHY write line for ULi526x Ethernet Driver. Signed-off-by: Roy Zang [EMAIL PROTECTED] --- drivers/net/tulip/uli526x.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/drivers/net/tulip/uli526x.c b/drivers/net/tulip/uli526x.c index ca2548e..53a8e65 100644 --- a/drivers/net/tulip/uli526x.c +++ b/drivers/net/tulip/uli526x.c @@ -1512,7 +1512,6 @@ static void uli526x_process_mode(struct uli526x_board_info *db) case ULI526X_100MFD: phy_reg = 0x2100; break; } phy_write(db->ioaddr, db->phy_addr, 0, phy_reg, db->chip_id); - phy_write(db->ioaddr, db->phy_addr, 0, phy_reg, db->chip_id); } } } -- 1.5.2
[RFC PATCH 3/3] [TCP] MIB: Count FRTO's successfully detected spurious RTOs
Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/snmp.h | 1 + net/ipv4/proc.c | 1 + net/ipv4/tcp_input.c | 1 + 3 files changed, 3 insertions(+), 0 deletions(-) diff --git a/include/linux/snmp.h b/include/linux/snmp.h index d8fd3ec..89f0c2b 100644 --- a/include/linux/snmp.h +++ b/include/linux/snmp.h @@ -213,6 +213,7 @@ enum LINUX_MIB_TCPSACKDISCARD, /* TCPSACKDiscard */ LINUX_MIB_TCPDSACKIGNOREDOLD, /* TCPSACKIgnoredOld */ LINUX_MIB_TCPDSACKIGNOREDNOUNDO,/* TCPSACKIgnoredNoUndo */ + LINUX_MIB_TCPSPURIOUSRTOS, /* TCPSpuriousRTOs */ __LINUX_MIB_MAX }; diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c index 2015148..9dee70e 100644 --- a/net/ipv4/proc.c +++ b/net/ipv4/proc.c @@ -245,6 +245,7 @@ static const struct snmp_mib snmp4_net_list[] = { SNMP_MIB_ITEM(TCPSACKDiscard, LINUX_MIB_TCPSACKDISCARD), SNMP_MIB_ITEM(TCPDSACKIgnoredOld, LINUX_MIB_TCPDSACKIGNOREDOLD), SNMP_MIB_ITEM(TCPDSACKIgnoredNoUndo, LINUX_MIB_TCPDSACKIGNOREDNOUNDO), + SNMP_MIB_ITEM(TCPSpuriousRTOs, LINUX_MIB_TCPSPURIOUSRTOS), SNMP_MIB_SENTINEL }; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index ae06b94..259f517 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2935,6 +2935,7 @@ static int tcp_process_frto(struct sock *sk, int flag) } tp->frto_counter = 0; tp->undo_marker = 0; + NET_INC_STATS_BH(LINUX_MIB_TCPSPURIOUSRTOS); } return 0; } -- 1.5.0.6
[RFC PATCH 2/3] [TCP]: Reordered ACK's (old) SACKs not included to discarded MIB
In case of ACK reordering, the SACK block might have been valid in its time, but it is already obsolete since we've received another kind of confirmation about arrival of the segments through snd_una advancement of an earlier packet. I didn't bother to distinguish valid from invalid SACK blocks, but simply made reordered SACK blocks that are too old never counted, regardless of their real validity, which could be determined by using the ack field of the reordered packet (won't be significant IMHO). DSACKs can very well be considered useful even in this situation, so won't do any of this for them. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c | 7 ++++++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 532288b..ae06b94 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1250,8 +1250,13 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDNOUNDO); else NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDOLD); - } else + } else { + /* Don't count olds caused by ACK reordering */ + if ((TCP_SKB_CB(ack_skb)->ack_seq != tp->snd_una) && + !after(end_seq, tp->snd_una)) + continue; NET_INC_STATS_BH(LINUX_MIB_TCPSACKDISCARD); + } continue; } -- 1.5.0.6
[PATCH 1/3] [TCP]: Re-place highest_sack check to a more robust position
I previously added the check at a position that is rather poor, as state has already been adjusted quite a bit. Re-placing it above all state changes should be more robust, though the return should never ever get executed regardless of its place :-). Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_output.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index cbb83ac..94c8011 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1718,6 +1718,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1); + if (WARN_ON(tp->sacked_out && + (TCP_SKB_CB(next_skb)->seq == tp->highest_sack))) + return; + /* Ok. We will be able to collapse the packet. */ tcp_unlink_write_queue(next_skb, sk); @@ -1734,10 +1738,6 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m /* Update sequence range on original skb. */ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq; - if (WARN_ON(tp->sacked_out && - (TCP_SKB_CB(next_skb)->seq == tp->highest_sack))) - return; - /* Merge over control information. */ flags |= TCP_SKB_CB(next_skb)->flags; /* This moves PSH/FIN etc. over */ TCP_SKB_CB(skb)->flags = flags; -- 1.5.0.6
[PATCH net-2.6.24 0/3]: TCP tweaks
Hi Dave, Here are some minor tweaks to recent net-2.6.24 changes, more or less mandatory. The first one is the most important, because if the not-supposed-to-happen case occurs, TCP would be left in an inconsistent state. As regards the second, I'm fine with either way. The third one's MIB for FRTO is sort of nice to have but definitely not mandatory. I'm leaving the decision about those two to you; if you're fine with them, feel free to apply :-). -- i.
Re: incorrect cksum with tcp/udp on lo with 2.6.20/2.6.21/2.6.22
On Mon, 24 Sep 2007, Herbert Xu wrote: On Sun, Sep 23, 2007 at 11:18:58PM +0200, Krzysztof Oledzki wrote: Thank you for the information. Is there any easy way to turn them on? I need it for LVS. Do you really need it? Yes. I would like to use an LVS redirector as both a client and a director: http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-DR.html#director_as_client_in_LVS-DR The packets should be checksummed at the point where they physically leave the host. So, with DR mode, the packet goes via the lo device (with a bad checksum) and then gets redirected outside. Unfortunately, when it leaves the host it still has a bad checksum. :( Best regards, Krzysztof Olędzki
[PATCH 3/5] [TCP]: Convert highest_sack to sk_buff to allow direct access
From: Ilpo Järvinen [EMAIL PROTECTED] It is going to replace the sack fastpath hint quite soon... :-) Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/tcp.h | 6 +- include/net/tcp.h | 13 + net/ipv4/tcp_input.c | 12 ++-- net/ipv4/tcp_output.c | 19 ++- 4 files changed, 33 insertions(+), 17 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index f8cf090..1d6be2a 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -332,8 +332,10 @@ struct tcp_sock { struct tcp_sack_block_wire recv_sack_cache[4]; - u32 highest_sack; /* Start seq of globally highest revd SACK * (validity guaranteed only if sacked_out > 0) */ + struct sk_buff *highest_sack; /* highest skb with SACK received * (validity guaranteed only if * sacked_out > 0) */ /* from STCP, retrans queue hinting */ struct sk_buff* lost_skb_hint; diff --git a/include/net/tcp.h b/include/net/tcp.h index 991ccdc..8bc64b7 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1308,6 +1308,19 @@ static inline int tcp_write_queue_empty(struct sock *sk) return skb_queue_empty(&sk->sk_write_queue); } +/* Start sequence of the highest skb with SACKed bit, valid only if + * sacked_out > 0 or when the caller has ensured validity by itself. + */ +static inline u32 tcp_highest_sack_seq(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + + if (WARN_ON(!tp->sacked_out && + tp->highest_sack != tcp_write_queue_head(sk))) + return tp->snd_una; + return TCP_SKB_CB(tp->highest_sack)->seq; +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 76e9c9b..85dd4b0 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1143,10 +1143,11 @@ static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, return dup_sack; } -static void tcp_sacktag_one(struct sk_buff *skb, struct tcp_sock *tp, +static void tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, struct tcp_sacktag_state *state, int in_sack, int dup_sack, int fack_count, u32 end_seq) { + struct tcp_sock *tp = tcp_sk(sk); u8 sacked = TCP_SKB_CB(skb)->sacked; /* Account D-SACK for retransmitted packet. */ @@ -1231,9 +1232,8 @@ static void tcp_sacktag_one(struct sk_buff *skb, struct tcp_sock *tp, if (fack_count > tp->fackets_out) tp->fackets_out = fack_count; - if (after(TCP_SKB_CB(skb)->seq, - tp->highest_sack)) - tp->highest_sack = TCP_SKB_CB(skb)->seq; + if (after(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(sk))) + tp->highest_sack = skb; } else { if (dup_sack && (sacked & TCPCB_RETRANS)) state->reord = min(fack_count, state->reord); @@ -1271,7 +1271,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ if (!tp->sacked_out) { tp->fackets_out = 0; - tp->highest_sack = tp->snd_una; + tp->highest_sack = tcp_write_queue_head(sk); } found_dup_sack = tcp_check_dsack(tp, ack_skb, sp, num_sacks, prior_snd_una); @@ -1424,7 +1424,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ fack_count += pcount; - tcp_sacktag_one(skb, tp, state, in_sack, + tcp_sacktag_one(skb, sk, state, in_sack, dup_sack, fack_count, end_seq); } } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 94c8011..fd51692 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -657,13 +657,15 @@ static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb, unsigned * tweak SACK fastpath hint too as it would overwrite all changes unless * hint is also changed. */ -static void tcp_adjust_fackets_out(struct tcp_sock *tp, struct sk_buff *skb, +static void tcp_adjust_fackets_out(struct sock *sk, struct sk_buff *skb, int decr) { + struct tcp_sock *tp = tcp_sk(sk); + if (!tp->sacked_out) return; - if (!before(tp->highest_sack, TCP_SKB_CB(skb)->seq)) + if (!before(tcp_highest_sack_seq(sk), TCP_SKB_CB(skb)->seq)) tp->fackets_out -= decr; /* cnt_hint is off-by-one compared with fackets_out (see sacktag) */ @@ -712,8 +714,8 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss TCP_SKB_CB(buff)->end_seq =
[RFC PATCH net-2.6.24 0/5]: TCP sacktag cache usage recoded
Hi all, After a couple of wrong-wayed before/after()s and one infinitely loopy version, here's the current trial version of a sacktag cache usage recode. The first two patches come from tcp-2.6 (rebased and rotated). This series applies cleanly only on top of the other three patch series I posted earlier today. The last debug patch provides some statistics for those interested enough. Dave, please DO NOT apply! ...Some thoughts would be nice though :-). It should improve processing of such likely events as cumulative ACKs and a new forward holed SACK considerably, because a full walk is not necessary anymore (the old code could have been tweaked to cover them, but it's better to drop each special-case handling altogether and do a generic solution). Redundancy of the fastpath hints and the highest_sack stuff is also addressed; however, it might have a slight effect, as the hint could occasionally point to something less than highest_sack - whether that's significant remains to be seen... In all cases except hint-below-highest_sack, the new solution should perform at least as well as the old code (with a bit larger constant, though no additional cache misses) because the SACK blocks the old code chose not to process should all fall into the LINUX_MIB_TCP_FULLSKIP category. It would be easy to improve further with the RB-tree stuff, though this version is based on code using linked lists. I'm not yet too sure that I got everything 100% correct, as I tweak start/end_seqs and exit skb loops in a way that is prone to off-by-one errors; I could miss an skb here and there. I'll probably also recode DSACK handling to avoid recursion. Stephen, Sangtae, others experiencing those unexpected RTOs during recovery of large-windowed TCP: could you please give it a try to see if it helps any... -- i. ps. Our net-2.6.24 (and mainline?) DSACK processing code could be broken, btw. A DSACK in the middle of another SACK block during an in-order walk seems to not be processed at all, as the earlier (sorted) block caused skb to advance past it already?
(Just occurred to me; I'll see what I can do about that if I have enough time.)
[PATCH 1/5] [TCP]: Create tcp_sacktag_state.
From: David S. Miller [EMAIL PROTECTED] It is difficult to break out the inner logic of tcp_sacktag_write_queue() into worker functions because so many local variables get updated in-place. Start to overcome this by creating a structure block of state variables that can be passed around into worker routines. [I made minor tweaks due to rebase/reordering of stuff in the tcp-2.6 tree, and dropped found_dup_sack and dup_sack from state -ij] Signed-off-by: David S. Miller [EMAIL PROTECTED] Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c | 89 ++--- 1 files changed, 47 insertions(+), 42 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 259f517..04ff465 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1102,6 +1102,13 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack, return !before(start_seq, end_seq - tp->max_window); } +struct tcp_sacktag_state { + unsigned int flag; + int reord; + int prior_fackets; + u32 lost_retrans; + int first_sack_index; +}; static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, struct tcp_sack_block_wire *sp, int num_sacks, @@ -1146,25 +1153,22 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ struct tcp_sack_block_wire *sp = (struct tcp_sack_block_wire *)(ptr+2); struct sk_buff *cached_skb; int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE) >> 3; - int reord = tp->packets_out; - int prior_fackets; - u32 lost_retrans = 0; - int flag = 0; - int found_dup_sack = 0; + struct tcp_sacktag_state state; + int found_dup_sack; int cached_fack_count; int i; - int first_sack_index; + int force_one_sack; + + state.flag = 0; if (!tp->sacked_out) { tp->fackets_out = 0; tp->highest_sack = tp->snd_una; } - prior_fackets = tp->fackets_out; - found_dup_sack = tcp_check_dsack(tp, ack_skb, sp, - num_sacks, prior_snd_una); + found_dup_sack = tcp_check_dsack(tp, ack_skb, sp, num_sacks, prior_snd_una); if (found_dup_sack) - flag |= FLAG_DSACKING_ACK; + state.flag |= FLAG_DSACKING_ACK; /* Eliminate too old ACKs, but take into * account more or less fresh ones, they can @@ -1177,18 +1181,18 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ * if the only SACK change is the increase of the end_seq of * the first block then only apply that SACK block * and use retrans queue hinting otherwise slowpath */ - flag = 1; + force_one_sack = 1; for (i = 0; i < num_sacks; i++) { __be32 start_seq = sp[i].start_seq; __be32 end_seq = sp[i].end_seq; if (i == 0) { if (tp->recv_sack_cache[i].start_seq != start_seq) - flag = 0; + force_one_sack = 0; } else { if ((tp->recv_sack_cache[i].start_seq != start_seq) || (tp->recv_sack_cache[i].end_seq != end_seq)) - flag = 0; + force_one_sack = 0; } tp->recv_sack_cache[i].start_seq = start_seq; tp->recv_sack_cache[i].end_seq = end_seq; @@ -1199,8 +1203,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ tp->recv_sack_cache[i].end_seq = 0; } - first_sack_index = 0; - if (flag) + state.first_sack_index = 0; + if (force_one_sack) num_sacks = 1; else { int j; @@ -1218,17 +1222,14 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ sp[j+1] = tmp; /* Track where the first SACK block goes to */ - if (j == first_sack_index) - first_sack_index = j+1; + if (j == state.first_sack_index) + state.first_sack_index = j+1; } } } } - /* clear flag as used for different purpose in following code */ - flag = 0; - /* Use SACK fastpath hint if valid */ cached_skb = tp->fastpath_skb_hint; cached_fack_count = tp->fastpath_cnt_hint; @@ -1237,12 +1238,16 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ cached_fack_count = 0; } + state.reord =
[PATCH 2/5] [TCP]: Create tcp_sacktag_one().
From: David S. Miller [EMAIL PROTECTED] Worker function that implements the main logic of the inner-most loop of tcp_sacktag_write_queue(). Signed-off-by: David S. Miller [EMAIL PROTECTED] Acked-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c | 213 ++ 1 files changed, 110 insertions(+), 103 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 04ff465..76e9c9b 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1143,6 +1143,114 @@ static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, return dup_sack; } +static void tcp_sacktag_one(struct sk_buff *skb, struct tcp_sock *tp, + struct tcp_sacktag_state *state, int in_sack, + int dup_sack, int fack_count, u32 end_seq) +{ + u8 sacked = TCP_SKB_CB(skb)->sacked; + + /* Account D-SACK for retransmitted packet. */ + if ((dup_sack && in_sack) && + (sacked & TCPCB_RETRANS) && + after(TCP_SKB_CB(skb)->end_seq, tp->undo_marker)) + tp->undo_retrans--; + + /* The frame is ACKed. */ + if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) { + if (sacked & TCPCB_RETRANS) { + if ((dup_sack && in_sack) && + (sacked & TCPCB_SACKED_ACKED)) + state->reord = min(fack_count, state->reord); + } else { + /* If it was in a hole, we detected reordering. */ + if (fack_count < state->prior_fackets && + !(sacked & TCPCB_SACKED_ACKED)) + state->reord = min(fack_count, state->reord); + } + + /* Nothing to do; acked frame is about to be dropped. */ + return; + } + + if ((sacked & TCPCB_SACKED_RETRANS) && + after(end_seq, TCP_SKB_CB(skb)->ack_seq) && + (!state->lost_retrans || after(end_seq, state->lost_retrans))) + state->lost_retrans = end_seq; + + if (!in_sack) + return; + + if (!(sacked & TCPCB_SACKED_ACKED)) { + if (sacked & TCPCB_SACKED_RETRANS) { + /* If the segment is not tagged as lost, +* we do not clear RETRANS, believing +* that retransmission is still in flight. +*/ + if (sacked & TCPCB_LOST) { + TCP_SKB_CB(skb)->sacked &= + ~(TCPCB_LOST|TCPCB_SACKED_RETRANS); + tp->lost_out -= tcp_skb_pcount(skb); + tp->retrans_out -= tcp_skb_pcount(skb); + + /* clear lost hint */ + tp->retransmit_skb_hint = NULL; + } + } else { + /* New sack for not retransmitted frame, +* which was in hole. It is reordering. +*/ + if (!(sacked & TCPCB_RETRANS) && + fack_count < state->prior_fackets) + state->reord = min(fack_count, state->reord); + + if (sacked & TCPCB_LOST) { + TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST; + tp->lost_out -= tcp_skb_pcount(skb); + + /* clear lost hint */ + tp->retransmit_skb_hint = NULL; + } + /* SACK enhanced F-RTO detection. +* Set flag if and only if non-rexmitted +* segments below frto_highmark are +* SACKed (RFC4138; Appendix B). +* Clearing correct due to in-order walk +*/ + if (after(end_seq, tp->frto_highmark)) { + state->flag &= ~FLAG_ONLY_ORIG_SACKED; + } else { + if (!(sacked & TCPCB_RETRANS)) + state->flag |= FLAG_ONLY_ORIG_SACKED; + } + } + + TCP_SKB_CB(skb)->sacked |= TCPCB_SACKED_ACKED; + state->flag |= FLAG_DATA_SACKED; + tp->sacked_out += tcp_skb_pcount(skb); + + if (fack_count > tp->fackets_out) + tp->fackets_out = fack_count; + + if (after(TCP_SKB_CB(skb)->seq, + tp->highest_sack)) + tp->highest_sack = TCP_SKB_CB(skb)->seq; + } else { + if (dup_sack && (sacked & TCPCB_RETRANS)) + state->reord = min(fack_count, state->reord); + } + + /* D-SACK. We can detect redundant retransmission +* in S|R and plain R frames and clear
[RFC PATCH 4/5] [TCP]: Rewrite sack_recv_cache (WIP)
From: Ilpo Järvinen [EMAIL PROTECTED] Previously a number of cases in TCP SACK processing failed to take advantage of the costly stored information in sack_recv_cache. Most importantly, expected events such as cumulative ACKs, new hole ACKs, and the first ACK after RTO fall into this category. Processing such ACKs results in rather long walks, building up latencies (which easily get nasty when the window is large) that are completely unnecessary; usually no new information was gathered except the new SACK block above the hole in the respective case. Since the inclusion of highest_sack, there's a lot of information that is very likely redundant (SACK fastpath hint stuff, fackets_out, highest_sack), though there's no ultimate guarantee that they'll remain the same the whole time (in all unearthly scenarios). Take advantage of this too and drop the fastpath hint. Effectively this drops the special-cased fastpath. This change adds some complexity to introduce a better-covered fastpath. The current ACK's SACK blocks are compared against each cached block individually, and only ranges that are new are then scanned by the high-constant walk. For other parts of the write queue, even in a previously known part of the SACK blocks, a faster skip function is used. In addition, whenever possible, TCP fast-forwards to the highest_sack skb that was made available earlier. In the typical case, nothing but this fast-forward and the mandatory markings after it occur, making the access pattern quite similar to the former fastpath. DSACKs are a special case that must always be walked.
Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/tcp.h | 4 +- include/net/tcp.h | 1 - net/ipv4/tcp_input.c | 320 ++-- net/ipv4/tcp_output.c | 12 +-- 4 files changed, 202 insertions(+), 135 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 1d6be2a..8d91eac 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -330,7 +330,7 @@ struct tcp_sock { struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */ struct tcp_sack_block selective_acks[4]; /* The SACKS themselves*/ - struct tcp_sack_block_wire recv_sack_cache[4]; + struct tcp_sack_block recv_sack_cache[4]; struct sk_buff *highest_sack; /* highest skb with SACK received * (validity guaranteed only if @@ -343,9 +343,7 @@ struct tcp_sock { struct sk_buff *scoreboard_skb_hint; struct sk_buff *retransmit_skb_hint; struct sk_buff *forward_skb_hint; - struct sk_buff *fastpath_skb_hint; - int fastpath_cnt_hint; int lost_cnt_hint; int retransmit_cnt_hint; diff --git a/include/net/tcp.h b/include/net/tcp.h index 8bc64b7..d5def9b 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1078,7 +1078,6 @@ static inline void tcp_clear_retrans_hints_partial(struct tcp_sock *tp) static inline void tcp_clear_all_retrans_hints(struct tcp_sock *tp) { tcp_clear_retrans_hints_partial(tp); - tp->fastpath_skb_hint = NULL; } /* MD5 Signature */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 85dd4b0..9dfdd67 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1106,11 +1106,15 @@ struct tcp_sacktag_state { unsigned int flag; int reord; int prior_fackets; + int fack_count; u32 lost_retrans; - int first_sack_index; + u32 dup_start; + u32 dup_end; }; -static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, +static int tcp_check_dsack(struct tcp_sock *tp, + struct tcp_sacktag_state *state, + struct sk_buff *ack_skb, struct tcp_sack_block_wire *sp, int num_sacks, u32 prior_snd_una) { @@ -1120,6 +1124,8 @@ static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, if (before(start_seq_0, TCP_SKB_CB(ack_skb)->ack_seq)) { dup_sack = 1; + state->dup_start = start_seq_0; + state->dup_end = end_seq_0; tcp_dsack_seen(tp); NET_INC_STATS_BH(LINUX_MIB_TCPDSACKRECV); } else if (num_sacks > 1) { @@ -1129,6 +1135,8 @@ static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, if (!after(end_seq_0, end_seq_1) && !before(start_seq_0, start_seq_1)) { dup_sack = 1; + state->dup_start = start_seq_1; + state->dup_end = end_seq_1; tcp_dsack_seen(tp); NET_INC_STATS_BH(LINUX_MIB_TCPDSACKOFORECV); } @@ -1251,6 +1259,104 @@ static void tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, } } +static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk, +
[DEVELOPER PATCH 5/5] [TCP]: Track sacktag
From: Ilpo Järvinen [EMAIL PROTECTED] Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/snmp.h | 20 +++ net/ipv4/proc.c | 20 +++ net/ipv4/tcp_input.c | 52 +++-- 3 files changed, 89 insertions(+), 3 deletions(-) diff --git a/include/linux/snmp.h b/include/linux/snmp.h index 89f0c2b..42b8c07 100644 --- a/include/linux/snmp.h +++ b/include/linux/snmp.h @@ -214,6 +214,26 @@ enum LINUX_MIB_TCPDSACKIGNOREDOLD, /* TCPSACKIgnoredOld */ LINUX_MIB_TCPDSACKIGNOREDNOUNDO,/* TCPSACKIgnoredNoUndo */ LINUX_MIB_TCPSPURIOUSRTOS, /* TCPSpuriousRTOs */ + LINUX_MIB_TCP_SACKTAG, + LINUX_MIB_TCP_SACK0, + LINUX_MIB_TCP_SACK1, + LINUX_MIB_TCP_SACK2, + LINUX_MIB_TCP_SACK3, + LINUX_MIB_TCP_SACK4, + LINUX_MIB_TCP_WALKEDSKBS, + LINUX_MIB_TCP_WALKEDDSACKS, + LINUX_MIB_TCP_SKIPPEDSKBS, + LINUX_MIB_TCP_NOCACHE, + LINUX_MIB_TCP_FULLWALK, + LINUX_MIB_TCP_HEADWALK, + LINUX_MIB_TCP_TAILWALK, + LINUX_MIB_TCP_FULLSKIP, + LINUX_MIB_TCP_TAILSKIP, + LINUX_MIB_TCP_HEADSKIP, + LINUX_MIB_TCP_FULLSKIP_TOHIGH, + LINUX_MIB_TCP_TAILSKIP_TOHIGH, + LINUX_MIB_TCP_NEWSKIP, + LINUX_MIB_TCP_CACHEREMAINING, __LINUX_MIB_MAX }; diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c index 9dee70e..4808f82 100644 --- a/net/ipv4/proc.c +++ b/net/ipv4/proc.c @@ -246,6 +246,26 @@ static const struct snmp_mib snmp4_net_list[] = { SNMP_MIB_ITEM(TCPDSACKIgnoredOld, LINUX_MIB_TCPDSACKIGNOREDOLD), SNMP_MIB_ITEM(TCPDSACKIgnoredNoUndo, LINUX_MIB_TCPDSACKIGNOREDNOUNDO), SNMP_MIB_ITEM(TCPSpuriousRTOs, LINUX_MIB_TCPSPURIOUSRTOS), + SNMP_MIB_ITEM(TCP_SACKTAG, LINUX_MIB_TCP_SACKTAG), + SNMP_MIB_ITEM(TCP_SACK0, LINUX_MIB_TCP_SACK0), + SNMP_MIB_ITEM(TCP_SACK1, LINUX_MIB_TCP_SACK1), + SNMP_MIB_ITEM(TCP_SACK2, LINUX_MIB_TCP_SACK2), + SNMP_MIB_ITEM(TCP_SACK3, LINUX_MIB_TCP_SACK3), + SNMP_MIB_ITEM(TCP_SACK4, LINUX_MIB_TCP_SACK4), + SNMP_MIB_ITEM(TCP_WALKEDSKBS, LINUX_MIB_TCP_WALKEDSKBS), + SNMP_MIB_ITEM(TCP_WALKEDDSACKS, LINUX_MIB_TCP_WALKEDDSACKS), + SNMP_MIB_ITEM(TCP_SKIPPEDSKBS, LINUX_MIB_TCP_SKIPPEDSKBS), + SNMP_MIB_ITEM(TCP_NOCACHE, LINUX_MIB_TCP_NOCACHE), + SNMP_MIB_ITEM(TCP_FULLWALK, LINUX_MIB_TCP_FULLWALK), + SNMP_MIB_ITEM(TCP_HEADWALK, LINUX_MIB_TCP_HEADWALK), + SNMP_MIB_ITEM(TCP_TAILWALK, LINUX_MIB_TCP_TAILWALK), + SNMP_MIB_ITEM(TCP_FULLSKIP, LINUX_MIB_TCP_FULLSKIP), + SNMP_MIB_ITEM(TCP_TAILSKIP, LINUX_MIB_TCP_TAILSKIP), + SNMP_MIB_ITEM(TCP_HEADSKIP, LINUX_MIB_TCP_HEADSKIP), + SNMP_MIB_ITEM(TCP_FULLSKIP_TOHIGH, LINUX_MIB_TCP_FULLSKIP_TOHIGH), + SNMP_MIB_ITEM(TCP_TAILSKIP_TOHIGH, LINUX_MIB_TCP_TAILSKIP_TOHIGH), + SNMP_MIB_ITEM(TCP_NEWSKIP, LINUX_MIB_TCP_NEWSKIP), + SNMP_MIB_ITEM(TCP_CACHEREMAINING, LINUX_MIB_TCP_CACHEREMAINING), SNMP_MIB_SENTINEL }; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9dfdd67..38045c8 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1306,6 +1306,10 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk, tcp_sacktag_one(skb, sk, state, in_sack, dup_sack, state->fack_count, end_seq); + + NET_INC_STATS_BH(LINUX_MIB_TCP_WALKEDSKBS); + if (dup_sack) + NET_INC_STATS_BH(LINUX_MIB_TCP_WALKEDDSACKS); } return skb; } @@ -1329,6 +1333,7 @@ static struct sk_buff *tcp_sacktag_skip(struct sk_buff *skb, struct sock *sk, skb = tcp_sacktag_walk(skb, sk, state, state->dup_start, state->dup_end, 1); } + NET_INC_STATS_BH(LINUX_MIB_TCP_SKIPPEDSKBS); } return skb; } @@ -1458,9 +1463,22 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ cache++; } + NET_INC_STATS_BH(LINUX_MIB_TCP_SACKTAG); + switch (used_sacks) { + case 0: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK0); break; + case 1: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK1); break; + case 2: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK2); break; + case 3: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK3); break; + case 4: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK4); break; + } + + if (!tcp_sack_cache_ok(tp, cache)) + NET_INC_STATS_BH(LINUX_MIB_TCP_NOCACHE); + while (i < used_sacks) { u32 start_seq = sp[i].start_seq; u32 end_seq = sp[i].end_seq; + int fullwalk
= 0; /* Event B in the comment above. */ if (after(end_seq, tp-high_seq)) @@ -1473,41 +1491,69 @@
Re: 2.6.23-rc6-mm1: Build failures on ppc64_defconfig
On (22/09/07 12:24), Satyam Sharma didst pronounce: On Thu, 20 Sep 2007, Satyam Sharma wrote: BTW ppc64_defconfig didn't quite like 2.6.23-rc6-mm1 either ... IIRC I got build failures in: drivers/net/spider_net.c [PATCH -mm] spider_net: Misc build fixes after recent netdev stats changes Unbreak the following: drivers/net/spider_net.c: In function 'spider_net_release_tx_chain': drivers/net/spider_net.c:818: error: 'dev' undeclared (first use in this function) drivers/net/spider_net.c:818: error: (Each undeclared identifier is reported only once drivers/net/spider_net.c:818: error: for each function it appears in.) drivers/net/spider_net.c: In function 'spider_net_xmit': drivers/net/spider_net.c:922: error: 'dev' undeclared (first use in this function) drivers/net/spider_net.c: In function 'spider_net_pass_skb_up': drivers/net/spider_net.c:1018: error: 'dev' undeclared (first use in this function) drivers/net/spider_net.c: In function 'spider_net_decode_one_descr': drivers/net/spider_net.c:1215: error: 'dev' undeclared (first use in this function) make[2]: *** [drivers/net/spider_net.o] Error 1 Signed-off-by: Satyam Sharma [EMAIL PROTECTED] I've confirmed that this patch fixes the build error in question. 
Acked-by: Mel Gorman [EMAIL PROTECTED] --- drivers/net/spider_net.c | 24 +++- 1 file changed, 11 insertions(+), 13 deletions(-) diff -ruNp a/drivers/net/spider_net.c b/drivers/net/spider_net.c --- a/drivers/net/spider_net.c2007-09-22 06:26:39.0 +0530 +++ b/drivers/net/spider_net.c2007-09-22 12:12:23.0 +0530 @@ -795,6 +795,7 @@ spider_net_set_low_watermark(struct spid static int spider_net_release_tx_chain(struct spider_net_card *card, int brutal) { + struct net_device *dev = card-netdev; struct spider_net_descr_chain *chain = card-tx_chain; struct spider_net_descr *descr; struct spider_net_hw_descr *hwdescr; @@ -919,7 +920,7 @@ spider_net_xmit(struct sk_buff *skb, str spider_net_release_tx_chain(card, 0); if (spider_net_prepare_tx_descr(card, skb) != 0) { - dev-stats.tx_dropped++; + netdev-stats.tx_dropped++; netif_stop_queue(netdev); return NETDEV_TX_BUSY; } @@ -979,16 +980,12 @@ static void spider_net_pass_skb_up(struct spider_net_descr *descr, struct spider_net_card *card) { - struct spider_net_hw_descr *hwdescr= descr-hwdescr; - struct sk_buff *skb; - struct net_device *netdev; - u32 data_status, data_error; - - data_status = hwdescr-data_status; - data_error = hwdescr-data_error; - netdev = card-netdev; + struct spider_net_hw_descr *hwdescr = descr-hwdescr; + struct sk_buff *skb = descr-skb; + struct net_device *netdev = card-netdev; + u32 data_status = hwdescr-data_status; + u32 data_error = hwdescr-data_error; - skb = descr-skb; skb_put(skb, hwdescr-valid_size); /* the card seems to add 2 bytes of junk in front @@ -1015,8 +1012,8 @@ spider_net_pass_skb_up(struct spider_net } /* update netdevice statistics */ - dev-stats.rx_packets++; - dev-stats.rx_bytes += skb-len; + netdev-stats.rx_packets++; + netdev-stats.rx_bytes += skb-len; /* pass skb up to stack */ netif_receive_skb(skb); @@ -1184,6 +1181,7 @@ static int spider_net_resync_tail_ptr(st static int spider_net_decode_one_descr(struct spider_net_card *card) { + struct net_device *dev = card-netdev; struct 
spider_net_descr_chain *chain = card-rx_chain; struct spider_net_descr *descr = chain-tail; struct spider_net_hw_descr *hwdescr = descr-hwdescr; @@ -1210,7 +1208,7 @@ spider_net_decode_one_descr(struct spide (status == SPIDER_NET_DESCR_PROTECTION_ERROR) || (status == SPIDER_NET_DESCR_FORCE_END) ) { if (netif_msg_rx_err(card)) - dev_err(card-netdev-dev, + dev_err(dev-dev, dropping RX descriptor with state %d\n, status); dev-stats.rx_dropped++; goto bad_desc; -- -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pktgen question
I think pktgen should be cloning the skbs using skb_clone(). Then it will work for all devices, eh?

Ben Greear wrote: Pktgen abuses the ref count, as far as I can tell. It works with most ethernet drivers, but that multi-pkt will also fail with things like VLANs, because the skb->dev is changed as it is transmitted (to the lower-level device). I'd say just don't use the multi-pkt with pktgen on devices that can't handle it.
Re: [ofa-general] Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister
Roland Dreier wrote: The action in bonding on a detach of a slave is to unregister the master (see patch 10). This can't be done from the context of unregister_netdevice itself (it is protected by rtnl_lock). I'm confused. Your patch has:

+	ipoib_slave_detach(cpriv->dev);
 	unregister_netdev(cpriv->dev);

And ipoib_slave_detach() is:

+static inline void ipoib_slave_detach(struct net_device *dev)
+{
+	rtnl_lock();
+	netdev_slave_detach(dev);
+	rtnl_unlock();
+}

so you are calling netdev_slave_detach() with the rtnl lock held. Why can't you make the same call from the start of unregister_netdevice()? Anyway, if the rtnl lock is a problem, can you just add the call to netdev_slave_detach() to unregister_netdev() before it takes the rtnl lock? - R.

Your comment made me do a little rethinking. In bonding, the device is released by calling unregister_netdevice(), which doesn't take the rtnl_lock (unlike unregister_netdev(), which does). I guess that this confused me into thinking that this is not possible. So I guess I could put the detach notification in unregister_netdev() and the reaction to the notification in the bonding driver would not block. However, I looked one more time at the code of unregister_netdevice() and found out that nothing prevents calling unregister_netdevice() again when the NETDEV_GOING_DOWN notification is sent. I tried that and it works. I have a new set of patches that doesn't send a slave detach and I will send it soon. Thanks for the comment, Roland. It makes this patch simpler. I'd also like to give credit to Jay for the idea of using the NETDEV_GOING_DOWN notification instead of NETDEV_CHANGE+IFF_SLAVE_DETACH. He suggested it a while ago but I wrongly thought that it wouldn't work.
Re: pktgen question
Steve Wise wrote: I think pktgen should be cloning the skbs using skb_clone(). Then it will work for all devices, eh?

That might work, but it would decrease performance slightly (or, increase CPU load at least). Maybe a new option: multi_clone

Ben
Re: [PATCH] sb1250-mac: Driver model phylib update
On Fri, 21 Sep 2007, Andrew Morton wrote: A driver model and phylib update. akpm:/usr/src/25 diffstat patches/git-net.patch | tail -n 1 1013 files changed, 187667 insertions(+), 23587 deletions(-) Sorry, but raising networking patches against Linus's crufty old mainline tree just isn't viable at present. Well, this is against Jeff's netdev-2.6 tree, which hopefully is not as crufty as Linus's old mainline; if it is not possible to queue this change for 2.6.25 or suchlike, then I will try to resubmit later. Thanks for your attention though. Maciej
Re: pktgen question
Ben Greear wrote: Steve Wise wrote: I think pktgen should be cloning the skbs using skb_clone(). Then it will work for all devices, eh? That might work, but it would decrease performance slightly (or, increase CPU load at least).

Perf-wise, you could clone the skbs up front, then deliver them to the NIC in a tight loop. This would mitigate the added overhead introduced by calling skb_clone() in the loop doing transmits... Maybe a new option: multi_clone

If the current code is busted, I think it should be fixed. My 2 cents.

Steve.
[PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
This patch series is the sixth version (see the link to V5 below) of the suggested changes to the bonding driver so it would be able to support non-ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode. Patches 1-8 were originally submitted in V5 and patch 9 is an addition by Jay. Major changes from the previous version: 1. Remove the patches to net/core. Bonding will use the NETDEV_GOING_DOWN notification instead of NETDEV_CHANGE+IFF_SLAVE_DETACH. This reduces the number of patches from 11 to 9. Links to earlier discussion: 1. A discussion in netdev about bonding support for IPoIB. http://lists.openwall.net/netdev/2006/11/30/46 2. V5 series http://lists.openfabrics.org/pipermail/general/2007-September/040996.html
[PATCH V6 1/9] IB/ipoib: Bound the net device to the ipoib_neigh structure
IPoIB uses a two layer neighboring scheme, such that for each struct neighbour whose device is an ipoib one, there is a struct ipoib_neigh buddy which is created on demand at the tx flow by an ipoib_neigh_alloc(skb-dst-neighbour) call. When using the bonding driver, neighbours are created by the net stack on behalf of the bonding (master) device. On the tx flow the bonding code gets an skb such that skb-dev points to the master device, it changes this skb to point on the slave device and calls the slave hard_start_xmit function. Under this scheme, ipoib_neigh_destructor assumption that for each struct neighbour it gets, n-dev is an ipoib device and hence netdev_priv(n-dev) can be casted to struct ipoib_dev_priv is buggy. To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one, when n-dev-flags has the IFF_MASTER bit set. Signed-off-by: Moni Shoua monis at voltaire.com Signed-off-by: Or Gerlitz ogerlitz at voltaire.com --- drivers/infiniband/ulp/ipoib/ipoib.h |4 +++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++- drivers/infiniband/ulp/ipoib/ipoib_multicast.c |3 ++- 3 files changed, 20 insertions(+), 11 deletions(-) Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h === --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h2007-09-18 17:09:26.534874404 +0200 @@ -328,6 +328,7 @@ struct ipoib_neigh { struct sk_buff_head queue; struct neighbour *neighbour; + struct net_device *dev; struct list_headlist; }; @@ -344,7 +345,8 @@ static inline struct ipoib_neigh **to_ip INFINIBAND_ALEN, sizeof(void *)); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh, + struct net_device *dev); void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh); extern struct workqueue_struct *ipoib_workqueue; Index: 
net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c === --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:23:54.725744661 +0200 @@ -511,7 +511,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = ipoib_neigh_alloc(skb-dst-neighbour); + neigh = ipoib_neigh_alloc(skb-dst-neighbour, skb-dev); if (!neigh) { ++priv-stats.tx_dropped; dev_kfree_skb_any(skb); @@ -830,6 +830,13 @@ static void ipoib_neigh_cleanup(struct n unsigned long flags; struct ipoib_ah *ah = NULL; + neigh = *to_ipoib_neigh(n); + if (neigh) { + priv = netdev_priv(neigh-dev); + ipoib_dbg(priv, neigh_destructor for bonding device: %s\n, + n-dev-name); + } else + return; ipoib_dbg(priv, neigh_cleanup for %06x IPOIB_GID_FMT \n, IPOIB_QPN(n-ha), @@ -837,13 +844,10 @@ static void ipoib_neigh_cleanup(struct n spin_lock_irqsave(priv-lock, flags); - neigh = *to_ipoib_neigh(n); - if (neigh) { - if (neigh-ah) - ah = neigh-ah; - list_del(neigh-list); - ipoib_neigh_free(n-dev, neigh); - } + if (neigh-ah) + ah = neigh-ah; + list_del(neigh-list); + ipoib_neigh_free(n-dev, neigh); spin_unlock_irqrestore(priv-lock, flags); @@ -851,7 +855,8 @@ static void ipoib_neigh_cleanup(struct n ipoib_put_ah(ah); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour, + struct net_device *dev) { struct ipoib_neigh *neigh; @@ -860,6 +865,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(st return NULL; neigh-neighbour = neighbour; + neigh-dev = dev; *to_ipoib_neigh(neighbour) = neigh; skb_queue_head_init(neigh-queue); ipoib_cm_set(neigh, NULL); Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c === --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-18 17:09:26.536874045 
+0200 @@ -727,7 +727,8 @@ out: if (skb-dst skb-dst-neighbour !*to_ipoib_neigh(skb-dst-neighbour)) { -
[PATCH V6 2/9] IB/ipoib: Verify address handle validity on send
When the bonding device senses a carrier loss of its active slave it replaces that slave with a new one. In between the time when the carrier of an IPoIB device goes down and its ipoib_neigh is destroyed, it is possible that the bonding driver will send a packet on a new slave that uses an old ipoib_neigh. This patch detects and prevents this from happening. Signed-off-by: Moni Shoua monis at voltaire.com Signed-off-by: Or Gerlitz ogerlitz at voltaire.com --- drivers/infiniband/ulp/ipoib/ipoib_main.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-)

Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-18 17:09:26.535874225 +0200
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-18 17:10:22.375853147 +0200
@@ -686,9 +686,10 @@ static int ipoib_start_xmit(struct sk_bu
 			goto out;
 		}
 	} else if (neigh->ah) {
-		if (unlikely(memcmp(&neigh->dgid.raw,
+		if (unlikely((memcmp(&neigh->dgid.raw,
 				    skb->dst->neighbour->ha + 4,
-				    sizeof(union ib_gid)))) {
+				    sizeof(union ib_gid))) ||
+			     (neigh->dev != dev))) {
 			spin_lock(&priv->lock);
 			/*
 			 * It's safe to call ipoib_put_ah() inside
[PATCH V6 3/9] net/bonding: Enable bonding to enslave non ARPHRD_ETHER
This patch changes some of the bond netdevice attributes and functions to be that of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type. Basically it overrides those setting done by ether_setup(), which are netdevice **type** dependent and hence might be not appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves from dissimilar ether types, as was concluded over the v1 discussion. IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3 bytes IB QP (Queue Pair) number and 16 bytes IB port GID (Global ID) of the port this IPoIB device is bounded to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (i have omitted here some details which are not important for the bonding RFC). Signed-off-by: Moni Shoua monis at voltaire.com Signed-off-by: Or Gerlitz ogerlitz at voltaire.com --- drivers/net/bonding/bond_main.c | 39 +++ 1 files changed, 39 insertions(+) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c2007-08-15 10:08:59.0 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:54:13.424688411 +0300 @@ -1237,6 +1237,26 @@ static int bond_compute_features(struct return 0; } + +static void bond_setup_by_slave(struct net_device *bond_dev, + struct net_device *slave_dev) +{ + bond_dev-hard_header = slave_dev-hard_header; + bond_dev-rebuild_header= slave_dev-rebuild_header; + bond_dev-hard_header_cache = slave_dev-hard_header_cache; + bond_dev-header_cache_update = slave_dev-header_cache_update; + bond_dev-hard_header_parse = slave_dev-hard_header_parse; + + bond_dev-neigh_setup = slave_dev-neigh_setup; + + bond_dev-type = slave_dev-type; + bond_dev-hard_header_len = slave_dev-hard_header_len; + bond_dev-addr_len = slave_dev-addr_len; + + memcpy(bond_dev-broadcast, slave_dev-broadcast, + slave_dev-addr_len); +} + /* enslave device slave to bond device master */ int bond_enslave(struct 
net_device *bond_dev, struct net_device *slave_dev) { @@ -1311,6 +1331,25 @@ int bond_enslave(struct net_device *bond goto err_undo_flags; } + /* set bonding device ether type by slave - bonding netdevices are +* created with ether_setup, so when the slave type is not ARPHRD_ETHER +* there is a need to override some of the type dependent attribs/funcs. +* +* bond ether type mutual exclusion - don't allow slaves of dissimilar +* ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond +*/ + if (bond-slave_cnt == 0) { + if (slave_dev-type != ARPHRD_ETHER) + bond_setup_by_slave(bond_dev, slave_dev); + } else if (bond_dev-type != slave_dev-type) { + printk(KERN_ERR DRV_NAME : %s ether type (%d) is different + from other slaves (%d), can not enslave it.\n, + slave_dev-name, + slave_dev-type, bond_dev-type); + res = -EINVAL; + goto err_undo_flags; + } + if (slave_dev-set_mac_address == NULL) { printk(KERN_ERR DRV_NAME : %s: Error: The slave device you specified does - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V6 4/9] net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address()
This patch allows for enslaving netdevices which do not support the set_mac_address() function. In that case the bond mac address is the one of the active slave, where remote peers are notified on the mac address (neighbour) change by Gratuitous ARP sent by bonding when fail-over occurs (this is already done by the bonding code). Signed-off-by: Moni Shoua monis at voltaire.com Signed-off-by: Or Gerlitz ogerlitz at voltaire.com --- drivers/net/bonding/bond_main.c | 87 +++- drivers/net/bonding/bonding.h |1 2 files changed, 60 insertions(+), 28 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c2007-08-15 10:54:13.0 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.971632881 +0300 @@ -1095,6 +1095,14 @@ void bond_change_active_slave(struct bon if (new_active) { bond_set_slave_active_flags(new_active); } + + /* when bonding does not set the slave MAC address, the bond MAC +* address is the one of the active slave. +*/ + if (new_active !bond-do_set_mac_addr) + memcpy(bond-dev-dev_addr, new_active-dev-dev_addr, + new_active-dev-addr_len); + bond_send_gratuitous_arp(bond); } } @@ -1351,13 +1359,22 @@ int bond_enslave(struct net_device *bond } if (slave_dev-set_mac_address == NULL) { - printk(KERN_ERR DRV_NAME - : %s: Error: The slave device you specified does - not support setting the MAC address. - Your kernel likely does not support slave - devices.\n, bond_dev-name); - res = -EOPNOTSUPP; - goto err_undo_flags; + if (bond-slave_cnt == 0) { + printk(KERN_WARNING DRV_NAME + : %s: Warning: The first slave device you + specified does not support setting the MAC + address. This bond MAC address would be that + of the active slave.\n, bond_dev-name); + bond-do_set_mac_addr = 0; + } else if (bond-do_set_mac_addr) { + printk(KERN_ERR DRV_NAME + : %s: Error: The slave device you specified + does not support setting the MAC addres,. + but this bond uses this practice. 
\n + , bond_dev-name); + res = -EOPNOTSUPP; + goto err_undo_flags; + } } new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL); @@ -1378,16 +1395,18 @@ int bond_enslave(struct net_device *bond */ memcpy(new_slave-perm_hwaddr, slave_dev-dev_addr, ETH_ALEN); - /* -* Set slave to master's mac address. The application already -* set the master's mac address to that of the first slave -*/ - memcpy(addr.sa_data, bond_dev-dev_addr, bond_dev-addr_len); - addr.sa_family = slave_dev-type; - res = dev_set_mac_address(slave_dev, addr); - if (res) { - dprintk(Error %d calling set_mac_address\n, res); - goto err_free; + if (bond-do_set_mac_addr) { + /* +* Set slave to master's mac address. The application already +* set the master's mac address to that of the first slave +*/ + memcpy(addr.sa_data, bond_dev-dev_addr, bond_dev-addr_len); + addr.sa_family = slave_dev-type; + res = dev_set_mac_address(slave_dev, addr); + if (res) { + dprintk(Error %d calling set_mac_address\n, res); + goto err_free; + } } res = netdev_set_master(slave_dev, bond_dev); @@ -1612,9 +1631,11 @@ err_close: dev_close(slave_dev); err_restore_mac: - memcpy(addr.sa_data, new_slave-perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev-type; - dev_set_mac_address(slave_dev, addr); + if (bond-do_set_mac_addr) { + memcpy(addr.sa_data, new_slave-perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev-type; + dev_set_mac_address(slave_dev, addr); + } err_free: kfree(new_slave); @@ -1792,10 +1813,12 @@ int bond_release(struct net_device *bond /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original (permanent) mac address */ - memcpy(addr.sa_data, slave-perm_hwaddr, ETH_ALEN); -
Re: pktgen question
Steve Wise wrote: Perf-wise, you could clone the skbs up front, then deliver them to the NIC in a tight loop. This would mitigate the added overhead introduced by calling skb_clone() in the loop doing transmits...

That only works if you are sending a small number of skbs. You can't pre-clone several minutes' worth of 10GbE traffic with any normal amount of RAM.

Steve Wise wrote: Maybe a new option: multi_clone If the current code is busted, I think it should be fixed.

Well, it works fine when used correctly :)

Ben
[PATCH V6 5/9] net/bonding: Enable IP multicast for bonding IPoIB devices
Allow to enslave devices when the bonding device is not up. Over the discussion held at the previous post this seemed to be the most clean way to go, where it is not expected to cause instabilities. Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() have set the bonding device type to be ARPHRD_ETHER and address len to be ETHER_ALEN, the net core code computes a wrong multicast link address. This is b/c ip_eth_mc_map() is called where for multicast joins taking place after the enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND) Signed-off-by: Moni Shoua monis at voltaire.com Signed-off-by: Or Gerlitz ogerlitz at voltaire.com --- drivers/net/bonding/bond_main.c |5 +++-- drivers/net/bonding/bond_sysfs.c |6 ++ 2 files changed, 5 insertions(+), 6 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c2007-08-15 10:54:41.0 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:55:48.431862446 +0300 @@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond /* bond must be initialized by bond_open() before enslaving */ if (!(bond_dev-flags IFF_UP)) { - dprintk(Error, master_dev is not up\n); - return -EPERM; + printk(KERN_WARNING DRV_NAME +%s: master_dev is not up in bond_enslave\n, + bond_dev-name); } /* already enslaved */ Index: net-2.6/drivers/net/bonding/bond_sysfs.c === --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:08:58.0 +0300 +++ net-2.6/drivers/net/bonding/bond_sysfs.c2007-08-15 10:55:48.432862269 +0300 @@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru /* Quick sanity check -- is the bond interface up? 
*/ if (!(bond-dev-flags IFF_UP)) { - printk(KERN_ERR DRV_NAME - : %s: Unable to update slaves because interface is down.\n, + printk(KERN_WARNING DRV_NAME + : %s: doing slave updates when interface is down.\n, bond-dev-name); - ret = -EPERM; - goto out; } /* Note: We can't hold bond-lock here, as bond_create grabs it. */ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V6 6/9] net/bonding: Handle wrong assumptions that slave is always an Ethernet device
bonding sometimes uses Ethernet constants (such as MTU and address length) which are not good when it enslaves non Ethernet devices (such as InfiniBand). Signed-off-by: Moni Shoua monis at voltaire.com --- drivers/net/bonding/bond_main.c |3 ++- drivers/net/bonding/bond_sysfs.c | 10 -- drivers/net/bonding/bonding.h|1 + 3 files changed, 11 insertions(+), 3 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c2007-09-24 12:52:33.0 +0200 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-09-24 12:57:33.411459811 +0200 @@ -1224,7 +1224,8 @@ static int bond_compute_features(struct struct slave *slave; struct net_device *bond_dev = bond-dev; unsigned long features = bond_dev-features; - unsigned short max_hard_header_len = ETH_HLEN; + unsigned short max_hard_header_len = max((u16)ETH_HLEN, + bond_dev-hard_header_len); int i; features = ~(NETIF_F_ALL_CSUM | BOND_VLAN_FEATURES); Index: net-2.6/drivers/net/bonding/bond_sysfs.c === --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-09-24 12:55:09.0 +0200 +++ net-2.6/drivers/net/bonding/bond_sysfs.c2007-09-24 13:00:23.752680721 +0200 @@ -260,6 +260,7 @@ static ssize_t bonding_store_slaves(stru char command[IFNAMSIZ + 1] = { 0, }; char *ifname; int i, res, found, ret = count; + u32 original_mtu; struct slave *slave; struct net_device *dev = NULL; struct bonding *bond = to_bond(d); @@ -325,6 +326,7 @@ static ssize_t bonding_store_slaves(stru } /* Set the slave's MTU to match the bond */ + original_mtu = dev-mtu; if (dev-mtu != bond-dev-mtu) { if (dev-change_mtu) { res = dev-change_mtu(dev, @@ -339,6 +341,9 @@ static ssize_t bonding_store_slaves(stru } rtnl_lock(); res = bond_enslave(bond-dev, dev); + bond_for_each_slave(bond, slave, i) + if (strnicmp(slave-dev-name, ifname, IFNAMSIZ) == 0) + slave-original_mtu = original_mtu; rtnl_unlock(); if (res) { ret = res; @@ -351,6 +356,7 @@ static ssize_t bonding_store_slaves(stru bond_for_each_slave(bond, slave, i) if 
(strnicmp(slave-dev-name, ifname, IFNAMSIZ) == 0) { dev = slave-dev; + original_mtu = slave-original_mtu; break; } if (dev) { @@ -365,9 +371,9 @@ static ssize_t bonding_store_slaves(stru } /* set the slave MTU to the default */ if (dev-change_mtu) { - dev-change_mtu(dev, 1500); + dev-change_mtu(dev, original_mtu); } else { - dev-mtu = 1500; + dev-mtu = original_mtu; } } else { Index: net-2.6/drivers/net/bonding/bonding.h === --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-09-24 12:55:09.0 +0200 +++ net-2.6/drivers/net/bonding/bonding.h 2007-09-24 12:57:33.412459636 +0200 @@ -156,6 +156,7 @@ struct slave { s8 link;/* one of BOND_LINK_ */ s8 state; /* one of BOND_STATE_ */ u32original_flags; + u32original_mtu; u32link_failure_count; u16speed; u8 duplex; - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pktgen question
Hi, Steve Wise writes: I think pktgen should be cloning the skbs using skb_clone(). Then it will work for all devices, eh?

For fastpath sending, pktgen assumes exclusive ownership of the skb, and does an skb_get to avoid final skb destruction so the same skb can be sent over and over. The idea is to avoid memory allocation and keep things in cache to give very high packet rates with identical packets. But if you need to alter the packet then the skb_get trick can't be done, and you have to turn off the fastpath with clone_skb.

Perf-wise, you could clone the skbs up front, then deliver them to the NIC in a tight loop. This would mitigate the added overhead introduced by calling skb_clone() in the loop doing transmits...

Sure, it can be done. It could replay sequences etc., but it will not beat the skb_get trick for sending identical packets. It has been proposed before, but I've avoided such efforts to keep things relatively small and simple. Really, pktgen should be reworked to have a small shim in the kernel and move the rest of the stuff to userland.

Cheers. --ro
[PATCH V6 7/9] net/bonding: Delay sending of gratuitous ARP to avoid failure
Delay sending a gratuitous ARP when the LINK_STATE_LINKWATCH_PENDING bit in the dev->state field is on. This improves the chances for the ARP packet to be transmitted. Signed-off-by: Moni Shoua monis at voltaire.com --- drivers/net/bonding/bond_main.c | 24 +--- drivers/net/bonding/bonding.h | 1 + 2 files changed, 22 insertions(+), 3 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:56:33.0 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 11:04:37.221123652 +0300 @@ -1102,8 +1102,14 @@ void bond_change_active_slave(struct bon if (new_active && !bond->do_set_mac_addr) memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, new_active->dev->addr_len); - - bond_send_gratuitous_arp(bond); + if (bond->curr_active_slave && + test_bit(__LINK_STATE_LINKWATCH_PENDING, + &bond->curr_active_slave->dev->state)) { + dprintk("delaying gratuitous arp on %s\n", + bond->curr_active_slave->dev->name); + bond->send_grat_arp = 1; + } else + bond_send_gratuitous_arp(bond); } } @@ -2083,6 +2089,17 @@ void bond_mii_monitor(struct net_device * program could monitor the link itself if needed.
*/ + if (bond->send_grat_arp) { + if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING, + &bond->curr_active_slave->dev->state)) + dprintk("Needs to send gratuitous arp but not yet\n"); + else { + dprintk("sending delayed gratuitous arp on %s\n", + bond->curr_active_slave->dev->name); + bond_send_gratuitous_arp(bond); + bond->send_grat_arp = 0; + } + } read_lock(&bond->curr_slave_lock); oldcurrent = bond->curr_active_slave; read_unlock(&bond->curr_slave_lock); @@ -2484,7 +2501,7 @@ static void bond_send_gratuitous_arp(str if (bond->master_ip) { bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip, - bond->master_ip, 0); + bond->master_ip, 0); } list_for_each_entry(vlan, &bond->vlan_list, vlan_list) { @@ -4293,6 +4310,7 @@ static int bond_init(struct net_device * bond->current_arp_slave = NULL; bond->primary_slave = NULL; bond->dev = bond_dev; + bond->send_grat_arp = 0; INIT_LIST_HEAD(&bond->vlan_list); /* Initialize the device entry points */ Index: net-2.6/drivers/net/bonding/bonding.h === --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:56:33.0 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-08-15 11:05:41.516451497 +0300 @@ -187,6 +187,7 @@ struct bonding { struct timer_list arp_timer; s8 kill_timers; s8 do_set_mac_addr; + s8 send_grat_arp; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry;
[PATCH V6 8/9] net/bonding: Destroy bonding master when last slave is gone
When bonding enslaves non-Ethernet devices it takes pointers to functions in the module that owns the slaves. In this case it becomes unsafe to keep the bonding master registered after the last slave is unenslaved, because we don't know if the pointers are still valid. Destroying the bond when slave_cnt is zero ensures that these functions cannot be used anymore. Signed-off-by: Moni Shoua monis at voltaire.com --- drivers/net/bonding/bond_main.c | 37 + drivers/net/bonding/bond_sysfs.c |9 + drivers/net/bonding/bonding.h|3 +++ 3 files changed, 45 insertions(+), 4 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-09-24 14:01:24.055441842 +0200 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-09-24 14:05:05.658979207 +0200 @@ -1256,6 +1256,7 @@ static int bond_compute_features(struct static void bond_setup_by_slave(struct net_device *bond_dev, struct net_device *slave_dev) { + struct bonding *bond = bond_dev->priv; bond_dev->hard_header = slave_dev->hard_header; bond_dev->rebuild_header = slave_dev->rebuild_header; bond_dev->hard_header_cache = slave_dev->hard_header_cache; @@ -1270,6 +1271,7 @@ static void bond_setup_by_slave(struct n memcpy(bond_dev->broadcast, slave_dev->broadcast, slave_dev->addr_len); + bond->setup_by_slave = 1; } /* enslave device slave to bond device master */ @@ -1838,6 +1840,35 @@ int bond_release(struct net_device *bond } /* +* Destroy a bonding device. +* Must be under rtnl_lock when this function is called. +*/ +void bond_destroy(struct bonding *bond) +{ + bond_deinit(bond->dev); + bond_destroy_sysfs_entry(bond); + unregister_netdevice(bond->dev); +} + +/* +* First release a slave and then destroy the bond if no more slaves are left. +* Must be under rtnl_lock when this function is called.
+*/ +int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev) +{ + struct bonding *bond = bond_dev->priv; + int ret; + + ret = bond_release(bond_dev, slave_dev); + if ((ret == 0) && (bond->slave_cnt == 0)) { + printk(KERN_INFO DRV_NAME ": %s: destroying bond %s.\n", + bond_dev->name); + bond_destroy(bond); + } + return ret; +} + +/* * This function releases all slaves. */ static int bond_release_all(struct net_device *bond_dev) @@ -3337,6 +3368,11 @@ static int bond_slave_netdev_event(unsig * ... Or is it this? */ break; + case NETDEV_GOING_DOWN: + dprintk("slave %s is going down\n", slave_dev->name); + if (bond->setup_by_slave) + bond_release_and_destroy(bond_dev, slave_dev); + break; case NETDEV_CHANGEMTU: /* * TODO: Should slaves be allowed to @@ -4311,6 +4347,7 @@ static int bond_init(struct net_device * bond->primary_slave = NULL; bond->dev = bond_dev; bond->send_grat_arp = 0; + bond->setup_by_slave = 0; INIT_LIST_HEAD(&bond->vlan_list); /* Initialize the device entry points */ Index: net-2.6/drivers/net/bonding/bonding.h === --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-09-24 14:01:24.055441842 +0200 +++ net-2.6/drivers/net/bonding/bonding.h 2007-09-24 14:01:24.627340013 +0200 @@ -188,6 +188,7 @@ struct bonding { s8 kill_timers; s8 do_set_mac_addr; s8 send_grat_arp; + s8 setup_by_slave; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; @@ -295,6 +296,8 @@ static inline void bond_unset_master_alb struct vlan_entry *bond_next_vlan(struct bonding *bond, struct vlan_entry *curr); int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb, struct net_device *slave_dev); int bond_create(char *name, struct bond_params *params, struct bonding **newbond); +void bond_destroy(struct bonding *bond); +int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev); void bond_deinit(struct net_device *bond_dev); int bond_create_sysfs(void); void bond_destroy_sysfs(void); Index:
net-2.6/drivers/net/bonding/bond_sysfs.c === --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-09-24 14:01:23.523536550 +0200 +++ net-2.6/drivers/net/bonding/bond_sysfs.c2007-09-24 14:01:24.628339835 +0200 @@ -164,9 +164,7 @@ static ssize_t bonding_store_bonds(struc printk(KERN_INFO DRV_NAME : %s
[PATCH 9/9] bonding: Optionally allow ethernet slaves to keep own MAC
Update the "don't change MAC of slaves" functionality added in previous changes to be a generic option, rather than something tied to IB devices, as it's occasionally useful for regular ethernet devices as well. Adds fail_over_mac option (which is automatically enabled for IB slaves), applicable only to active-backup mode. Includes documentation update. Updates bonding driver version to 3.2.0. Signed-off-by: Jay Vosburgh [EMAIL PROTECTED] --- Documentation/networking/bonding.txt | 33 +++ drivers/net/bonding/bond_main.c | 57 + drivers/net/bonding/bond_sysfs.c | 49 + drivers/net/bonding/bonding.h|6 ++-- 4 files changed, 121 insertions(+), 24 deletions(-) diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 1da5666..1134062 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -281,6 +281,39 @@ downdelay will be rounded down to the nearest multiple. The default value is 0. +fail_over_mac + + Specifies whether active-backup mode should set all slaves to + the same MAC address (the traditional behavior), or, when + enabled, change the bond's MAC address when changing the + active interface (i.e., fail over the MAC address itself). + + Fail over MAC is useful for devices that cannot ever alter + their MAC address, or for devices that refuse incoming + broadcasts with their own source MAC (which interferes with + the ARP monitor). + + The down side of fail over MAC is that every device on the + network must be updated via gratuitous ARP, vs. just updating + a switch or set of switches (which often takes place for any + traffic, not just ARP traffic, if the switch snoops incoming + traffic to update its tables) for the traditional method. If + the gratuitous ARP is lost, communication may be disrupted.
+ + When fail over MAC is used in conjunction with the mii monitor, + devices which assert link up prior to being able to actually + transmit and receive are particularly susceptible to loss of + the gratuitous ARP, and an appropriate updelay setting may be + required. + + A value of 0 disables fail over MAC, and is the default. A + value of 1 enables fail over MAC. This option is enabled + automatically if the first slave added cannot change its MAC + address. This option may be modified via sysfs only when no + slaves are present in the bond. + + This option was added in bonding version 3.2.0. + lacp_rate Option specifying the rate in which we'll ask our link partner diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 77caca3..c01ff9d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -97,6 +97,7 @@ static char *xmit_hash_policy = NULL; static int arp_interval = BOND_LINK_ARP_INTERV; static char *arp_ip_target[BOND_MAX_ARP_TARGETS] = { NULL, }; static char *arp_validate = NULL; +static int fail_over_mac = 0; struct bond_params bonding_defaults; module_param(max_bonds, int, 0); @@ -130,6 +131,8 @@ module_param_array(arp_ip_target, charp, NULL, 0); MODULE_PARM_DESC(arp_ip_target, "arp targets in n.n.n.n form"); module_param(arp_validate, charp, 0); MODULE_PARM_DESC(arp_validate, "validate src/dst of ARP probes: none (default), active, backup or all"); +module_param(fail_over_mac, int, 0); +MODULE_PARM_DESC(fail_over_mac, "For active-backup, do not set all slaves to the same MAC. 0 for off (default), 1 for on."); /*- Global variables */ @@ -1099,7 +1102,7 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active) /* when bonding does not set the slave MAC address, the bond MAC * address is the one of the active slave.
*/ - if (new_active && !bond->do_set_mac_addr) + if (new_active && bond->params.fail_over_mac) memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, new_active->dev->addr_len); if (bond->curr_active_slave @@ -1371,16 +1374,16 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) if (slave_dev->set_mac_address == NULL) { if (bond->slave_cnt == 0) { printk(KERN_WARNING DRV_NAME - ": %s: Warning: The first slave device you " - "specified does not support setting the MAC " - "address. This bond MAC address would be that " - "of the active slave.\n", bond_dev->name); - bond->do_set_mac_addr = 0; - } else
Re: [PATCH V6 5/9] net/bonding: Enable IP multicast for bonding IPoIB devices
On Mon, 24 Sep 2007 17:37:00 +0200 Moni Shoua [EMAIL PROTECTED] wrote: Allow to enslave devices when the bonding device is not up. Over the discussion held at the previous post this seemed to be the most clean way to go, where it is not expected to cause instabilities. Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() has set the bonding device type to be ARPHRD_ETHER and address len to be ETHER_ALEN, the net core code computes a wrong multicast link address. This is b/c ip_eth_mc_map() is called, whereas for multicast joins taking place after the enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND). Signed-off-by: Moni Shoua monis at voltaire.com Signed-off-by: Or Gerlitz ogerlitz at voltaire.com --- drivers/net/bonding/bond_main.c |5 +++-- drivers/net/bonding/bond_sysfs.c |6 ++ 2 files changed, 5 insertions(+), 6 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c === --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.0 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:55:48.431862446 +0300 @@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond /* bond must be initialized by bond_open() before enslaving */ if (!(bond_dev->flags & IFF_UP)) { - dprintk("Error, master_dev is not up\n"); - return -EPERM; + printk(KERN_WARNING DRV_NAME + "%s: master_dev is not up in bond_enslave\n", + bond_dev->name); } /* already enslaved */ Index: net-2.6/drivers/net/bonding/bond_sysfs.c === --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:08:58.0 +0300 +++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:55:48.432862269 +0300 @@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru /* Quick sanity check -- is the bond interface up?
*/ if (!(bond->dev->flags & IFF_UP)) { - printk(KERN_ERR DRV_NAME - ": %s: Unable to update slaves because interface is down.\n", + printk(KERN_WARNING DRV_NAME + ": %s: doing slave updates when interface is down.\n", bond->dev->name); - ret = -EPERM; - goto out; } Please get rid of the warning. Make bonding work correctly and allow enslave/remove of device when bonding is down.
Re: [PATCH] sb1250-mac: Driver model phylib update
On Mon, 24 Sep 2007 15:41:54 +0100 (BST) Maciej W. Rozycki [EMAIL PROTECTED] wrote: On Fri, 21 Sep 2007, Andrew Morton wrote: A driver model and phylib update. akpm:/usr/src/25 diffstat patches/git-net.patch | tail -n 1 1013 files changed, 187667 insertions(+), 23587 deletions(-) Sorry, but raising networking patches against Linus's crufty old mainline tree just isn't viable at present. Well, this is against Jeff's netdev-2.6 tree which hopefully is not as crufty as Linus's old mainline; if it is not possible to queue this change for 2.6.25 or suchlike, then I will try to resubmit later. Most of Jeff's netdev tree got dumped into Dave's net-2.6.24 tree. That's the one you want to be raising patches against for the next few weeks.
Re: [PATCH net-2.6.24] introduce MAC_FMT/MAC_ARG
On Wed, 2007-09-19 at 12:54 -0700, David Miller wrote: Applied to net-2.6.24, thanks Joe! Here is a patch that adds some type safety to print_mac by using a struct print_mac_buf * instead of char *. It also reduces the defconfig vmlinux size by 8 bytes. Signed-off-by: Joe Perches [EMAIL PROTECTED] -- include/linux/if_ether.h | 12 ++-- net/ethernet/eth.c |6 +++--- 2 files changed, 13 insertions(+), 5 deletions(-) diff --git a/include/linux/if_ether.h b/include/linux/if_ether.h index 57abca1..620d6b1 100644 --- a/include/linux/if_ether.h +++ b/include/linux/if_ether.h @@ -126,7 +126,15 @@ extern struct ctl_table ether_table[]; * Display a 6 byte device address (MAC) in a readable format. */ #define MAC_FMT "%02x:%02x:%02x:%02x:%02x:%02x" -extern char *print_mac(char *buf, const u8 *addr); -#define DECLARE_MAC_BUF(var) char var[18] __maybe_unused + +struct print_mac_buf { + char formatted_mac_addr[18]; +}; + +#define DECLARE_MAC_BUF(var) \ + struct print_mac_buf __maybe_unused _##var; \ + struct print_mac_buf __maybe_unused *var = &_##var + +extern char *print_mac(struct print_mac_buf *buf, const u8 *addr); #endif /* _LINUX_IF_ETHER_H */ diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c index 2aaf6fa..ad82613 100644 --- a/net/ethernet/eth.c +++ b/net/ethernet/eth.c @@ -338,10 +338,10 @@ struct net_device *alloc_etherdev_mq(int sizeof_priv, unsigned int queue_count) } EXPORT_SYMBOL(alloc_etherdev_mq); -char *print_mac(char *buf, const u8 *addr) +char *print_mac(struct print_mac_buf *buf, const u8 *addr) { - sprintf(buf, MAC_FMT, + sprintf(buf->formatted_mac_addr, MAC_FMT, addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]); - return buf; + return buf->formatted_mac_addr; } EXPORT_SYMBOL(print_mac);
Re: [PATCH 1/2] bnx2: factor out gzip unpacker
On Fri, Sep 21, 2007 at 11:37:52PM +0100, Denys Vlasenko wrote: But I compile net/* into bzImage. I like netbooting :) Isn't it possible to netboot with an initramfs image? I am pretty sure I have seen some systems do exactly that. -- Len Sorensen
Re: pktgen question
From: Steve Wise [EMAIL PROTECTED] Date: Mon, 24 Sep 2007 08:54:13 -0500 I think pktgen should be cloning the skbs using skb_clone(). Then it will work for all devices, eh? The problem is that skb_clone() is (relatively) expensive and pktgen is trying to just grab a reference to the SKB in the absolutely cheapest way possible. In my personal opinion, I think what the drivers that don't work with pktgen are doing is wrong and they should find another way to pass state around other than to depend upon being able to use the ->cb[] area at-will.
Re: pktgen question
Perf-wise, you could clone the skbs up front, then deliver them to the nic in a tight loop. This would mitigate the added overhead introduced by calling skb_clone() in the loop doing transmits... That only works if you are sending a small number of skbs. You can't pre-clone several minutes worth of 10Gbe traffic with any normal amount of RAM. Does pktgen really need to allocate anything more than some smallish fraction more than the depth of the driver's transmit queue? Of course, even that won't fit in the L1 cache of a processor, so if one is really just trying to max-out the NIC it might still have too much overhead, but then does pktgen+driver et al actually fit in an L1 cache? rick jones
Re: pktgen question
Rick Jones wrote: Perf-wise, you could clone the skbs up front, then deliver them to the nic in a tight loop. This would mitigate the added overhead introduced by calling skb_clone() in the loop doing transmits... That only works if you are sending a small number of skbs. You can't pre-clone several minutes worth of 10Gbe traffic with any normal amount of RAM. Does pktgen really need to allocate anything more than some smallish fraction more than the depth of the driver's transmit queue? If you want to send sustained high rates of traffic, for more than just a trivial amount of time, then you either have to play the current trick with the skb_get(), or you have to allocate a real packet each time (maybe with skb_clone() or similar, but it's still more overhead than the skb_get which only bumps a reference count.) I see no other way, but if you can think of one, please let me know. Thanks, Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com
RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
I have submitted this before; but here it is again. Against net-2.6.24 from yesterday for this and all following patches. cheers, jamal Hi Jamal, I've been (slowly) working on resurrecting the original design of my multiqueue patches to address this exact issue of the queue_lock being a hot item. I added a queue_lock to each queue in the subqueue struct, and in the enqueue and dequeue, just lock that queue instead of the global device queue_lock. The only two issues to overcome are the QDISC_RUNNING state flag, since that also serializes entry into the qdisc_restart() function, and the qdisc statistics maintenance, which needs to be serialized. Do you think this work along with your patch will benefit from one another? I apologize for not having working patches right now, but I am working on them slowly as I have some blips of spare time. Thanks, -PJ Waskiewicz
Re: [PATCH 2/7] CAN: Add PF_CAN core module
Joe Perches [EMAIL PROTECTED] writes: I'd prefer something like this, which removes the unnecessary kmalloc/kfree pairs or the equivalent conversions to functions. I have changed this to a static buffer. Since this is only in #ifdef CONFIG_CAN_DEBUG_CORE, it shouldn't hurt. #define can_debug_cframe(cf, fmt, arg...) \ do { \ char buf[20]; \ int dlc = cf->can_dlc; \ if (dlc > 8) \ dlc = 8; \ if (cf->can_id & CAN_EFF_FLAG) \ sprintf(buf, "%08X [%X] ", cf->can_id & CAN_EFF_MASK, dlc); \ else \ sprintf(buf, "%03X [%X] ", cf->can_id & CAN_SFF_MASK, dlc); \ printk(KERN_DEBUG fmt, ##arg); \ print_hex_dump(buf, DUMP_PREFIX_NONE, cf->data, dlc); \ } while (0) The line isn't printed atomically, i.e. it could happen that something is printed between the fmt+args and the hexdump, i.e. in the middle of the line. and #define can_debug_skb(skb) \ do { \ pr_debug("skbuff at %p, dev: %d, proto: %04x\n", \ skb, skb->dev ? skb->dev->ifindex : -1, \ ntohs(skb->protocol)); \ pr_debug("users: %d, dataref: %d, nr_frags: %d, " \ "h,d,t,e,l: %p %+d %+d %+d, %d\n", \ atomic_read(&skb->users), \ atomic_read(&(skb_shinfo(skb)->dataref)), \ skb_shinfo(skb)->nr_frags, \ skb->head, skb->data - skb->head, \ skb->tail - skb->head, skb->end - skb->head, skb->len); \ print_hex_dump(KERN_DEBUG, "skb_head: ", DUMP_PREFIX_NONE, \ 16, 1, skb->head, skb->end - skb->head); \ } while (0) Here, the consecutive lines aren't printed atomically. I think sprintf() to a buffer first and then do one printk() is better. But I will use hex_dump_to_buffer(). urs
deadlock in genetlink?
[Sorry about the duplicate mail, Thomas: I got the netdev address wrong on the first try.] lockdep reported a circular dependency between cb_mutex and genl_mutex, as follows: - #1 (nlk->cb_mutex){--..}: [c0127b5d] __lock_acquire+0x9c8/0xb98 [c01280f6] lock_acquire+0x5d/0x75 [c027adc1] __mutex_lock_slowpath+0xdb/0x247 [c027af49] mutex_lock+0x1c/0x1f [c0236aa0] netlink_dump_start+0xa9/0x12f --- takes cb_mutex [c0237fa6] genl_rcv_msg+0xa3/0x14c [c0235aa5] netlink_run_queue+0x6f/0xe1 [c0237772] genl_rcv+0x2d/0x4e --- trylocks genl_mutex [c0235ef0] netlink_data_ready+0x15/0x55 [c0234e98] netlink_sendskb+0x1f/0x36 [c0235772] netlink_unicast+0x1a6/0x1c0 [c0235ecf] netlink_sendmsg+0x238/0x244 [c021a92e] sock_sendmsg+0xcb/0xe4 [c021aa98] sys_sendmsg+0x151/0x1af [c021b850] sys_socketcall+0x203/0x222 [c01025d2] syscall_call+0x7/0xb [] 0x - #0 (genl_mutex){--..}: [c0127a46] __lock_acquire+0x8b1/0xb98 [c01280f6] lock_acquire+0x5d/0x75 [c027adc1] __mutex_lock_slowpath+0xdb/0x247 [c027af49] mutex_lock+0x1c/0x1f [c023717d] genl_lock+0xd/0xf [c023806f] ctrl_dumpfamily+0x20/0xdd --- takes genl_mutex [c0234bfa] netlink_dump+0x50/0x168 --- takes cb_mutex [c02360f2] netlink_recvmsg+0x15f/0x22f [c021a84a] sock_recvmsg+0xd5/0xee [c021b24e] sys_recvmsg+0xf5/0x187 [c021b865] sys_socketcall+0x218/0x222 [c01025d2] syscall_call+0x7/0xb [] 0x The trylock in genl_rcv doesn't prevent the deadlock, because it's not the last lock acquired. Perhaps this can be fixed by not acquiring the genl_mutex in ctrl_dumpfamily; it silences lockdep, at least. It is not clear to me what the value of taking the mutex is there. If this is an appropriate fix, here is a patch for it.
Signed-off-by: Ben Pfaff [EMAIL PROTECTED] diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index 8c11ca4..2e79035 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -616,9 +616,6 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) int chains_to_skip = cb->args[0]; int fams_to_skip = cb->args[1]; - if (chains_to_skip != 0) - genl_lock(); - for (i = 0; i < GENL_FAM_TAB_SIZE; i++) { if (i < chains_to_skip) continue; @@ -636,9 +633,6 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) } errout: - if (chains_to_skip != 0) - genl_unlock(); - cb->args[0] = i; cb->args[1] = n; -- Ben Pfaff Nicira Networks, Inc.
[PATCH 6/9] fs_enet: Be an of_platform device when CONFIG_PPC_CPM_NEW_BINDING is set.
The existing OF glue code was crufty and broken. Rather than fix it, it will be removed, and the ethernet driver now talks to the device tree directly. The old, non-CONFIG_PPC_CPM_NEW_BINDING code can go away once CPM platforms are dropped from arch/ppc (which will hopefully be soon), and existing arch/powerpc boards that I wasn't able to test on for this patchset get converted (which should be even sooner). Signed-off-by: Scott Wood [EMAIL PROTECTED] --- Resent, with linux/of_platform.h instead of asm/of_platform.h drivers/net/fs_enet/Kconfig|1 + drivers/net/fs_enet/fs_enet-main.c | 259 --- drivers/net/fs_enet/fs_enet.h | 55 +--- drivers/net/fs_enet/mac-fcc.c | 89 + drivers/net/fs_enet/mac-fec.c | 19 +++- drivers/net/fs_enet/mac-scc.c | 53 +-- drivers/net/fs_enet/mii-bitbang.c | 269 +++- drivers/net/fs_enet/mii-fec.c | 143 +++- include/linux/fs_enet_pd.h |5 + 9 files changed, 714 insertions(+), 179 deletions(-) diff --git a/drivers/net/fs_enet/Kconfig b/drivers/net/fs_enet/Kconfig index e27ee21..2765e49 100644 --- a/drivers/net/fs_enet/Kconfig +++ b/drivers/net/fs_enet/Kconfig @@ -11,6 +11,7 @@ config FS_ENET_HAS_SCC config FS_ENET_HAS_FCC bool "Chip has an FCC usable for ethernet" depends on FS_ENET && CPM2 + select MDIO_BITBANG default y config FS_ENET_HAS_FEC diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c index a4b76cd..c5abdea 100644 --- a/drivers/net/fs_enet/fs_enet-main.c +++ b/drivers/net/fs_enet/fs_enet-main.c @@ -44,12 +44,18 @@ #include <asm/irq.h> #include <asm/uaccess.h> +#ifdef CONFIG_PPC_CPM_NEW_BINDING +#include <linux/of_platform.h> +#endif + #include "fs_enet.h" /*/ +#ifndef CONFIG_PPC_CPM_NEW_BINDING static char version[] __devinitdata = DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n"; +#endif MODULE_AUTHOR("Pantelis Antoniou [EMAIL PROTECTED]"); MODULE_DESCRIPTION("Freescale Ethernet Driver"); @@ -953,6 +959,7 @@ static int fs_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) extern int
fs_mii_connect(struct net_device *dev); extern void fs_mii_disconnect(struct net_device *dev); +#ifndef CONFIG_PPC_CPM_NEW_BINDING static struct net_device *fs_init_instance(struct device *dev, struct fs_platform_info *fpi) { @@ -1132,6 +1139,7 @@ static int fs_cleanup_instance(struct net_device *ndev) return 0; } +#endif /**/ @@ -1140,35 +1148,250 @@ void *fs_enet_immap = NULL; static int setup_immap(void) { - phys_addr_t paddr = 0; - unsigned long size = 0; - #ifdef CONFIG_CPM1 - paddr = IMAP_ADDR; - size = 0x10000; /* map 64K */ -#endif - -#ifdef CONFIG_CPM2 - paddr = CPM_MAP_ADDR; - size = 0x40000; /* map 256 K */ + fs_enet_immap = ioremap(IMAP_ADDR, 0x4000); + WARN_ON(!fs_enet_immap); +#elif defined(CONFIG_CPM2) + fs_enet_immap = cpm2_immr; #endif - fs_enet_immap = ioremap(paddr, size); - if (fs_enet_immap == NULL) - return -EBADF; /* XXX ahem; maybe just BUG_ON? */ return 0; } static void cleanup_immap(void) { - if (fs_enet_immap != NULL) { - iounmap(fs_enet_immap); - fs_enet_immap = NULL; - } +#if defined(CONFIG_CPM1) + iounmap(fs_enet_immap); +#endif } /**/ +#ifdef CONFIG_PPC_CPM_NEW_BINDING +static int __devinit find_phy(struct device_node *np, + struct fs_platform_info *fpi) +{ + struct device_node *phynode, *mdionode; + struct resource res; + int ret = 0, len; + + const u32 *data = of_get_property(np, "phy-handle", &len); + if (!data || len != 4) + return -EINVAL; + + phynode = of_find_node_by_phandle(*data); + if (!phynode) + return -EINVAL; + + mdionode = of_get_parent(phynode); + if (!phynode) + goto out_put_phy; + + ret = of_address_to_resource(mdionode, 0, &res); + if (ret) + goto out_put_mdio; + + data = of_get_property(phynode, "reg", &len); + if (!data || len != 4) + goto out_put_mdio; + + snprintf(fpi->bus_id, 16, PHY_ID_FMT, res.start, *data); + +out_put_mdio: + of_node_put(mdionode); +out_put_phy: + of_node_put(phynode); + return ret; +} + +#ifdef CONFIG_FS_ENET_HAS_FEC +#define IS_FEC(match) ((match)->data == &fs_fec_ops) +#else +#define IS_FEC(match) 0
+#endif + +static int __devinit fs_enet_probe(struct of_device *ofdev, + const struct of_device_id *match) +{ + struct
[SCTP PULL Request]: Bug fixes for 2.6.23
Hi David Can you please pull the following changes since commit a41d3015c11a4e864b95cb57f579f2d8f40cd41b: David S. Miller (1): Revert PCI: disable MSI by default on systems with Serverworks HT1000 chips which are available in the git repository at: [EMAIL PROTECTED]:/pub/scm/linux/kernel/git/vxy/lksctp-dev.git master Vlad Yasevich (4): SCTP: Validate buffer room when processing sequential chunks SCTP: Explicitly discard OOTB chunks SCTP: Clean up OOTB handling and fix infinite loop processing SCTP: Discard OOTB packets with bundled INIT early. Wei Yongjun (2): SCTP: Send ABORT chunk with correct tag in response to INIT ACK SCTP : Add paramters validity check for ASCONF chunk include/net/sctp/sm.h |4 +- include/net/sctp/structs.h |1 + net/sctp/input.c |8 ++ net/sctp/inqueue.c |8 ++ net/sctp/sm_make_chunk.c | 46 net/sctp/sm_statefuns.c| 243 +--- net/sctp/sm_statetable.c | 16 ++-- 7 files changed, 278 insertions(+), 48 deletions(-)
[RFC] Zero-length write() does not generate a datagram on connected socket
The bug http://bugzilla.kernel.org/show_bug.cgi?id=5731 describes an issue where write() can't be used to generate a zero-length datagram (but send, and sendto do work). I think the following is needed: --- a/net/socket.c 2007-08-20 09:54:28.0 -0700 +++ b/net/socket.c 2007-09-24 15:31:25.0 -0700 @@ -777,8 +777,11 @@ static ssize_t sock_aio_write(struct kio if (pos != 0) return -ESPIPE; - if (iocb->ki_left == 0) /* Match SYS5 behaviour */ - return 0; + if (unlikely(iocb->ki_left == 0)) { + struct socket *sock = iocb->ki_filp->private_data; + if (sock->type == SOCK_STREAM) + return 0; + } x = alloc_sock_iocb(iocb, &siocb); if (!x)
Re: [PATCHES] TX batching
On Mon, 2007-24-09 at 00:00 -0700, Kok, Auke wrote: that's bad to begin with :) - please send those separately so I can fasttrack them into e1000e and e1000 where applicable. Ive been CCing you ;- Most of the changes are readability and reusability with the batching. But yes, I'm very inclined to merge more features into e1000e than e1000. I intend to put multiqueue support into e1000e, as *all* of the hardware that it will support has multiple queues. Putting in any other performance feature like tx batching would absolutely be interesting. I looked at the e1000e and it is very close to e1000 so i should be able to move the changes easily. Most importantly, can i kill LLTX? For tx batching, we have to wait to see how Dave wants to move forward; i will have the patches but it is not something you need to push until we see where that is going. cheers, jamal
RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 2007-24-09 at 12:12 -0700, Waskiewicz Jr, Peter P wrote: Hi Jamal, I've been (slowly) working on resurrecting the original design of my multiqueue patches to address this exact issue of the queue_lock being a hot item. I added a queue_lock to each queue in the subqueue struct, and in the enqueue and dequeue, just lock that queue instead of the global device queue_lock. The only two issues to overcome are the QDISC_RUNNING state flag, since that also serializes entry into the qdisc_restart() function, and the qdisc statistics maintenance, which needs to be serialized. Do you think this work along with your patch will benefit from one another? The one thing that seems obvious is to use dev->hard_prep_xmit() in the patches i posted to select the xmit ring. You should be able to figure out the xmit ring without holding any lock. I lost track of how/where things went since the last discussion; so i need to wrap my mind around it to make sensible suggestions - I know the core patches are in the kernel but havent paid attention to details, and if you look at my second patch you'd see a comment in dev_batch_xmit() which says i need to scrutinize multiqueue more. cheers, jamal
[DOC] Net batching driver howto
I have updated the driver howto to match the patches i posted yesterday. attached. cheers, jamal

Here's the beginning of a howto for driver authors. The intended audience for this howto is people already familiar with netdevices.

1.0 Netdevice Pre-requisites
--

For hardware based netdevices, you must have at least hardware that is capable of doing DMA with many descriptors; i.e. having hardware with a queue length of 3 (as in some fscked ethernet hardware) is not very useful in this case.

2.0 What is new in the driver API
---

There are 3 new methods and one new variable introduced. These are:
1) dev->hard_prep_xmit()
2) dev->hard_end_xmit()
3) dev->hard_batch_xmit()
4) dev->xmit_win

2.1 Using Core driver changes
-

To provide context, lets look at a typical driver abstraction for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interrupts, etc.

[For code cleanliness/readability sake, regardless of this work, one should break the dev->hard_start_xmit() into those 4 functions anyways.]

A driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this)

Note: There are drivers which may need not support any of the two methods (example the tun driver i patched), so the two methods are essentially optional.

2.1.1 Theory of operation
--

The core will first do the packet formatting by invoking your supplied dev->hard_prep_xmit() method. It will then pass you the packet via your dev->hard_start_xmit() method, for as many packets as you have advertised (via dev->xmit_win) you can consume.
Lastly it will invoke your dev->hard_end_xmit() when it completes passing you all the packets queued for you.

2.1.1.1 Locking rules
-

dev->hard_prep_xmit() is invoked without holding any tx lock but the rest are under TX_LOCK(). So you have to ensure that whatever you put in dev->hard_prep_xmit() doesnt require locking.

2.1.1.2 The slippery LLTX
-

LLTX drivers present a challenge in that we have to introduce a deviation from the norm and require the ->hard_batch_xmit() method. An LLTX driver presents us with ->hard_batch_xmit() to which we pass it a list of packets in a dev->blist skb queue. It is then the responsibility of the ->hard_batch_xmit() to exercise steps #b and #c for all packets passed in the dev->blist. Steps #a and #d are done by the core should you register presence of dev->hard_prep_xmit() and dev->hard_end_xmit() in your setup.

2.1.1.3 xmit_win

The dev->xmit_win variable is set by the driver to tell us how much space it has in its rings/queues. dev->xmit_win is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we dont requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us, whenever it invokes netif_wake_queue, how much space it has for descriptors by setting this variable.

3.0 Driver Essentials
-

The typical driver tx state machine is:

-1- + Core sends packets
    + Driver puts packet onto hardware queue
    + if hardware queue is full, netif_stop_queue(dev)
-2- + core stops sending because of netif_stop_queue(dev)
    .. time passes ...
-3- + driver has transmitted packets, opens up tx path by invoking netif_wake_queue(dev)
-1- + Cycle repeats and core sends more packets (step 1).

3.1 Driver pre-requisite
--

This is _a very important_ requirement in making batching useful. The pre-requisite for batching changes is that the driver should provide a low threshold to open up the tx path.
Drivers such as tg3 and e1000 already do this. Before you invoke netif_wake_queue(dev) you check if there is a threshold of space reached to insert new packets. Here's an example of how i added it to the tun driver. Observe the setting of dev->xmit_win:

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Here's how the batching e1000 driver does it:

---
	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
		if (netif_queue_stopped(netdev)) {
			int rspace = E1000_DESC_UNUSED(tx_ring) -
Re: [PATCHES] TX batching
jamal wrote: On Mon, 2007-24-09 at 00:00 -0700, Kok, Auke wrote: that's bad to begin with :) - please send those separately so I can fasttrack them into e1000e and e1000 where applicable. I've been CCing you ;- Most of the changes are readability and reusability with the batching. But yes, I'm very inclined to merge more features into e1000e than e1000. I intend to put multiqueue support into e1000e, as *all* of the hardware that it will support has multiple queues. Putting in any other performance feature like tx batching would absolutely be interesting. I looked at the e1000e and it is very close to e1000 so i should be able to move the changes easily. Most importantly, can i kill LLTX? For tx batching, we have to wait to see how Dave wants to move forward; i will have the patches but it is not something you need to push until we see where that is going. hmm, I thought I already removed that, but now I see some remnants from that. By all means, please send a separate patch for that! Auke
RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
The one thing that seems obvious is to use dev->hard_prep_xmit() in the patches i posted to select the xmit ring. You should be able to figure out the xmit ring without holding any lock. I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. I lost track of how/where things went since the last discussion; so i need to wrap my mind around it to make sensible suggestions - I know the core patches are in the kernel but havent paid attention to details, and if you look at my second patch you'd see a comment in dev_batch_xmit() which says i need to scrutinize multiqueue more. No worries. I'll try to get things together on my end and provide some patches to add a per-queue lock. In the meantime, I'll take a much closer look at the batching code, since I stopped looking at the patches in-depth about a month ago. :-( Thanks, -PJ Waskiewicz
RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of tx locks, one per ring? How is the policy to define the qdisc queues locked/mapped to tx rings? cheers, jamal
RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: We should make sure we're symmetric with the locking on enqueue to dequeue. If we use the single device queue lock on enqueue, then dequeue will also need to check that lock in addition to the individual queue lock. The details of this are more trivial than the actual dequeue to make it efficient though. Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of tx locks, one per ring? How is the policy to define the qdisc queues locked/mapped to tx rings? The dequeue locking would be pushed into the qdisc itself. This is how I had it originally, and it did make the code more complex, but it was successful at breaking the heavily-contended queue_lock apart. I have a subqueue structure right now in netdev, which only has queue_state (for netif_{start|stop}_subqueue). This state is checked in sch_prio right now in the dequeue for both prio and rr. My approach is to add a queue_lock in that struct, so each queue allocated by the driver would have a lock per queue. Then in dequeue, that lock would be taken when the skb is about to be dequeued. The skb->queue_mapping field also maps directly to the queue index itself, so it can be unlocked easily outside of the context of the dequeue function. The policy would be to use a spin_trylock() in dequeue, so that dequeue can still do work if enqueue or another dequeue is busy. And the allocation of qdisc queues to device queues is assumed to be one-to-one (that's how the qdisc behaves now). I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. 
Thanks, -PJ Waskiewicz
Re: [PATCHES] TX batching
jamal wrote: If the intel folks will accept the patch i'd really like to kill the e1000 LLTX interface. If I understood DaveM correctly, it is sounding like we want to deprecate all use of LLTX on real hardware? If so, several such projects might be considered, as well as possibly simplifying the TX batching work. Also, WRT e1000 specifically, I was hoping to minimize changes, and focus people on e1000e. e1000e replaces (deprecates) large portions of e1000, namely the support for the modern PCI Express chips. When e1000e has proven itself in the field, we can potentially look at several e1000 simplifications, during the large scale code removal that becomes possible. Jeff
Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 24 Sep 2007 16:47:06 -0700 Waskiewicz Jr, Peter P [EMAIL PROTECTED] wrote: On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: We should make sure we're symmetric with the locking on enqueue to dequeue. If we use the single device queue lock on enqueue, then dequeue will also need to check that lock in addition to the individual queue lock. The details of this are more trivial than the actual dequeue to make it efficient though. Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of tx locks, one per ring? How is the policy to define the qdisc queues locked/mapped to tx rings? The dequeue locking would be pushed into the qdisc itself. This is how I had it originally, and it did make the code more complex, but it was successful at breaking the heavily-contended queue_lock apart. I have a subqueue structure right now in netdev, which only has queue_state (for netif_{start|stop}_subqueue). This state is checked in sch_prio right now in the dequeue for both prio and rr. My approach is to add a queue_lock in that struct, so each queue allocated by the driver would have a lock per queue. Then in dequeue, that lock would be taken when the skb is about to be dequeued. The skb->queue_mapping field also maps directly to the queue index itself, so it can be unlocked easily outside of the context of the dequeue function. The policy would be to use a spin_trylock() in dequeue, so that dequeue can still do work if enqueue or another dequeue is busy. And the allocation of qdisc queues to device queues is assumed to be one-to-one (that's how the qdisc behaves now). 
I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. Thanks, -PJ Waskiewicz - Since we are redoing this, is there any way to make the whole TX path more lockless? The existing model seems to be more of a monitor than a real locking model. -- Stephen Hemminger [EMAIL PROTECTED]
RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. Thanks, -PJ Waskiewicz - Since we are redoing this, is there any way to make the whole TX path more lockless? The existing model seems to be more of a monitor than a real locking model. That seems quite reasonable. I will certainly see what I can do. Thanks Stephen, -PJ Waskiewicz
[PATCH] sky2: FE+ Phy initialization
One more snippet of PHY initialization required for FE+ chips. Discovered in latest sk98lin 10.21.1.3 driver. Please apply to 2.6.23.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- a/drivers/net/sky2.c	2007-09-24 09:10:36.0 -0700
+++ b/drivers/net/sky2.c	2007-09-24 10:22:28.0 -0700
@@ -338,6 +338,16 @@ static void sky2_phy_init(struct sky2_hw
 	if (!(hw->flags & SKY2_HW_GIGABIT)) {
 		/* enable automatic crossover */
 		ctrl |= PHY_M_PC_MDI_XMODE(PHY_M_PC_ENA_AUTO) >> 1;
+
+		if (hw->chip_id == CHIP_ID_YUKON_FE_P &&
+		    hw->chip_rev == CHIP_REV_YU_FE2_A0) {
+			u16 spec;
+
+			/* Enable Class A driver for FE+ A0 */
+			spec = gm_phy_read(hw, port, PHY_MARV_FE_SPEC_2);
+			spec |= PHY_M_FESC_SEL_CL_A;
+			gm_phy_write(hw, port, PHY_MARV_FE_SPEC_2, spec);
+		}
 	} else {
 		/* disable energy detect */
 		ctrl &= ~PHY_M_PC_EN_DET_MSK;
Re: [PATCH] sky2: FE+ Phy initialization
Stephen Hemminger wrote: One more snippet of PHY initialization required for FE+ chips. Discovered in latest sk98lin 10.21.1.3 driver. Please apply to 2.6.23. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] applied
Re: [PATCH] sky2: be more selective about FIFO watchdog
Stephen Hemminger wrote: Be more selective about when to enable the ram buffer watchdog code. It is unnecessary on XL A3 or later revs, and with Yukon FE the buffer is so small (4K) that the watchdog detects false positives. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] applied
Re: [PATCH] Remove broken netfilter binary sysctls from bridging code
Stephen Hemminger wrote: On Mon, 24 Sep 2007 18:55:38 +0200 Patrick McHardy [EMAIL PROTECTED] wrote: Eric W. Biederman wrote: A really good fix would be to remove the binary side and then to modify brnf_sysctl_call_tables to allocate a temporary ctl_table and integer on the stack and only set ctl-data after we have normalized the written value. But since in practice nothing cares about the race a better fix probably isn't worth it. I seem to be missing something, the entire brnf_sysctl_call_tables thing looks purely cosmetic to me, wouldn't it be better to simply remove it? I agree, removing seems like a better option. But probably need to go through a 3-6mo warning period, since sysctl's are technically an API. I meant removing the brnf_sysctl_call_tables function, not the sysctls themselves, all it does is change values != 0 to 1. Or did you actually mean that something in userspace might depend on reading back the value 1 after writing a value != 0?
Re: [PATCH 1/2] bnx2: factor out gzip unpacker
On Sep 24, 2007, at 13:32:23, Lennart Sorensen wrote: On Fri, Sep 21, 2007 at 11:37:52PM +0100, Denys Vlasenko wrote: But I compile net/* into bzImage. I like netbooting :) Isn't it possible to netboot with an initramfs image? I am pretty sure I have seen some systems do exactly that. Yeah, I've got Debian boxes that have never *not* netbooted (one Dell Op^?^?Craptiplex box whose BIOS and ACPI sucks so bad it can't even load GRUB/LILO, although Windows somehow works fine). So they boot PXELinux using the PXE boot ROM on the NICs and it loads both a kernel and an initramfs into memory. Kernel is stock Debian and hardly has enough built-in to spit at you, let alone find network/disks, but it manages to load everything it needs off the automagically-generated initramfs. Cheers, Kyle Moffett
[git patches] net driver fixes
Please pull from 'upstream-linus' branch of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git upstream-linus to receive the following updates:

 drivers/net/pcmcia/3c589_cs.c |  2 +-
 drivers/net/r8169.c           | 14 +-
 drivers/net/sky2.c            | 37 -
 drivers/net/sky2.h            |  2 +-
 4 files changed, 39 insertions(+), 16 deletions(-)

Edward Hsu (1):
      r8169: correct phy parameters for the 8110SC

Francois Romieu (1):
      r8169: workaround against ignored TxPoll writes (8168)

Jeff Garzik (1):
      Revert "drivers/net/pcmcia/3c589_cs: fix port configuration switcheroo"

Stephen Hemminger (2):
      sky2: FE+ Phy initialization
      sky2: be more selective about FIFO watchdog

diff --git a/drivers/net/pcmcia/3c589_cs.c b/drivers/net/pcmcia/3c589_cs.c
index c06cae3..503f268 100644
--- a/drivers/net/pcmcia/3c589_cs.c
+++ b/drivers/net/pcmcia/3c589_cs.c
@@ -116,7 +116,7 @@ struct el3_private {
 	spinlock_t lock;
 };
 
-static const char *if_names[] = { "auto", "10base2", "10baseT", "AUI" };
+static const char *if_names[] = { "auto", "10baseT", "10base2", "AUI" };
 
 /**/
 
diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index b85ab4a..c921ec3 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -1228,7 +1228,10 @@ static void rtl8169_hw_phy_config(struct net_device *dev)
 		return;
 	}
 
-	/* phy config for RTL8169s mac_version C chip */
+	if ((tp->mac_version != RTL_GIGA_MAC_VER_02) &&
+	    (tp->mac_version != RTL_GIGA_MAC_VER_03))
+		return;
+
 	mdio_write(ioaddr, 31, 0x0001);	//w 31 2 0 1
 	mdio_write(ioaddr, 21, 0x1000);	//w 21 15 0 1000
 	mdio_write(ioaddr, 24, 0x65c7);	//w 24 15 0 65c7
@@ -2567,6 +2570,15 @@ static void rtl8169_tx_interrupt(struct net_device *dev,
 		    (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)) {
 			netif_wake_queue(dev);
 		}
+		/*
+		 * 8168 hack: TxPoll requests are lost when the Tx packets are
+		 * too close. Let's kick an extra TxPoll request when a burst
+		 * of start_xmit activity is detected (if it is not detected,
+		 * it is slow enough). -- FR
+		 */
+		smp_rmb();
+		if (tp->cur_tx != dirty_tx)
+			RTL_W8(TxPoll, NPQ);
 	}
 }
 
diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index eaffe55..0792031 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -338,6 +338,16 @@ static void sky2_phy_init(struct sky2_hw *hw, unsigned port)
 	if (!(hw->flags & SKY2_HW_GIGABIT)) {
 		/* enable automatic crossover */
 		ctrl |= PHY_M_PC_MDI_XMODE(PHY_M_PC_ENA_AUTO) >> 1;
+
+		if (hw->chip_id == CHIP_ID_YUKON_FE_P &&
+		    hw->chip_rev == CHIP_REV_YU_FE2_A0) {
+			u16 spec;
+
+			/* Enable Class A driver for FE+ A0 */
+			spec = gm_phy_read(hw, port, PHY_MARV_FE_SPEC_2);
+			spec |= PHY_M_FESC_SEL_CL_A;
+			gm_phy_write(hw, port, PHY_MARV_FE_SPEC_2, spec);
+		}
 	} else {
 		/* disable energy detect */
 		ctrl &= ~PHY_M_PC_EN_DET_MSK;
@@ -816,7 +826,8 @@ static void sky2_mac_init(struct sky2_hw *hw, unsigned port)
 	sky2_write8(hw, SK_REG(port, TX_GMF_CTRL_T), GMF_RST_CLR);
 	sky2_write16(hw, SK_REG(port, TX_GMF_CTRL_T), GMF_OPER_ON);
 
-	if (!(hw->flags & SKY2_HW_RAMBUFFER)) {
+	/* On chips without ram buffer, pause is controled by MAC level */
+	if (sky2_read8(hw, B2_E_0) == 0) {
 		sky2_write8(hw, SK_REG(port, RX_GMF_LP_THR), 768/8);
 		sky2_write8(hw, SK_REG(port, RX_GMF_UP_THR), 1024/8);
@@ -1271,7 +1282,7 @@ static int sky2_up(struct net_device *dev)
 	struct sky2_port *sky2 = netdev_priv(dev);
 	struct sky2_hw *hw = sky2->hw;
 	unsigned port = sky2->port;
-	u32 imask;
+	u32 imask, ramsize;
 	int cap, err = -ENOMEM;
 	struct net_device *otherdev = hw->dev[sky2->port^1];
@@ -1326,13 +1337,12 @@ static int sky2_up(struct net_device *dev)
 	sky2_mac_init(hw, port);
 
-	if (hw->flags & SKY2_HW_RAMBUFFER) {
-		/* Register is number of 4K blocks on internal RAM buffer. */
-		u32 ramsize = sky2_read8(hw, B2_E_0) * 4;
+	/* Register is number of 4K blocks on internal RAM buffer. */
+	ramsize = sky2_read8(hw, B2_E_0) * 4;
+	if (ramsize > 0) {
 		u32 rxspace;
 
-		printk(KERN_DEBUG PFX "%s: ram buffer %dK\n", dev->name, ramsize);
-
+