Re: [NET]: Prevent multiple qdisc runs

2006-06-20 Thread David Miller
From: Herbert Xu <[EMAIL PROTECTED]>
Date: Mon, 19 Jun 2006 22:15:19 +1000

> [NET]: Prevent multiple qdisc runs

I have no real objection to this semantically.

But this is yet another atomic operation on the transmit
path :-(  This problem, however, is inevitable because of
how we do things and thus isn't the fault of your change.

I'm going to apply this patch to 2.6.18, however...  we should split
up the dev->state handling into separate cacheline synchronizers.
Sharing RX and TX locking bits in the same word is not all that
efficient.

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NET]: Prevent multiple qdisc runs

2006-06-20 Thread Herbert Xu
On Mon, Jun 19, 2006 at 11:57:19PM -0700, David Miller wrote:
> 
> But this is yet another atomic operation on the transmit
> path :-(  This problem, however, is inevitable because of
> how we do things and thus isn't the fault of your change.
> 
> I'm going to apply this patch to 2.6.18, however...  we should split
> up the dev->state handling into separate cacheline synchronizers.
> Sharing RX and TX locking bits in the same word is not all that
> efficient.

Good point.  This particular bit doesn't even need to be atomic since
it's sitting inside a spinlock.
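
A minimal userspace sketch of that point, assuming a pthread mutex in place of the kernel spinlock.  The struct and function names are hypothetical and are not the code from the patch; the only thing it shows is that a plain flag is enough to keep out a second runner when every access happens with the lock held.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device's transmit queue state. */
struct txq_model {
	pthread_mutex_t lock;	/* plays the role of the queue spinlock */
	bool running;		/* plain flag: only touched with lock held */
	int runs;		/* how many times a run was actually started */
};

static void txq_init(struct txq_model *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->running = false;
	q->runs = 0;
}

/* Returns true if this caller became the runner, false if one is active. */
static bool txq_try_begin_run(struct txq_model *q)
{
	bool won;

	pthread_mutex_lock(&q->lock);
	won = !q->running;
	if (won) {
		q->running = true;
		q->runs++;
	}
	pthread_mutex_unlock(&q->lock);
	return won;
}

/* Clears the flag so the next caller may run the queue. */
static void txq_end_run(struct txq_model *q)
{
	pthread_mutex_lock(&q->lock);
	q->running = false;
	pthread_mutex_unlock(&q->lock);
}
```

A second call to txq_try_begin_run() before txq_end_run() returns false, so no atomic bit operation is needed for the flag itself.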

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[FORCEDETH]: Fix xmit_lock/netif_tx_lock after merge

2006-06-20 Thread Herbert Xu
Hi:

[FORCEDETH]: Fix xmit_lock/netif_tx_lock after merge

There has been an update to the forcedeth driver that added a few new
uses of xmit_lock which is no longer meant to be used directly.  This
patch replaces them with netif_tx_lock_bh.
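
To illustrate why callers should go through a helper rather than take the lock field directly, here is a userspace sketch with hypothetical names (a pthread mutex stands in for the spinlock, and the holder field is only an example of bookkeeping a wrapper can hide).  It is a model of the idea, not the kernel's netif_tx_lock_bh.

```c
#include <pthread.h>

/* Hypothetical userspace model of a net device; not the kernel's struct. */
struct netdev_model {
	pthread_mutex_t tx_lock_private; /* callers must not touch this directly */
	int tx_lock_holder;		 /* -1 when unlocked, else a caller id */
};

static void netdev_model_init(struct netdev_model *dev)
{
	pthread_mutex_init(&dev->tx_lock_private, NULL);
	dev->tx_lock_holder = -1;
}

/* Take the transmit lock and record who holds it. */
static void model_tx_lock(struct netdev_model *dev, int caller_id)
{
	pthread_mutex_lock(&dev->tx_lock_private);
	dev->tx_lock_holder = caller_id;
}

/* Clear the holder and release the transmit lock. */
static void model_tx_unlock(struct netdev_model *dev)
{
	dev->tx_lock_holder = -1;
	pthread_mutex_unlock(&dev->tx_lock_private);
}
```

With this shape the lock field can be renamed or extra bookkeeping added in one place, and drivers that only call the helpers do not need to change.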

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index 04a53f1..62b38a4 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -2991,13 +2991,13 @@ static int nv_set_settings(struct net_de
netif_carrier_off(dev);
if (netif_running(dev)) {
nv_disable_irq(dev);
-   spin_lock_bh(&dev->xmit_lock);
+   netif_tx_lock_bh(dev);
spin_lock(&np->lock);
/* stop engines */
nv_stop_rx(dev);
nv_stop_tx(dev);
spin_unlock(&np->lock);
-   spin_unlock_bh(&dev->xmit_lock);
+   netif_tx_unlock_bh(dev);
}
 
if (ecmd->autoneg == AUTONEG_ENABLE) {
@@ -3131,13 +3131,13 @@ static int nv_nway_reset(struct net_devi
netif_carrier_off(dev);
if (netif_running(dev)) {
nv_disable_irq(dev);
-   spin_lock_bh(&dev->xmit_lock);
+   netif_tx_lock_bh(dev);
spin_lock(&np->lock);
/* stop engines */
nv_stop_rx(dev);
nv_stop_tx(dev);
spin_unlock(&np->lock);
-   spin_unlock_bh(&dev->xmit_lock);
+   netif_tx_unlock_bh(dev);
printk(KERN_INFO "%s: link down.\n", dev->name);
}
 
@@ -3244,7 +3244,7 @@ static int nv_set_ringparam(struct net_d
 
if (netif_running(dev)) {
nv_disable_irq(dev);
-   spin_lock_bh(&dev->xmit_lock);
+   netif_tx_lock_bh(dev);
spin_lock(&np->lock);
/* stop engines */
nv_stop_rx(dev);
@@ -3303,7 +3303,7 @@ static int nv_set_ringparam(struct net_d
nv_start_rx(dev);
nv_start_tx(dev);
spin_unlock(&np->lock);
-   spin_unlock_bh(&dev->xmit_lock);
+   netif_tx_unlock_bh(dev);
nv_enable_irq(dev);
}
return 0;
@@ -3339,13 +3339,13 @@ static int nv_set_pauseparam(struct net_
netif_carrier_off(dev);
if (netif_running(dev)) {
nv_disable_irq(dev);
-   spin_lock_bh(&dev->xmit_lock);
+   netif_tx_lock_bh(dev);
spin_lock(&np->lock);
/* stop engines */
nv_stop_rx(dev);
nv_stop_tx(dev);
spin_unlock(&np->lock);
-   spin_unlock_bh(&dev->xmit_lock);
+   netif_tx_unlock_bh(dev);
}
 
np->pause_flags &= ~(NV_PAUSEFRAME_RX_REQ|NV_PAUSEFRAME_TX_REQ);
@@ -3729,7 +3729,7 @@ static void nv_self_test(struct net_devi
if (test->flags & ETH_TEST_FL_OFFLINE) {
if (netif_running(dev)) {
netif_stop_queue(dev);
-   spin_lock_bh(&dev->xmit_lock);
+   netif_tx_lock_bh(dev);
spin_lock_irq(&np->lock);
nv_disable_hw_interrupts(dev, np->irqmask);
if (!(np->msi_flags & NV_MSI_X_ENABLED)) {
@@ -3745,7 +3745,7 @@ static void nv_self_test(struct net_devi
nv_drain_rx(dev);
nv_drain_tx(dev);
spin_unlock_irq(&np->lock);
-   spin_unlock_bh(&dev->xmit_lock);
+   netif_tx_unlock_bh(dev);
}
 
if (!nv_register_test(dev)) {


Re: [FORCEDETH]: Fix xmit_lock/netif_tx_lock after merge

2006-06-20 Thread David Miller
From: Herbert Xu <[EMAIL PROTECTED]>
Date: Tue, 20 Jun 2006 17:04:50 +1000

> [FORCEDETH]: Fix xmit_lock/netif_tx_lock after merge

Thanks for checking out that merge conflict.

Linus, please apply.


Re: [FORCEDETH]: Fix xmit_lock/netif_tx_lock after merge

2006-06-20 Thread Jeff Garzik

Herbert Xu wrote:

> Hi:
>
> [FORCEDETH]: Fix xmit_lock/netif_tx_lock after merge
>
> There has been an update to the forcedeth driver that added a few new
> uses of xmit_lock which is no longer meant to be used directly.  This
> patch replaces them with netif_tx_lock_bh.
>
> Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>


ACK.  I'll apply, if Linus doesn't pick this up...

Jeff





Re: [Bug 6688] Memory allocation problem

2006-06-20 Thread Andrew Morton
On Mon, 19 Jun 2006 23:46:08 -0700
[EMAIL PROTECTED] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=6688

This is looking like a net memory leak in 2.6.16.  1/3rd is in ip_fib_alias
and 2/3rds is in size-64.  I've asked the reporter to apply the leak
detector patch so we can find out who is using the size-64 part.


Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL

2006-06-20 Thread Jesper Dangaard Brouer

On Mon, 2006-06-19 at 22:35 -0700, Chris Wedgwood wrote:
> On Wed, Jun 14, 2006 at 11:40:04AM +0200, Jesper Dangaard Brouer wrote:
> 
> > The Linux traffic control engine inaccurately calculates
> > transmission times for packets sent over ADSL links.  For some
> > packet sizes the error rises to over 50%.  This occurs because ADSL
> > uses ATM as its link layer transport, and ATM transmits packets in
> > fixed sized 53 byte cells.
> 
> What if AAL5 is used?  The cell-alignment math is going to be wrong
> there surely?

Actually it _is_ AAL5 which is accounted for.

See Chapter 5 "ADSL link layer overhead" (page 48-54).
http://www.adsl-optimizer.dk/thesis/main_final_hyper.pdf
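
As a rough sketch of the cell arithmetic, assuming only the AAL5 trailer (8 bytes) and the 48-byte cell payload, and ignoring any encapsulation overhead such as PPP or LLC/SNAP headers that a real ADSL link adds.  The function names are made up for illustration and this is not the code from the patches.

```c
#include <stddef.h>

enum {
	ATM_CELL_BYTES = 53,	/* fixed cell size on the wire */
	ATM_CELL_PAYLOAD = 48,	/* payload bytes carried per cell */
	AAL5_TRAILER_BYTES = 8	/* AAL5 trailer appended to each frame */
};

/* Number of cells needed for one AAL5 frame carrying pdu_len bytes. */
static size_t aal5_cells(size_t pdu_len)
{
	size_t framed = pdu_len + AAL5_TRAILER_BYTES;

	/* The frame is padded up to a whole number of cell payloads. */
	return (framed + ATM_CELL_PAYLOAD - 1) / ATM_CELL_PAYLOAD;
}

/* Bytes that actually occupy the link for that frame. */
static size_t aal5_wire_bytes(size_t pdu_len)
{
	return aal5_cells(pdu_len) * ATM_CELL_BYTES;
}
```

Under these assumptions a 40-byte PDU fits in one cell (53 bytes on the wire) while a 41-byte PDU needs two (106 bytes), which is the kind of step a scheduler misses when it assumes the wire size equals the packet size.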

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk





Re: [DOC]: generic netlink

2006-06-20 Thread Thomas Graf
* jamal <[EMAIL PROTECTED]> 2006-06-19 09:41
> // the attributes you want to own
> 
> enum {
> FOOBAR_ATTR_UNSPEC,
> FOOBAR_ATTR_TYPE,
> FOOBAR_ATTR_TYPEID,
> FOOBAR_ATTR_TYPENAME,
> FOOBAR_ATTR_OPER,
>   /* add future attributes here */
> __FOOBAR_ATTR_MAX,
> };
> 
> #define FOOBAR_ATTR_MAX (__FOOBAR_ATTR_MAX - 1)

One important point about attributes in generic netlink is that
their scope is per command instead of per family as in netlink.
It's not forbidden to use the same set of attribute identifiers
for two separate commands but it should be avoided to have a
single large list of attributes and have every command pick out
the attributes it needs.
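
A small sketch of what per-command attribute scoping could look like for the FOOBAR example.  The command names, the per-command attribute names and the helper are all hypothetical; the point is only that each command gets its own short attribute list instead of sharing one large family-wide list.

```c
/* Hypothetical commands for the FOOBAR family. */
enum {
	FOOBAR_CMD_UNSPEC,
	FOOBAR_CMD_GET_TYPE,
	FOOBAR_CMD_SET_OPER,
	__FOOBAR_CMD_MAX,
};
#define FOOBAR_CMD_MAX (__FOOBAR_CMD_MAX - 1)

/* Attributes understood by FOOBAR_CMD_GET_TYPE only. */
enum {
	FOOBAR_GET_TYPE_ATTR_UNSPEC,
	FOOBAR_GET_TYPE_ATTR_TYPEID,
	FOOBAR_GET_TYPE_ATTR_TYPENAME,
	__FOOBAR_GET_TYPE_ATTR_MAX,
};
#define FOOBAR_GET_TYPE_ATTR_MAX (__FOOBAR_GET_TYPE_ATTR_MAX - 1)

/* Attributes understood by FOOBAR_CMD_SET_OPER only. */
enum {
	FOOBAR_SET_OPER_ATTR_UNSPEC,
	FOOBAR_SET_OPER_ATTR_OPER,
	__FOOBAR_SET_OPER_ATTR_MAX,
};
#define FOOBAR_SET_OPER_ATTR_MAX (__FOOBAR_SET_OPER_ATTR_MAX - 1)

/* Largest attribute id valid for a command, or -1 for an unknown command. */
static int foobar_attr_max(int cmd)
{
	switch (cmd) {
	case FOOBAR_CMD_GET_TYPE:
		return FOOBAR_GET_TYPE_ATTR_MAX;
	case FOOBAR_CMD_SET_OPER:
		return FOOBAR_SET_OPER_ATTR_MAX;
	default:
		return -1;
	}
}
```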


> TODO:
> a) Add a more complete compiling kernel module with events.
> Have Thomas put his Mashimaro example and point to it.

I guess we have a legal issue here ;)

> b) Describe some details on how user space -> kernel works
> probably using libnl??

I'll take care of that.


Re: [IOC3] IP27: Really set PCI64_ATTR_VIRTUAL, not PCI64_ATTR_PREC.

2006-06-20 Thread Ingo Oeser
Hi Ralf,

Ralf Baechle :
> IOC3's homegrown DMA mapping functions that are used to optimize things
> a little on IP27 set the wrong bit.

What about using a symbol instead of magic numbers?
That way one at least sees the intention of the coder.
 
> Signed-off-by: Ralf Baechle <[EMAIL PROTECTED]>
> 
> diff --git a/drivers/net/ioc3-eth.c b/drivers/net/ioc3-eth.c
> index ae71ed5..e76e6e7 100644
> --- a/drivers/net/ioc3-eth.c
> +++ b/drivers/net/ioc3-eth.c
> @@ -145,7 +145,7 @@ static inline struct sk_buff * ioc3_allo
>  static inline unsigned long ioc3_map(void *ptr, unsigned long vdev)
>  {
>  #ifdef CONFIG_SGI_IP27
> - vdev <<= 58;   /* Shift to PCI64_ATTR_VIRTUAL */
> + vdev <<= 57;   /* Shift to PCI64_ATTR_VIRTUAL */

So please use a symbolic value here.
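
Something along these lines is what I mean.  The macro and function names are made up for illustration, the bit position 57 is the one from the patch above, and the sketch covers only the shift, not the rest of ioc3_map().

```c
#include <stdint.h>

/* Hypothetical name; the bit position is taken from the patch above. */
#define PCI64_ATTR_VIRTUAL_SHIFT 57

/* Move the virtual device number into its attribute field. */
static inline uint64_t ioc3_vdev_attr(uint64_t vdev)
{
	return vdev << PCI64_ATTR_VIRTUAL_SHIFT;
}
```

With a named shift the comment and the code cannot drift apart the way "58" and PCI64_ATTR_VIRTUAL did.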
  

Regards

Ingo Oeser


Re: [PATCH 1/2] e1000: fix netpoll with NAPI

2006-06-20 Thread Andrew Grover

(trimmed CC to just netdev)


> > One of our engineers (on the I/O AT team) has been tasked with modifying
> > the Linux kernel to properly support multiple hardware queues (both TX and
> > RX).  We'll make sure that he looks at the netpoll interface as part of
> > that process.
>
> Might I ask who this is?  I might like to ping him/her on this topic.
> There is potentially some overlap with wireless, at least on the
> transmit side.
> John W. Linville


Hi John, so yeah we want multiple TX queues on wired ethernet for QoS,
same as wireless. Also for SMP scaling (maybe not needed until 10G+).
Did wireless people settle on a design since the email thread in
January?

Regards -- Andy


Re: [PATCH 1/6] b44: fix manual speed/duplex/autoneg settings

2006-06-20 Thread Jeff Garzik

Gary Zambrano wrote:

> Fixes for speed/duplex/autoneg settings and driver settings info.
> This is a redo of a previous patch thanks to feedback from Jeff Garzik.


ACK patches 1-6, but unfortunately failed to apply against latest 
linux-2.6.git:



[EMAIL PROTECTED] netdev-2.6]$ git-applymbox /g/tmp/mbox ~/info/signoff.txt
6 patch(es) to process.

Applying 'b44: fix manual speed/duplex/autoneg settings'

fatal: corrupt patch at line 8



Also, I think I misunderstood the code in our last discussion.  You may 
be right about the advertise-all logic.


Jeff




Re: Please pull 'upstream' branch of wireless-2.6

2006-06-20 Thread Jeff Garzik

John W. Linville wrote:

> The following changes since commit 76df73ff90e99681a99e457aec4cfe0a240b7982:
>   John W. Linville:
>         Merge branch 'from-linus' into upstream
>
> are found in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream
>
> Jiri Slaby:
>       pci: bcm43xx avoid pci_find_device
>
> Larry Finger:
>       wireless: Changes to ieee80211.h for user space regulatory daemon
>       wireless: correct dump of WPA IE
>
> Michael Buesch:
>       bcm43xx: redesign locking
>       bcm43xx: preemptible periodic work
>
> Zhu Yi:
>       ipw2200 locking fix


pulled




Re: Pull request for 'upstream' branch

2006-06-20 Thread Jeff Garzik

Francois Romieu wrote:

> Please pull from branch 'upstream' to get the change below:
>
> git://electric-eye.fr.zoreil.com/home/romieu/linux-2.6.git
>
> Patch applies both to jeff#upstream and jeff#upstream-fixes
>
> Shortlog
>
> Pedro Alejandro López-Valencia:
>       sundance: PCI ID for ip100a


pulled, thanks




Re: [PATCH 1/2] 8139cp: fix eeprom read command length

2006-06-20 Thread Jeff Garzik

Philip Craig wrote:

> The read command for the 93C46/93C56 EEPROMS should be 3 bits plus
> the address.  This doesn't appear to affect the operation of the
> read command, but similar errors for write commands do cause failures.
>
> Signed-off-by: Philip Craig <[EMAIL PROTECTED]>


ACK patches 1-2, but patch appears corrupted:

[EMAIL PROTECTED] netdev-2.6]$ git-applymbox /g/tmp/mbox ~/info/signoff.txt
2 patch(es) to process.

Applying '8139cp: fix eeprom read command length'

fatal: corrupt patch at line 7



[0/5] GSO: Generic Segmentation Offload

2006-06-20 Thread Herbert Xu
Hi:

This series adds Generic Segmentation Offload (GSO) support to the Linux
networking stack.

Many people have observed that a lot of the savings in TSO come from
traversing the networking stack once rather than many times for each
super-packet.  These savings can be obtained without hardware support.
In fact, the concept can be applied to other protocols such as TCPv6,
UDP, or even DCCP.

The key to minimising the cost in implementing this is to postpone the
segmentation as late as possible.  In the ideal world, the segmentation
would occur inside each NIC driver where they would rip the super-packet
apart and either produce SG lists which are directly fed to the hardware,
or linearise each segment into pre-allocated memory to be fed to the NIC.
This would eliminate segmented skb's altogether.

Unfortunately this requires modifying each and every NIC driver so it
would take quite some time.  A much easier solution is to perform the
segmentation just before the entry into the driver's xmit routine.  This
series of patches does this.
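
As a rough illustration of what "performing the segmentation" means in terms of sizes, here is a sketch that splits a super-packet's payload into MSS-sized pieces.  The function names are hypothetical, headers are ignored, and this is not the code from the patches; it only models the arithmetic.

```c
#include <stddef.h>

/* Number of segments a payload of payload_len bytes splits into. */
static size_t gso_model_seg_count(size_t payload_len, size_t mss)
{
	if (mss == 0)
		return 0;	/* no valid segment size: nothing to split */
	return (payload_len + mss - 1) / mss;
}

/* Payload length of segment idx (0-based); the last one may be short. */
static size_t gso_model_seg_len(size_t payload_len, size_t mss, size_t idx)
{
	size_t offset = idx * mss;
	size_t left;

	if (mss == 0 || offset >= payload_len)
		return 0;	/* past the end of the payload */
	left = payload_len - offset;
	return left < mss ? left : mss;
}
```

For example, a 64000-byte payload with an MSS of 1448 becomes 45 segments, the last carrying 288 bytes.  The stack above the driver handles it as one packet and only this final step sees 45.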

I've attached some numbers to demonstrate the savings brought on by
doing this.  The best scenario is obviously the case where the underlying
NIC supports SG.  This means that we simply have to manipulate the SG
entries and place them into individual skb's before passing them to the
driver.  The attached file lo-res shows this.

The test was performed through the loopback device which is a fairly good
approximation of an SG-capable NIC.

GSO, like TSO, is only effective if the MTU is significantly less than the
maximum value of 64K.  So only the case where the MTU was set to 1500 is
of interest.  There we can see that the throughput improved by 17.5%
(3061.05Mb/s => 3598.17Mb/s).  The actual saving in transmission cost is
in fact a lot more than that as the majority of the time here is spent on
the RX side which still has to deal with 1500-byte packets.
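
For reference, the 17.5% figure is just the relative change between the two netperf results quoted above.  A tiny helper (hypothetical name) that reproduces the arithmetic:

```c
/* Relative throughput change in percent, going from before to after. */
static double throughput_gain_percent(double before_mbps, double after_mbps)
{
	return (after_mbps - before_mbps) / before_mbps * 100.0;
}
```

Calling it with 3061.05 and 3598.17 gives a value between 17.5 and 17.6.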

The worst-case scenario is where the NIC does not support SG and the user
uses write(2) which means that we have to copy the data twice.  The files
gso-off/gso-on provide data for this case (the test was carried out on
e100).  As you can see, the cost of the extra copy is mostly offset by the
reduction in the cost of going through the networking stack.

For now GSO is off by default but can be enabled through ethtool.  It is
conceivable that with enough optimisation GSO could be a win in most cases
and we could enable it by default.

However, even without enabling GSO explicitly it can still function on
bridged and forwarded packets.  As it is, passing TSO packets through a
bridge only works if all constituents support TSO.  With GSO, it provides
a fallback so that we may enable TSO for a bridge even if some of its
constituents do not support TSO.

This provides massive savings for Xen as it uses a bridge-based architecture
and TSO/GSO produces a much larger effective MTU for internal traffic between
domains.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[1/5] [NET]: Merge TSO/UFO fields in sk_buff

2006-06-20 Thread Herbert Xu
Hi:

[NET]: Merge TSO/UFO fields in sk_buff

Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not
going to scale if we add any more segmentation methods (e.g., DCCP).  So
let's merge them.

They were used to tell the protocol of a packet.  This function has been
subsumed by the new gso_type field.  This is essentially a set of netdev
feature bits (shifted by 16 bits) that are required to process a specific
skb.  As such it's easy to tell whether a given device can process a GSO
skb: you just have to AND the gso_type field with the netdev's features
field.

I've made gso_type a conjunction.  The idea is that you have a base type
(e.g., SKB_GSO_TCPV4) that can be modified further to support new features.
For example, if we add a hardware TSO type that supports ECN, they would
declare NETIF_F_TSO | NETIF_F_TSO_ECN.  All TSO packets with CWR set would
have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO
packets would be SKB_GSO_TCPV4.  This means that only the CWR packets need
to be emulated in software.  The emulation could even chop it up into one
CWR fragment and another super-packet to be further segmented by the NIC.
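
A sketch of the check described above, with made-up bit values (the real constants live in the patch; only the 16-bit shift is taken from the description).  A device can take a GSO skb when every bit of the shifted gso_type is present in its features.

```c
#include <stdbool.h>

#define MODEL_GSO_SHIFT 16	/* shift stated in the description above */

/* Illustrative bit values only; not taken from any kernel header. */
enum {
	MODEL_SKB_GSO_TCPV4     = 1 << 0,
	MODEL_SKB_GSO_TCPV4_ECN = 1 << 1,
};

#define MODEL_NETIF_F_TSO \
	((unsigned long)MODEL_SKB_GSO_TCPV4 << MODEL_GSO_SHIFT)
#define MODEL_NETIF_F_TSO_ECN \
	((unsigned long)MODEL_SKB_GSO_TCPV4_ECN << MODEL_GSO_SHIFT)

/* True if the device features cover everything this gso_type requires. */
static bool model_dev_can_segment(unsigned long features, unsigned int gso_type)
{
	unsigned long needed = (unsigned long)gso_type << MODEL_GSO_SHIFT;

	return (features & needed) == needed;
}
```

In this model a device that declares only the TSO bit accepts plain TCPV4 packets but rejects TCPV4 | TCPV4_ECN ones, so only the latter would need software emulation.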

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
diff --git a/drivers/net/8139cp.c b/drivers/net/8139cp.c
--- a/drivers/net/8139cp.c
+++ b/drivers/net/8139cp.c
@@ -792,7 +792,7 @@ static int cp_start_xmit (struct sk_buff
entry = cp->tx_head;
eor = (entry == (CP_TX_RING_SIZE - 1)) ? RingEnd : 0;
if (dev->features & NETIF_F_TSO)
-   mss = skb_shinfo(skb)->tso_size;
+   mss = skb_shinfo(skb)->gso_size;
 
if (skb_shinfo(skb)->nr_frags == 0) {
struct cp_desc *txd = &cp->tx_ring[entry];
diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -1640,7 +1640,7 @@ bnx2_tx_int(struct bnx2 *bp)
skb = tx_buf->skb;
 #ifdef BCM_TSO 
/* partial BD completions possible with TSO packets */
-   if (skb_shinfo(skb)->tso_size) {
+   if (skb_shinfo(skb)->gso_size) {
u16 last_idx, last_ring_idx;
 
last_idx = sw_cons +
@@ -4428,7 +4428,7 @@ bnx2_start_xmit(struct sk_buff *skb, str
(TX_BD_FLAGS_VLAN_TAG | (vlan_tx_tag_get(skb) << 16));
}
 #ifdef BCM_TSO 
-   if ((mss = skb_shinfo(skb)->tso_size) &&
+   if ((mss = skb_shinfo(skb)->gso_size) &&
(skb->len > (bp->dev->mtu + ETH_HLEN))) {
u32 tcp_opt_len, ip_tcp_len;
 
diff --git a/drivers/net/chelsio/sge.c b/drivers/net/chelsio/sge.c
--- a/drivers/net/chelsio/sge.c
+++ b/drivers/net/chelsio/sge.c
@@ -1418,7 +1418,7 @@ int t1_start_xmit(struct sk_buff *skb, s
struct cpl_tx_pkt *cpl;
 
 #ifdef NETIF_F_TSO
-   if (skb_shinfo(skb)->tso_size) {
+   if (skb_shinfo(skb)->gso_size) {
int eth_type;
struct cpl_tx_pkt_lso *hdr;
 
@@ -1433,7 +1433,7 @@ int t1_start_xmit(struct sk_buff *skb, s
hdr->ip_hdr_words = skb->nh.iph->ihl;
hdr->tcp_hdr_words = skb->h.th->doff;
hdr->eth_type_mss = htons(MK_ETH_TYPE_MSS(eth_type,
-   skb_shinfo(skb)->tso_size));
+   skb_shinfo(skb)->gso_size));
hdr->len = htonl(skb->len - sizeof(*hdr));
cpl = (struct cpl_tx_pkt *)hdr;
sge->stats.tx_lso_pkts++;
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -2394,7 +2394,7 @@ e1000_tso(struct e1000_adapter *adapter,
uint8_t ipcss, ipcso, tucss, tucso, hdr_len;
int err;
 
-   if (skb_shinfo(skb)->tso_size) {
+   if (skb_shinfo(skb)->gso_size) {
if (skb_header_cloned(skb)) {
err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
if (err)
@@ -2402,7 +2402,7 @@ e1000_tso(struct e1000_adapter *adapter,
}
 
hdr_len = ((skb->h.raw - skb->data) + (skb->h.th->doff << 2));
-   mss = skb_shinfo(skb)->tso_size;
+   mss = skb_shinfo(skb)->gso_size;
if (skb->protocol == htons(ETH_P_IP)) {
skb->nh.iph->tot_len = 0;
skb->nh.iph->check = 0;
@@ -2519,7 +2519,7 @@ e1000_tx_map(struct e1000_adapter *adapt
 * tso gets written back prematurely before the data is fully
 * DMA'd to the controller */
if (!skb->data_len && tx_ring->last_tx_tso &&
-   !skb_shinfo(skb)->tso_size) {
+   !skb_shinfo(s

[2/5] [NET]: Add generic segmentation offload

2006-06-20 Thread Herbert Xu
Hi:

[NET]: Add generic segmentation offload

This patch adds the infrastructure for generic segmentation offload.
The idea is to tap into the potential savings of TSO without hardware
support by postponing the allocation of segmented skb's until just
before the entry point into the NIC driver.

The same structure can be used to support software IPv6 TSO, as well as
UFO and segmentation offload for other relevant protocols, e.g., DCCP.

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -406,6 +406,9 @@ struct net_device
struct list_head        qdisc_list;
unsigned long           tx_queue_len;   /* Max frames per queue allowed */
 
+   /* Partially transmitted GSO packet. */
+   struct sk_buff  *gso_skb;
+
/* ingress path synchronizer */
spinlock_t              ingress_lock;
struct Qdisc            *qdisc_ingress;
@@ -540,6 +543,7 @@ struct packet_type {
 struct net_device *,
 struct packet_type *,
 struct net_device *);
+   struct sk_buff  *(*gso_segment)(struct sk_buff *skb, int sg);
void                    *af_packet_priv;
struct list_head        list;
 };
@@ -690,7 +694,8 @@ extern int  dev_change_name(struct net_d
 extern int dev_set_mtu(struct net_device *, int);
 extern int dev_set_mac_address(struct net_device *,
struct sockaddr *);
-extern void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev);
+extern int dev_hard_start_xmit(struct sk_buff *skb,
+   struct net_device *dev);
 
 extern void dev_init(void);
 
@@ -964,6 +969,7 @@ extern int  netdev_max_backlog;
 extern int weight_p;
 extern int netdev_set_master(struct net_device *dev, struct net_device *master);
 extern int skb_checksum_help(struct sk_buff *skb, int inward);
+extern struct sk_buff *skb_gso_segment(struct sk_buff *skb, int sg);
 #ifdef CONFIG_BUG
 extern void netdev_rx_csum_fault(struct net_device *dev);
 #else
diff --git a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -116,6 +116,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * The list of packet types we will receive (as opposed to discard)
@@ -1048,7 +1049,7 @@ static inline void net_timestamp(struct 
  * taps currently in use.
  */
 
-void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
+static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 {
struct packet_type *ptype;
 
@@ -1186,6 +1187,40 @@ out: 
return ret;
 }
 
+/**
+ * skb_gso_segment - Perform segmentation on skb.
+ * @skb: buffer to segment
+ * @sg: whether scatter-gather is supported on the target.
+ *
+ * This function segments the given skb and returns a list of segments.
+ */
+struct sk_buff *skb_gso_segment(struct sk_buff *skb, int sg)
+{
+   struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);
+   struct packet_type *ptype;
+   int type = skb->protocol;
+
+   BUG_ON(skb_shinfo(skb)->frag_list);
+   BUG_ON(skb->ip_summed != CHECKSUM_HW);
+
+   skb->mac.raw = skb->data;
+   skb->mac_len = skb->nh.raw - skb->data;
+   __skb_pull(skb, skb->mac_len);
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & 15], list) {
+   if (ptype->type == type && !ptype->dev && ptype->gso_segment) {
+   segs = ptype->gso_segment(skb, sg);
+   break;
+   }
+   }
+   rcu_read_unlock();
+
+   return segs;
+}
+
+EXPORT_SYMBOL(skb_gso_segment);
+
 /* Take action when hardware reception checksum errors are detected. */
 #ifdef CONFIG_BUG
 void netdev_rx_csum_fault(struct net_device *dev)
@@ -1222,6 +1257,85 @@ static inline int illegal_highdma(struct
 #define illegal_highdma(dev, skb)  (0)
 #endif
 
+struct dev_gso_cb {
+   void (*destructor)(struct sk_buff *skb);
+};
+
+#define DEV_GSO_CB(skb) ((struct dev_gso_cb *)(skb)->cb)
+
+static void dev_gso_skb_destructor(struct sk_buff *skb)
+{
+   struct dev_gso_cb *cb;
+
+   do {
+   struct sk_buff *nskb = skb->next;
+
+   skb->next = nskb->next;
+   nskb->next = NULL;
+   kfree_skb(nskb);
+   } while (skb->next);
+
+   cb = DEV_GSO_CB(skb);
+   if (cb->destructor)
+   cb->destructor(skb);
+}
+
+/**
+ * dev_gso_segment 

Re: [0/5] GSO: Generic Segmentation Offload

2006-06-20 Thread Herbert Xu
On Tue, Jun 20, 2006 at 07:09:19PM +1000, herbert wrote:
>
> I've attached some numbers to demonstrate the savings brought on by
> doing this.  The best scenario is obviously the case where the underlying
> NIC supports SG.  This means that we simply have to manipulate the SG
> entries and place them into individual skb's before passing them to the
> driver.  The attached file lo-res shows this.

Obviously I forgot to attach them :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
$ sudo ./ethtool -K lo gso on
$ sudo ifconfig lo mtu 1500
$ netperf -t TCP_STREAM
TCP STREAM TEST to localhost
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3598.17
$ sudo ./ethtool -K lo gso off
$ netperf -t TCP_STREAM
TCP STREAM TEST to localhost
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3061.05
$ sudo ifconfig lo mtu 6
$ netperf -t TCP_STREAM
TCP STREAM TEST to localhost
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    8245.05
$ sudo ./ethtool -K lo gso on
$ netperf -t TCP_STREAM
TCP STREAM TEST to localhost
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    8563.36
$ sudo ifconfig lo mtu 16436
$ netperf -t TCP_STREAM
TCP STREAM TEST to localhost
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    7359.95
$ sudo ./ethtool -K lo gso off
$ netperf -t TCP_STREAM
TCP STREAM TEST to localhost
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    7535.04
$
CPU: PIII, speed 1200 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit 
mask of 0x00 (No unit mask) count 10
samples  %        symbol name
1247     21.7551  csum_partial_copy_generic
294       5.1291  prep_new_page
240       4.1870  __alloc_skb
120       2.0935  tcp_sendmsg
113       1.9714  get_offset_pmtmr
113       1.9714  kfree
103       1.7969  skb_release_data
103       1.7969  timer_interrupt
101       1.7620  ip_queue_xmit
96        1.6748  skb_clone
94        1.6399  __kmalloc
94        1.6399  net_rx_action
86        1.5003  tcp_transmit_skb
80        1.3957  kmem_cache_free
76        1.3259  tcp_clean_rtx_queue
67        1.1689  ip_output
66        1.1514  mark_offset_pmtmr
65        1.1340  tcp_v4_rcv
64        1.1165  local_bh_enable
62        1.0816  kmem_cache_alloc
59        1.0293  irq_entries_start
59        1.0293  page_fault
57        0.9944  tcp_push_one
52        0.9072  kfree_skbmem
47        0.8200  __qdisc_run
47        0.8200  csum_partial
47        0.8200  netif_receive_skb
46        0.8025  __kfree_skb
46        0.8025  tcp_init_tso_segs
44        0.7676  __copy_to_user_ll
44        0.7676  dev_queue_xmit
39        0.6804  pfifo_fast_enqueue
39        0.6804  system_call
37        0.6455  __copy_from_user_ll
37        0.6455  ip_rcv
36        0.6281  __tcp_select_window
33        0.5757  sock_wfree
31        0.5408  __do_softirq
31        0.5408  tcp_v4_send_check
30        0.5234  eth_header
28        0.4885  tcp_rcv_established
27        0.4710  restore_nocheck
26        0.4536  pfifo_fast_dequeue
25        0.4361  __do_IRQ
25        0.4361  do_softirq
25        0.4361  tcp_build_and_update_options
25        0.4361  tcp_snd_test
23        0.4013  cache_alloc_refill
23        0.4013  handle_IRQ_event
23        0.4013  tcp_ack
22        0.3838  free_block
22        0.3838  ip_route_input
21        0.3664  __netif_rx_schedule
21        0.3664  schedule
20        0.3489  do_wp_page
20        0.3489  neigh_resolve_output
19        0.3315  do_IRQ
19        0.3315  do_page_fault
19        0.3315  do_select
19        0.3315  fget_light
19        0.3315  ip_local_deliver
18        0.3140  __tcp_push_pending_frames
18        0.3140  end_level_ioapic_irq
17        0.2966  cpu_idle
17        0.2966  delay_pmtmr
17        0.2966  tcp_select_window
16        0.2791  add_wait_queue
16        0.2791  rt_hash_code
16        0.2791  tcp_set_skb_tso_segs
15        0.2617  find_vma
15        0.2617  irq_exit
15        0.2617  update_send_head
14        0.2442  __switch_to
13        0.2268  __skb_checksum_complete
13        0.2268  common_interrupt
13        0.2268  dev_kfree_skb_any
13        0.2268  tcp_event_data_sent
13  

[4/5] [NET]: Added GSO toggle

2006-06-20 Thread Herbert Xu
Hi:

[NET]: Added GSO toggle

This patch adds a generic segmentation offload toggle that can be turned
on/off for each net device.  For now it only supports TCPv4.
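
A simplified userspace model of the toggle and of how the route capabilities are derived from it, following the sk_setup_caps() hunk in the patch below.  The names are hypothetical, the flag values other than 2048 for GSO are placeholders, and the single no_largesend argument stands in for the two conditions the real code tests.

```c
/* Placeholder flag values, except GSO which uses the value in the patch. */
#define MODEL_F_SG	(1UL << 0)
#define MODEL_F_HW_CSUM	(1UL << 3)
#define MODEL_F_GSO	2048UL
#define MODEL_F_TSO	(1UL << 16)

/* Turn the GSO feature bit on or off, as the ethtool toggle does. */
static unsigned long model_set_gso(unsigned long features, int enable)
{
	return enable ? (features | MODEL_F_GSO) : (features & ~MODEL_F_GSO);
}

/* Capabilities a socket may assume for a route over this device. */
static unsigned long model_route_caps(unsigned long dev_features,
				      int no_largesend)
{
	unsigned long caps = dev_features;

	if (caps & MODEL_F_GSO)
		caps |= MODEL_F_TSO;	/* software GSO lets TCP build super-packets */
	if (caps & MODEL_F_TSO) {
		if (no_largesend)
			caps &= ~MODEL_F_TSO;
		else
			caps |= MODEL_F_SG | MODEL_F_HW_CSUM;
	}
	return caps;
}
```

So in this model, enabling GSO on a device with no offload features at all still makes the socket see TSO, SG and hardware checksum capabilities, because the segmentation step can make up for what the device lacks.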

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -408,6 +408,8 @@ struct ethtool_ops {
 #define ETHTOOL_GPERMADDR  0x0020 /* Get permanent hardware address */
 #define ETHTOOL_GUFO   0x0021 /* Get UFO enable (ethtool_value) */
 #define ETHTOOL_SUFO   0x0022 /* Set UFO enable (ethtool_value) */
+#define ETHTOOL_GGSO   0x0023 /* Get GSO enable (ethtool_value) */
+#define ETHTOOL_SGSO   0x0024 /* Set GSO enable (ethtool_value) */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -309,6 +309,7 @@ struct net_device
 #define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */
 #define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */
 #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
+#define NETIF_F_GSO 2048 /* Enable software GSO. */
 #define NETIF_F_LLTX   4096/* LockLess TX */
 
/* Segmentation offload features */
diff --git a/include/net/sock.h b/include/net/sock.h
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1031,9 +1031,13 @@ static inline void sk_setup_caps(struct 
 {
__sk_dst_set(sk, dst);
sk->sk_route_caps = dst->dev->features;
+   if (sk->sk_route_caps & NETIF_F_GSO)
+   sk->sk_route_caps |= NETIF_F_TSO;
if (sk->sk_route_caps & NETIF_F_TSO) {
if (sock_flag(sk, SOCK_NO_LARGESEND) || dst->header_len)
sk->sk_route_caps &= ~NETIF_F_TSO;
+   else 
+   sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
}
 }
 
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -376,15 +376,20 @@ void br_features_recompute(struct net_br
features = br->feature_mask & ~NETIF_F_ALL_CSUM;
 
list_for_each_entry(p, &br->port_list, list) {
-   if (checksum & NETIF_F_NO_CSUM &&
-   !(p->dev->features & NETIF_F_NO_CSUM))
+   unsigned long feature = p->dev->features;
+
+   if (checksum & NETIF_F_NO_CSUM && !(feature & NETIF_F_NO_CSUM))
checksum ^= NETIF_F_NO_CSUM | NETIF_F_HW_CSUM;
-   if (checksum & NETIF_F_HW_CSUM &&
-   !(p->dev->features & NETIF_F_HW_CSUM))
+   if (checksum & NETIF_F_HW_CSUM && !(feature & NETIF_F_HW_CSUM))
checksum ^= NETIF_F_HW_CSUM | NETIF_F_IP_CSUM;
-   if (!(p->dev->features & NETIF_F_IP_CSUM))
+   if (!(feature & NETIF_F_IP_CSUM))
checksum = 0;
-   features &= p->dev->features;
+
+   if (feature & NETIF_F_GSO)
+   feature |= NETIF_F_TSO;
+   feature |= NETIF_F_GSO;
+
+   features &= feature;
}
 
br->dev->features = features | checksum | NETIF_F_LLTX;
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -614,6 +614,29 @@ static int ethtool_set_ufo(struct net_de
return dev->ethtool_ops->set_ufo(dev, edata.data);
 }
 
+static int ethtool_get_gso(struct net_device *dev, char __user *useraddr)
+{
+   struct ethtool_value edata = { ETHTOOL_GGSO };
+
+   edata.data = dev->features & NETIF_F_GSO;
+   if (copy_to_user(useraddr, &edata, sizeof(edata)))
+return -EFAULT;
+   return 0;
+}
+
+static int ethtool_set_gso(struct net_device *dev, char __user *useraddr)
+{
+   struct ethtool_value edata;
+
+   if (copy_from_user(&edata, useraddr, sizeof(edata)))
+   return -EFAULT;
+   if (edata.data)
+   dev->features |= NETIF_F_GSO;
+   else
+   dev->features &= ~NETIF_F_GSO;
+   return 0;
+}
+
 static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
 {
struct ethtool_test test;
@@ -905,6 +928,12 @@ int dev_ethtool(struct ifreq *ifr)
case ETHTOOL_SUFO:
rc = ethtool_set_ufo(dev, useraddr);
break;
+   case ETHTOOL_GGSO:
+   rc = ethtool_get_gso(dev, useraddr);
+   break;
+   case ETHTOOL_SGSO:
+   rc = ethtool_set_gso(dev, useraddr);
+   break;
default:
rc =  -EOPNOT

[3/5] [NET]: Add software TSOv4

2006-06-20 Thread Herbert Xu
Hi:

[NET]: Add software TSOv4

This patch adds the GSO implementation for IPv4 TCP.
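
As a rough illustration of what segmentation means here, the sketch below
splits a flat buffer into MSS-sized pieces, each prefixed with a copy of
the original header.  It is a user-space approximation under stated
assumptions: ex_segment and its parameters are hypothetical, there is no
scatter-gather path, and no header fix-up (sequence numbers, lengths,
checksums) is attempted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split pkt (hdr_len header bytes followed by payload) into segments of at
 * most mss payload bytes.  Each segment is written to out as a copy of the
 * header followed by its slice of the payload; seg_lens[i] receives the
 * total length of segment i.
 *
 * Returns the number of segments, or 0 if nothing was produced (no payload,
 * or out / seg_lens too small).
 */
static size_t ex_segment(const unsigned char *pkt, size_t pkt_len,
                         size_t hdr_len, size_t mss,
                         unsigned char *out, size_t out_cap,
                         size_t *seg_lens, size_t max_segs)
{
    size_t offset = hdr_len;  /* next payload byte to copy */
    size_t used = 0;          /* bytes written to out so far */
    size_t n = 0;

    assert(mss > 0 && hdr_len <= pkt_len);

    while (offset < pkt_len) {
        size_t len = pkt_len - offset;

        if (len > mss)
            len = mss;
        if (n == max_segs || used + hdr_len + len > out_cap)
            return 0;

        memcpy(out + used, pkt, hdr_len);                 /* header copy */
        memcpy(out + used + hdr_len, pkt + offset, len);  /* payload slice */
        seg_lens[n++] = hdr_len + len;
        used += hdr_len + len;
        offset += len;
    }
    return n;
}
```

With a 4-byte header, 10 bytes of payload and an MSS of 4, this yields
three segments carrying 4, 4 and 2 payload bytes.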

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1299,6 +1299,7 @@ extern void  skb_split(struct sk_b
 struct sk_buff *skb1, const u32 len);
 
 extern void   skb_release_data(struct sk_buff *skb);
+extern struct sk_buff *skb_segment(struct sk_buff *skb, int sg);
 
 static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
   int len, void *buffer)
diff --git a/include/net/protocol.h b/include/net/protocol.h
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -37,6 +37,7 @@
 struct net_protocol {
int (*handler)(struct sk_buff *skb);
void(*err_handler)(struct sk_buff *skb, u32 info);
+   struct sk_buff *(*gso_segment)(struct sk_buff *skb, int sg);
int no_policy;
 };
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1087,6 +1087,8 @@ extern struct request_sock_ops tcp_reque
 
 extern int tcp_v4_destroy_sock(struct sock *sk);
 
+extern struct sk_buff *tcp_tso_segment(struct sk_buff *skb, int sg);
+
 #ifdef CONFIG_PROC_FS
 extern int  tcp4_proc_init(void);
 extern void tcp4_proc_exit(void);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1826,6 +1826,132 @@ unsigned char *skb_pull_rcsum(struct sk_
 
 EXPORT_SYMBOL_GPL(skb_pull_rcsum);
 
+/**
+ * skb_segment - Perform protocol segmentation on skb.
+ * @skb: buffer to segment
+ * @sg: whether scatter-gather can be used for generated segments
+ *
+ * This function performs segmentation on the given skb.  It returns
+ * the segment at the given position.  It returns NULL if there are
+ * no more segments to generate, or when an error is encountered.
+ */
+struct sk_buff *skb_segment(struct sk_buff *skb, int sg)
+{
+   struct sk_buff *segs = NULL;
+   struct sk_buff *tail = NULL;
+   unsigned int mss = skb_shinfo(skb)->gso_size;
+   unsigned int doffset = skb->data - skb->mac.raw;
+   unsigned int offset = doffset;
+   unsigned int headroom;
+   unsigned int len;
+   int nfrags = skb_shinfo(skb)->nr_frags;
+   int err = -ENOMEM;
+   int i = 0;
+   int pos;
+
+   __skb_push(skb, doffset);
+   headroom = skb_headroom(skb);
+   pos = skb_headlen(skb);
+
+   do {
+   struct sk_buff *nskb;
+   skb_frag_t *frag;
+   int hsize, nsize;
+   int k;
+   int size;
+
+   len = skb->len - offset;
+   if (len > mss)
+   len = mss;
+
+   hsize = skb_headlen(skb) - offset;
+   if (hsize < 0)
+   hsize = 0;
+   nsize = hsize + doffset;
+   if (nsize > len + doffset || !sg)
+   nsize = len + doffset;
+
+   nskb = alloc_skb(nsize + headroom, GFP_ATOMIC);
+   if (unlikely(!nskb))
+   goto err;
+
+   if (segs)
+   tail->next = nskb;
+   else
+   segs = nskb;
+   tail = nskb;
+
+   nskb->dev = skb->dev;
+   nskb->priority = skb->priority;
+   nskb->protocol = skb->protocol;
+   nskb->dst = dst_clone(skb->dst);
+   memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
+   nskb->pkt_type = skb->pkt_type;
+   nskb->mac_len = skb->mac_len;
+
+   skb_reserve(nskb, headroom);
+   nskb->mac.raw = nskb->data;
+   nskb->nh.raw = nskb->data + skb->mac_len;
+   nskb->h.raw = nskb->nh.raw + (skb->h.raw - skb->nh.raw);
+   memcpy(skb_put(nskb, doffset), skb->data, doffset);
+
+   if (!sg) {
+   nskb->csum = skb_copy_and_csum_bits(skb, offset,
+   skb_put(nskb, len),
+   len, 0);
+   continue;
+   }
+
+   frag = skb_shinfo(nskb)->frags;
+   k = 0;
+
+   nskb->ip_summed = CHECKSUM_HW;
+   nskb->csum = skb->csum;
+   memcpy(skb_put(nskb, hsize), skb->data + offset, hsize);
+
+   while (pos < offset + len) {
+   BUG_ON(i >= nfrags);
+
+   *frag = skb_shinfo(skb)->frags[i];
+ 

[5/5] [IPSEC]: Handle GSO packets

2006-06-20 Thread Herbert Xu
Hi:

[IPSEC]: Handle GSO packets

This patch segments GSO packets received by the IPsec stack.  This can
happen when a NIC driver injects GSO packets into the stack which are
then forwarded to another host.

The primary application of this is going to be Xen where its backend
driver may inject GSO packets into dom0.

Of course this can also be used by other virtualisation schemes such as
VMware or UML since the tap device could be modified to inject GSO packets
received through splice.
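
The per-segment handling in the diff below follows a common pattern: detach
each segment from the list, hand it to the output routine, and on the first
error free whatever has not been sent yet.  A user-space sketch of that
pattern, assuming a hypothetical struct ex_seg and a callback that consumes
the segment it is given (ex_demo_output and ex_make_list exist only to
exercise it):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a chain of segmented skbs. */
struct ex_seg {
    struct ex_seg *next;
    int id;
};

/* Output callback: takes ownership of seg, returns 0 or a negative error. */
typedef int (*ex_output_fn)(struct ex_seg *seg);

/*
 * Send every segment in the list.  On the first failure, free the segments
 * that were never handed to output() and return the error.  *freed counts
 * the segments released that way.
 */
static int ex_output_segs(struct ex_seg *segs, ex_output_fn output, int *freed)
{
    while (segs) {
        struct ex_seg *next = segs->next;
        int err;

        segs->next = NULL;        /* detach before handing over */
        err = output(segs);
        if (err) {
            while ((segs = next)) {
                next = segs->next;
                free(segs);
                (*freed)++;
            }
            return err;
        }
        segs = next;
    }
    return 0;
}

/* Demo callback: fails on the segment whose id equals ex_fail_id. */
static int ex_fail_id = -1;

static int ex_demo_output(struct ex_seg *seg)
{
    int err = (seg->id == ex_fail_id) ? -1 : 0;

    free(seg);                    /* consumed either way */
    return err;
}

/* Build a list of n segments with ids 0..n-1. */
static struct ex_seg *ex_make_list(int n)
{
    struct ex_seg *head = NULL, **link = &head;

    for (int i = 0; i < n; i++) {
        struct ex_seg *s = malloc(sizeof(*s));

        assert(s != NULL);
        s->next = NULL;
        s->id = i;
        *link = s;
        link = &s->next;
    }
    return head;
}
```

Detaching each segment before the call matters because the output routine
may queue or free it, so the caller must not follow its next pointer
afterwards.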

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
diff --git a/net/ipv4/xfrm4_output.c b/net/ipv4/xfrm4_output.c
--- a/net/ipv4/xfrm4_output.c
+++ b/net/ipv4/xfrm4_output.c
@@ -9,6 +9,8 @@
  */
 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -97,16 +99,10 @@ error_nolock:
goto out_exit;
 }
 
-static int xfrm4_output_finish(struct sk_buff *skb)
+static int xfrm4_output_finish2(struct sk_buff *skb)
 {
int err;
 
-#ifdef CONFIG_NETFILTER
-   if (!skb->dst->xfrm) {
-   IPCB(skb)->flags |= IPSKB_REROUTED;
-   return dst_output(skb);
-   }
-#endif
while (likely((err = xfrm4_output_one(skb)) == 0)) {
nf_reset(skb);
 
@@ -119,7 +115,7 @@ static int xfrm4_output_finish(struct sk
return dst_output(skb);
 
err = nf_hook(PF_INET, NF_IP_POST_ROUTING, &skb, NULL,
- skb->dst->dev, xfrm4_output_finish);
+ skb->dst->dev, xfrm4_output_finish2);
if (unlikely(err != 1))
break;
}
@@ -127,6 +123,48 @@ static int xfrm4_output_finish(struct sk
return err;
 }
 
+static int xfrm4_output_finish(struct sk_buff *skb)
+{
+   struct sk_buff *segs;
+
+#ifdef CONFIG_NETFILTER
+   if (!skb->dst->xfrm) {
+   IPCB(skb)->flags |= IPSKB_REROUTED;
+   return dst_output(skb);
+   }
+#endif
+
+   if (!skb_shinfo(skb)->gso_size)
+   return xfrm4_output_finish2(skb);
+
+   skb->protocol = htons(ETH_P_IP);
+   segs = skb_gso_segment(skb, 0);
+   kfree_skb(skb);
+   if (unlikely(IS_ERR(segs)))
+   return PTR_ERR(segs);
+
+   do {
+   struct sk_buff *nskb = segs->next;
+   int err;
+
+   segs->next = NULL;
+   err = xfrm4_output_finish2(segs);
+
+   if (unlikely(err)) {
+   while ((segs = nskb)) {
+   nskb = segs->next;
+   segs->next = NULL;
+   kfree_skb(segs);
+   }
+   return err;
+   }
+
+   segs = nskb;
+   } while (segs);
+
+   return 0;
+}
+
 int xfrm4_output(struct sk_buff *skb)
 {
return NF_HOOK_COND(PF_INET, NF_IP_POST_ROUTING, skb, NULL, skb->dst->dev,
diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c
--- a/net/ipv6/xfrm6_output.c
+++ b/net/ipv6/xfrm6_output.c
@@ -94,7 +94,7 @@ error_nolock:
goto out_exit;
 }
 
-static int xfrm6_output_finish(struct sk_buff *skb)
+static int xfrm6_output_finish2(struct sk_buff *skb)
 {
int err;
 
@@ -110,7 +110,7 @@ static int xfrm6_output_finish(struct sk
return dst_output(skb);
 
err = nf_hook(PF_INET6, NF_IP6_POST_ROUTING, &skb, NULL,
- skb->dst->dev, xfrm6_output_finish);
+ skb->dst->dev, xfrm6_output_finish2);
if (unlikely(err != 1))
break;
}
@@ -118,6 +118,41 @@ static int xfrm6_output_finish(struct sk
return err;
 }
 
+static int xfrm6_output_finish(struct sk_buff *skb)
+{
+   struct sk_buff *segs;
+
+   if (!skb_shinfo(skb)->gso_size)
+   return xfrm6_output_finish2(skb);
+
+   skb->protocol = htons(ETH_P_IPV6);
+   segs = skb_gso_segment(skb, 0);
+   kfree_skb(skb);
+   if (unlikely(IS_ERR(segs)))
+   return PTR_ERR(segs);
+
+   do {
+   struct sk_buff *nskb = segs->next;
+   int err;
+
+   segs->next = NULL;
+   err = xfrm6_output_finish2(segs);
+
+   if (unlikely(err)) {
+   while ((segs = nskb)) {
+   nskb = segs->next;
+   segs->next = NULL;
+   kfree_skb(segs);
+   }
+   return err;
+   }
+
+   segs = nskb;
+   } while (segs);
+
+   return 0;
+}
+
 int xfrm6_output(struct sk_buff *skb)
 {
return NF_HOOK(PF_INET6, NF_IP6_POST_ROUTING, skb, NULL, skb->dst->dev,


[PATCH 2.6.17] AT91RM9200 Ethernet #1: Link poll

2006-06-20 Thread Andrew Victor
For Ethernet PHYs that don't have an IRQ pin or boards that don't
connect the IRQ pin to the processor, we enable a timer to poll the
PHY's link state.
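
A small user-space sketch of the polling idea, assuming a tick counter in
place of jiffies and a plain flag in place of the PHY read.  struct ex_link
and both helpers are hypothetical; counter wrap-around is ignored.  The
"silent" argument plays the same role as in the patch: the periodic poll
updates the state without logging on every tick.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical link-state record for the sketch. */
struct ex_link {
    int carrier;              /* last known link state: 1 = up */
    unsigned long next_poll;  /* tick at which the next poll is due */
    unsigned long interval;   /* ticks between polls */
};

/* Record the PHY state; log only when not silent.  Returns 1 if logged. */
static int ex_update_link(struct ex_link *l, int phy_up, int silent)
{
    l->carrier = phy_up;
    if (silent)
        return 0;
    printf("link %s\n", phy_up ? "up" : "down");
    return 1;
}

/*
 * Timer body: if the poll is due, refresh the state silently and re-arm.
 * Returns 1 if a poll happened on this call, 0 otherwise.
 */
static int ex_check_link(struct ex_link *l, unsigned long now, int phy_up)
{
    assert(l->interval > 0);
    if (now < l->next_poll)
        return 0;
    ex_update_link(l, phy_up, 1);
    l->next_poll = now + l->interval;
    return 1;
}
```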

Patch originally supplied by Eric Benard and Roman Kolesnikov.


Signed-off-by: Andrew Victor <[EMAIL PROTECTED]>

diff -urN linux-2.6.17.orig/drivers/net/arm/at91_ether.c linux-2.6.17/drivers/net/arm/at91_ether.c
--- linux-2.6.17.orig/drivers/net/arm/at91_ether.c  Tue Jun 20 11:27:37 2006
+++ linux-2.6.17/drivers/net/arm/at91_ether.c   Tue Jun 20 11:31:06 2006
@@ -45,6 +45,9 @@
 static struct net_device *at91_dev;
 static struct clk *ether_clk;
 
+static struct timer_list check_timer;
+#define LINK_POLL_INTERVAL (HZ)
+
 /* . */
 
 /*
@@ -143,7 +146,7 @@
  * MAC accordingly.
  * If no link or auto-negotiation is busy, then no changes are made.
  */
-static void update_linkspeed(struct net_device *dev)
+static void update_linkspeed(struct net_device *dev, int silent)
 {
struct at91_private *lp = (struct at91_private *) dev->priv;
unsigned int bmsr, bmcr, lpa, mac_cfg;
@@ -151,7 +154,8 @@
 
if (!mii_link_ok(&lp->mii)) {   /* no link */
netif_carrier_off(dev);
-   printk(KERN_INFO "%s: Link down.\n", dev->name);
+   if (!silent)
+   printk(KERN_INFO "%s: Link down.\n", dev->name);
return;
}
 
@@ -186,7 +190,8 @@
}
at91_emac_write(AT91_EMAC_CFG, mac_cfg);
 
-   printk(KERN_INFO "%s: Link now %i-%s\n", dev->name, speed, (duplex == DUPLEX_FULL) ? "FullDuplex" : "HalfDuplex");
+   if (!silent)
+       printk(KERN_INFO "%s: Link now %i-%s\n", dev->name, speed, (duplex == DUPLEX_FULL) ? "FullDuplex" : "HalfDuplex");
netif_carrier_on(dev);
 }
 
@@ -226,7 +231,7 @@
goto done;
}
 
-   update_linkspeed(dev);
+   update_linkspeed(dev, 0);
 
 done:
disable_mdi();
@@ -243,14 +248,17 @@
unsigned int dsintr, irq_number;
int status;
 
-   if (lp->phy_type == MII_RTL8201_ID) /* RTL8201 does not have an interrupt */
-   return;
-   if (lp->phy_type == MII_DP83847_ID) /* DP83847 does not have an interrupt */
-   return;
-   if (lp->phy_type == MII_AC101L_ID)  /* AC101L interrupt not supported yet */
+   irq_number = lp->board_data.phy_irq_pin;
+   if (!irq_number) {
+   /*
+* PHY doesn't have an IRQ pin (RTL8201, DP83847, AC101L),
+* or board does not have it connected.
+*/
+   check_timer.expires = jiffies + LINK_POLL_INTERVAL;
+   add_timer(&check_timer);
return;
+   }
 
-   irq_number = lp->board_data.phy_irq_pin;
status = request_irq(irq_number, at91ether_phy_interrupt, 0, dev->name, dev);
if (status) {
printk(KERN_ERR "at91_ether: PHY IRQ %d request failed - status %d!\n", irq_number, status);
@@ -292,12 +300,11 @@
unsigned int dsintr;
unsigned int irq_number;
 
-   if (lp->phy_type == MII_RTL8201_ID) /* RTL8201 does not have an interrupt */
-   return;
-   if (lp->phy_type == MII_DP83847_ID) /* DP83847 does not have an interrupt */
-   return;
-   if (lp->phy_type == MII_AC101L_ID)  /* AC101L interrupt not supported yet */
+   irq_number = lp->board_data.phy_irq_pin;
+   if (!irq_number) {
+   del_timer_sync(&check_timer);
return;
+   }
 
spin_lock_irq(&lp->lock);
enable_mdi();
@@ -326,7 +333,6 @@
disable_mdi();
spin_unlock_irq(&lp->lock);
 
-   irq_number = lp->board_data.phy_irq_pin;
free_irq(irq_number, dev);  /* Free interrupt handler */
 }
 
@@ -355,6 +361,18 @@
 }
 #endif
 
+static void at91ether_check_link(unsigned long dev_id)
+{
+   struct net_device *dev = (struct net_device *) dev_id;
+
+   enable_mdi();
+   update_linkspeed(dev, 1);
+   disable_mdi();
+
+   check_timer.expires = jiffies + LINK_POLL_INTERVAL;
+   add_timer(&check_timer);
+}
+
 /* . ADDRESS MANAGEMENT  */
 
 /*
@@ -708,7 +727,7 @@
/* Determine current link speed */
spin_lock_irq(&lp->lock);
enable_mdi();
-   update_linkspeed(dev);
+   update_linkspeed(dev, 0);
disable_mdi();
spin_unlock_irq(&lp->lock);
 
@@ -992,11 +1011,18 @@
/* Determine current link speed */
spin_lock_irq(&lp->lock);
enable_mdi();
-   update_linkspeed(dev);
+   update_linkspeed(dev, 0);
disable_mdi();
spin_unlock_irq(&lp->lock);
netif_carrier_off(dev); /* will be enabled in open() */
 
+   /* If board has no PHY IRQ, use a timer to poll the PHY */
+   if (!lp->board_data.phy_irq

[PATCH 2.6.17] AT91RM9200 Ethernet #2: MII interface

2006-06-20 Thread Andrew Victor
Adds support for the MII ioctls via generic_mii_ioctl().
Patch from Brian Stafford.

Set the mii.phy_id to the detected PHY address, otherwise ethtool cannot
access PHYs other than 0.
Patch from Roman Kolesnikov.
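
To see why the masks matter, here is a sketch that assumes the generic MII
layer ANDs the caller-supplied PHY address and register number with
phy_id_mask and reg_num_mask before touching the bus.  That behaviour is an
assumption of this sketch, not something shown in the diff; struct ex_mii,
struct ex_req and ex_mask_request are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the MII interface state. */
struct ex_mii {
    int phy_id;        /* address of the detected PHY */
    int phy_id_mask;   /* valid bits of a PHY address */
    int reg_num_mask;  /* valid bits of a register number */
};

/* Hypothetical ioctl request: which PHY and which register to access. */
struct ex_req {
    int phy_id;
    int reg_num;
};

/* Clamp a request to the address and register bits the interface allows. */
static void ex_mask_request(const struct ex_mii *mii, struct ex_req *req)
{
    assert(mii != NULL && req != NULL);
    req->phy_id &= mii->phy_id_mask;
    req->reg_num &= mii->reg_num_mask;
}
```

With both masks left at zero every request collapses to PHY 0, register 0.
With 0x1f the five address bits and five register bits pass through
unchanged.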


Signed-off-by: Andrew Victor <[EMAIL PROTECTED]>


diff -urN linux-2.6.17-rmk.orig/drivers/net/arm/at91_ether.c linux-2.6.17-rmk/drivers/net/arm/at91_ether.c
--- linux-2.6.17-rmk.orig/drivers/net/arm/at91_ether.c  Tue Jun 20 11:03:13 2006
+++ linux-2.6.17-rmk/drivers/net/arm/at91_ether.c   Tue Jun 20 11:03:00 2006
@@ -660,6 +660,22 @@
.get_link   = ethtool_op_get_link,
 };
 
+static int at91ether_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
+{
+   struct at91_private *lp = (struct at91_private *) dev->priv;
+   int res;
+
+   if (!netif_running(dev))
+   return -EINVAL;
+
+   spin_lock_irq(&lp->lock);
+   enable_mdi();
+   res = generic_mii_ioctl(&lp->mii, if_mii(rq), cmd, NULL);
+   disable_mdi();
+   spin_unlock_irq(&lp->lock);
+
+   return res;
+}
 
 /*  MAC  */
 
@@ -963,6 +979,7 @@
dev->set_multicast_list = at91ether_set_rx_mode;
dev->set_mac_address = set_mac_address;
dev->ethtool_ops = &at91ether_ethtool_ops;
+   dev->do_ioctl = at91ether_ioctl;
 
SET_NETDEV_DEV(dev, &pdev->dev);
 
@@ -993,6 +1010,9 @@
lp->mii.dev = dev;  /* Support for ethtool */
lp->mii.mdio_read = mdio_read;
lp->mii.mdio_write = mdio_write;
+   lp->mii.phy_id = phy_address;
+   lp->mii.phy_id_mask = 0x1f;
+   lp->mii.reg_num_mask = 0x1f;
 
lp->phy_type = phy_type;/* Type of PHY connected */
lp->phy_address = phy_address;  /* MDI address of PHY */



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IOC3] IP27: Really set PCI64_ATTR_VIRTUAL, not PCI64_ATTR_PREC.

2006-06-20 Thread Ralf Baechle
On Tue, Jun 20, 2006 at 10:15:01AM +0200, Ingo Oeser wrote:
> From: Ingo Oeser <[EMAIL PROTECTED]>
> To:   Ralf Baechle <[EMAIL PROTECTED]>
> Subject: Re: [IOC3] IP27: Really set PCI64_ATTR_VIRTUAL, not PCI64_ATTR_PREC.
> Date: Tue, 20 Jun 2006 10:15:01 +0200
> Cc:   netdev@vger.kernel.org, Jeff Garzik <[EMAIL PROTECTED]>
> Content-Type: text/plain;
>   charset="iso-8859-1"
> 
> Hi Ralf,
> 
> Ralf Baechle :
> > IOC3's homegrown DMA mapping functions that are used to optimize things
> > a little on IP27 set the wrong bit.
> 
> What about using a symbol instead of magic numbers?
> That way one at least sees the intention of the coder.
>  
> > Signed-off-by: Ralf Baechle <[EMAIL PROTECTED]>
> > 
> > diff --git a/drivers/net/ioc3-eth.c b/drivers/net/ioc3-eth.c
> > index ae71ed5..e76e6e7 100644
> > --- a/drivers/net/ioc3-eth.c
> > +++ b/drivers/net/ioc3-eth.c
> > @@ -145,7 +145,7 @@ static inline struct sk_buff * ioc3_allo
> >  static inline unsigned long ioc3_map(void *ptr, unsigned long vdev)
> >  {
> >  #ifdef CONFIG_SGI_IP27
> > -   vdev <<= 58;   /* Shift to PCI64_ATTR_VIRTUAL */
> > +   vdev <<= 57;   /* Shift to PCI64_ATTR_VIRTUAL */
> 
> So please use a symbolic value here.

It is a hack and meant to look like one, loudly marked with #ifdef.
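
For reference, the one-line change amounts to moving the vdev value one bit
lower.  The sketch below only restates that arithmetic.  The EX_* names are
hypothetical, and the bit positions are inferred from the quoted diff and
the subject line (57 for "virtual", 58 for the bit that was set by
mistake), not taken from a header.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions inferred from the quoted one-line change. */
#define EX_ATTR_VIRTUAL_SHIFT 57
#define EX_ATTR_PREC_SHIFT    58

/* Place vdev at the given bit position of a 64-bit address word. */
static uint64_t ex_attr_bits(uint64_t vdev, unsigned shift)
{
    assert(shift < 64);
    return vdev << shift;
}
```

For vdev == 1 the old shift yields 0x0400000000000000 and the new one
0x0200000000000000.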

  Ralf


Re: rtl8150 usb driver, needs more vendor ids?

2006-06-20 Thread Petko Manolov

Hi Ben,

What you have sent me is a bit of a puzzle.

Looking at the device's details I can see it is not an RTL8150-based device,
but ADMtek's ADM8511.  Both vendor and device IDs have been listed in
pegasus.c for a long, long time.


Using rtl8150.c will not help at all since it talks to a different device. 
I suggest using pegasus.c ...



Petko



On Mon, 19 Jun 2006, Ben Greear wrote:


Someone put a wired usb-to-ethernet adapter into a system running
the 2.6.13.5 kernel.  The driver is evidently the rtl8150, and this
person sent me what appeared to be a modified version of the rtl8150
that is in the kernel.  The kernel driver does not appear to even attempt
to use this device.

Here is the output from /proc/bus/usb/devices:

T:  Bus=01 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#=  4 Spd=12  MxCh= 0
D:  Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=07a6 ProdID=8511 Rev= 1.01
S:  Manufacturer=ADMtek
S:  Product=USB To LAN Converter
S:  SerialNumber=0001
C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=160mA
I:  If#= 0 Alt= 0 #EPs= 3 Cls=00(>ifc ) Sub=00 Prot=00 Driver=(none)
E:  Ad=81(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=02(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=83(I) Atr=03(Int.) MxPS=   8 Ivl=1ms

The vendor driver (GPL) I received is available here:
http://www.candelatech.com/oss/RTL8150.C

When I get a chance later this week, I plan to add this vendor id
and see if it works.  Since the physical equipment is half a world
away, I'd welcome any conjecture as to whether simply adding the
device ID will work or not...

Thanks,
Ben

--
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc  http://www.candelatech.com





[PATCH 2.6.17] AT91RM9200 Ethernet #3: Cleanup

2006-06-20 Thread Andrew Victor
Moved global ether_clk variable into controller data structure.
Patch from David Brownell.

Davicom 9161 PHY was being incorrectly displayed as "9196".
Patch from Brian Stafford.

clk_get() doesn't return NULL on error, so the return value needs to be
tested with IS_ERR().
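
The reason a NULL test misses the failure can be shown with a user-space
imitation of the error-pointer convention.  The ex_* helpers below are
hypothetical re-implementations written for this sketch, not the kernel's
own macros, and EX_MAX_ERRNO is an assumed bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound on errno values encoded in a pointer. */
#define EX_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value near the top of the range. */
static void *ex_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Recover the errno from an error pointer. */
static long ex_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True if p falls in the range reserved for encoded errors. */
static int ex_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-EX_MAX_ERRNO;
}
```

An error encoded this way is never NULL, so "if (!clk)" lets it through,
while the range test catches it.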

Whitespace cleanup.


Signed-off-by: Andrew Victor <[EMAIL PROTECTED]>


diff -urN linux-2.6.17-rmk.orig/drivers/net/arm/at91_ether.c linux-2.6.17-rmk/drivers/net/arm/at91_ether.c
--- linux-2.6.17-rmk.orig/drivers/net/arm/at91_ether.c  Tue Jun 20 11:08:36 2006
+++ linux-2.6.17-rmk/drivers/net/arm/at91_ether.c   Tue Jun 20 11:11:41 2006
@@ -43,7 +43,6 @@
 #define DRV_VERSION"1.0"
 
 static struct net_device *at91_dev;
-static struct clk *ether_clk;
 
 static struct timer_list check_timer;
 #define LINK_POLL_INTERVAL (HZ)
@@ -519,7 +518,7 @@
hash_index |= (bitval << j);
}
 
-return hash_index;
+   return hash_index;
 }
 
 /*
@@ -575,10 +574,8 @@
at91_emac_write(AT91_EMAC_CFG, cfg);
 }
 
-
 /* . ETHTOOL SUPPORT ... */
 
-
 static int mdio_read(struct net_device *dev, int phy_id, int location)
 {
unsigned int value;
@@ -719,10 +716,10 @@
struct at91_private *lp = (struct at91_private *) dev->priv;
unsigned long ctl;
 
-if (!is_valid_ether_addr(dev->dev_addr))
-   return -EADDRNOTAVAIL;
+   if (!is_valid_ether_addr(dev->dev_addr))
+   return -EADDRNOTAVAIL;
 
-   clk_enable(ether_clk);  /* Re-enable Peripheral clock */
+   clk_enable(lp->ether_clk);  /* Re-enable Peripheral clock */
 
/* Clear internal statistics */
ctl = at91_emac_read(AT91_EMAC_CTL);
@@ -756,6 +753,7 @@
  */
 static int at91ether_close(struct net_device *dev)
 {
+   struct at91_private *lp = (struct at91_private *) dev->priv;
unsigned long ctl;
 
/* Disable Receiver and Transmitter */
@@ -772,7 +770,7 @@
 
netif_stop_queue(dev);
 
-   clk_disable(ether_clk); /* Disable Peripheral clock */
+   clk_disable(lp->ether_clk); /* Disable Peripheral clock */
 
return 0;
 }
@@ -904,7 +902,7 @@
if (intstatus & AT91_EMAC_RCOM) /* Receive complete */
at91ether_rx(dev);
 
-   if (intstatus & AT91_EMAC_TCOM) {   /* Transmit complete */
+   if (intstatus & AT91_EMAC_TCOM) {   /* Transmit complete */
/* The TCOM bit is set even if the transmission failed. */
if (intstatus & (AT91_EMAC_TUND | AT91_EMAC_RTRY))
lp->stats.tx_errors += 1;
@@ -933,7 +931,8 @@
 /*
  * Initialize the ethernet interface
  */
-static int __init at91ether_setup(unsigned long phy_type, unsigned short phy_address, struct platform_device *pdev)
+static int __init at91ether_setup(unsigned long phy_type, unsigned short phy_address,
+   struct platform_device *pdev, struct clk *ether_clk)
 {
struct at91_eth_data *board_data = pdev->dev.platform_data;
struct net_device *dev;
@@ -967,6 +966,7 @@
return -ENOMEM;
}
lp->board_data = *board_data;
+   lp->ether_clk = ether_clk;
platform_set_drvdata(pdev, dev);
 
spin_lock_init(&lp->lock);
@@ -1050,7 +1050,7 @@
dev->dev_addr[0], dev->dev_addr[1], dev->dev_addr[2],
dev->dev_addr[3], dev->dev_addr[4], dev->dev_addr[5]);
if ((phy_type == MII_DM9161_ID) || (lp->phy_type == MII_DM9161A_ID))
-   printk(KERN_INFO "%s: Davicom 9196 PHY %s\n", dev->name, (lp->phy_media == PORT_FIBRE) ? "(Fiber)" : "(Copper)");
+   printk(KERN_INFO "%s: Davicom 9161 PHY %s\n", dev->name, (lp->phy_media == PORT_FIBRE) ? "(Fiber)" : "(Copper)");
else if (phy_type == MII_LXT971A_ID)
printk(KERN_INFO "%s: Intel LXT971A PHY\n", dev->name);
else if (phy_type == MII_RTL8201_ID)
@@ -1076,9 +1076,10 @@
int detected = -1;
unsigned long phy_id;
unsigned short phy_address = 0;
+   struct clk *ether_clk;
 
ether_clk = clk_get(&pdev->dev, "ether_clk");
-   if (!ether_clk) {
+   if (IS_ERR(ether_clk)) {
printk(KERN_ERR "at91_ether: no clock defined\n");
return -ENODEV;
}
@@ -1101,7 +1102,7 @@
case MII_DP83847_ID: /* National Semiconductor DP83847:  */
case MII_AC101L_ID: /* Altima AC101L: PHY_ID1 = 0x22, PHY_ID2 = 0x5520 */
case MII_KS8721_ID: /* Micrel KS8721: PHY_ID1 = 0x22, PHY_ID2 = 0x1610 */
-   detected = at91ether_setup(phy_id, phy_address, pdev);
+   detected = at91ether_setup(phy_id, phy_address, pdev, ether_clk);
break;
}
 
@@ -1120,7 +112

[PATCH 2.6.17] AT91RM9200 Ethernet #4: Suspend/Resume

2006-06-20 Thread Andrew Victor
Adds power-management (suspend/resume) support to the AT91RM9200
Ethernet driver.
Patch from David Brownell.
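
The resume path in the diff below undoes the suspend steps in reverse
order.  A user-space sketch that records the order, assuming hypothetical
step names (enum ex_step) in place of the real calls:

```c
#include <assert.h>

/* Hypothetical labels for the calls made in the suspend/resume hooks. */
enum ex_step {
    EX_IRQ_OFF, EX_QUEUE_STOP, EX_DETACH, EX_CLK_OFF,
    EX_CLK_ON, EX_ATTACH, EX_QUEUE_START, EX_IRQ_ON
};

#define EX_MAX_STEPS 8

struct ex_trace {
    enum ex_step steps[EX_MAX_STEPS];
    int n;
};

static void ex_record(struct ex_trace *t, enum ex_step s)
{
    assert(t->n < EX_MAX_STEPS);
    t->steps[t->n++] = s;
}

/* Suspend: quiesce the PHY IRQ first, the clock last. */
static void ex_suspend(struct ex_trace *t, int running, int phy_irq)
{
    if (!running)
        return;                    /* interface down: nothing to do */
    if (phy_irq)
        ex_record(t, EX_IRQ_OFF);
    ex_record(t, EX_QUEUE_STOP);
    ex_record(t, EX_DETACH);
    ex_record(t, EX_CLK_OFF);
}

/* Resume: the mirror image, clock first and PHY IRQ last. */
static void ex_resume(struct ex_trace *t, int running, int phy_irq)
{
    if (!running)
        return;
    ex_record(t, EX_CLK_ON);
    ex_record(t, EX_ATTACH);
    ex_record(t, EX_QUEUE_START);
    if (phy_irq)
        ex_record(t, EX_IRQ_ON);
}
```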


Signed-off-by: Andrew Victor <[EMAIL PROTECTED]>


diff -urN linux-2.6.17-rmk.orig/drivers/net/arm/at91_ether.c linux-2.6.17-rmk/drivers/net/arm/at91_ether.c
--- linux-2.6.17-rmk.orig/drivers/net/arm/at91_ether.c  Tue Jun 20 11:16:31 2006
+++ linux-2.6.17-rmk/drivers/net/arm/at91_ether.c   Tue Jun 20 11:18:45 2006
@@ -1128,10 +1128,54 @@
return 0;
 }
 
+#ifdef CONFIG_PM
+
+static int at91ether_suspend(struct platform_device *pdev, pm_message_t mesg)
+{
+   struct at91_private *lp = (struct at91_private *) at91_dev->priv;
+   struct net_device *net_dev = platform_get_drvdata(pdev);
+   int phy_irq = lp->board_data.phy_irq_pin;
+
+   if (netif_running(net_dev)) {
+   if (phy_irq)
+   disable_irq(phy_irq);
+
+   netif_stop_queue(net_dev);
+   netif_device_detach(net_dev);
+
+   clk_disable(lp->ether_clk);
+   }
+   return 0;
+}
+
+static int at91ether_resume(struct platform_device *pdev)
+{
+   struct at91_private *lp = (struct at91_private *) at91_dev->priv;
+   struct net_device *net_dev = platform_get_drvdata(pdev);
+   int phy_irq = lp->board_data.phy_irq_pin;
+
+   if (netif_running(net_dev)) {
+   clk_enable(lp->ether_clk);
+
+   netif_device_attach(net_dev);
+   netif_start_queue(net_dev);
+
+   if (phy_irq)
+   enable_irq(phy_irq);
+   }
+   return 0;
+}
+
+#else
+#define at91ether_suspend  NULL
+#define at91ether_resume   NULL
+#endif
+
 static struct platform_driver at91ether_driver = {
.probe  = at91ether_probe,
.remove = __devexit_p(at91ether_remove),
-   /* FIXME:  support suspend and resume */
+   .suspend= at91ether_suspend,
+   .resume = at91ether_resume,
.driver = {
.name   = DRV_NAME,
.owner  = THIS_MODULE,





Re: [0/5] GSO: Generic Segmentation Offload

2006-06-20 Thread David Miller
From: Herbert Xu <[EMAIL PROTECTED]>
Date: Tue, 20 Jun 2006 19:32:19 +1000

> On Tue, Jun 20, 2006 at 07:09:19PM +1000, herbert wrote:
> >
> > I've attached some numbers to demonstrate the savings brought on by
> > doing this.  The best scenario is obviously the case where the underlying
> > NIC supports SG.  This means that we simply have to manipulate the SG
> > entries and place them into individual skb's before passing them to the
> > driver.  The attached file lo-res shows this.
> 
> Obviously I forgot to attach them :)

:-)

The changes look good on first scan, I'll look more deeply and
meanwhile we'll let the patches ferment for a few days so others
can comment too :-)


Re: [Bugme-new] [Bug 6682] New: BUG: soft lockup detected on CPU#0! / ksoftirqd takse 100% CPU

2006-06-20 Thread Herbert Xu
On Mon, Jun 19, 2006 at 10:20:10PM +, Andrew Morton wrote:
>
> >  [] dev_queue_xmit+0xe0/0x203
> >  [] ip_output+0x1e1/0x237
> >  [] ip_forward+0x181/0x1df
> >  [] ip_rcv+0x40c/0x485
> >  [] netif_receive_skb+0x12f/0x165
> >  [] e1000_clean_rx_irq+0x389/0x410 [e1000]
> >  [] e1000_clean+0x94/0x12f [e1000]
> >  [] net_rx_action+0x69/0xf0
> >  [] __do_softirq+0x55/0xbd
> >  [] do_softirq+0x2d/0x31
> >  [] local_bh_enable+0x5a/0x65
> >  [] rt_run_flush+0x5f/0x80

Could you tell us the frequency of route updates on this machine?
Route updates are pretty expensive especially when a large number
of flows hits your machine right afterwards.

You can monitor this by running ip mon.  You might just be getting
bogus route updates causing unnecessary flushes to the routing cache.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [DOC]: generic netlink

2006-06-20 Thread jamal
On Mon, 2006-19-06 at 11:54 -0400, James Morris wrote:
> On Mon, 19 Jun 2006, jamal wrote:
> 
> > Other than TIPC, the two other users I have seen use it in this manner.
> > But, you are right if usage tends to lean in some other way we could get
> > rid of it (I think TIPC is a bad example).
> 
> Ok, perhaps make a note in the docs about this and keep an eye out when 
> new code is submitted, and encourage people not to do this.

Will do.

> Actually, what would help SELinux is the opposite, forcing everyone to use 
> separate commands and assigning security attributes to each one.  But 
> because TIPC is already multiplexing, it's not feasible.
> 

Then I would say they lose the fine level of granularity that would have
otherwise been provided to them. Unless you are saying that choice is
not for them to make?

> Instead, I think the way to go for SELinux is to have each nl family 
> provide a permission callback, so SELinux can pass the skb back to the nl 
> module which then returns a type of permission ('read', 'write', 
> 'readpriv').  This way, the nl module can create and manage its own 
> internal table of command permissions and also know exactly where in the 
> message to dig for the command specifier.
> 

makes sense.
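
A user-space sketch of the callback idea discussed above: the family owns a
table mapping its commands to a permission class, and the security side
only asks "what does this command need?".  All names here (enum ex_perm,
enum ex_cmd, ex_family_perm, ex_allowed) are hypothetical and the command
set is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Permission classes, after the 'read' / 'write' / 'readpriv' suggestion. */
enum ex_perm { EX_PERM_NONE, EX_PERM_READ, EX_PERM_WRITE, EX_PERM_READPRIV };

/* Invented commands of one hypothetical netlink family. */
enum ex_cmd { EX_CMD_GET = 1, EX_CMD_ADD, EX_CMD_DEL, EX_CMD_GET_SECRET };

struct ex_cmd_perm {
    int cmd;
    enum ex_perm perm;
};

/* The family's private table: it alone knows what each command needs. */
static const struct ex_cmd_perm ex_family_table[] = {
    { EX_CMD_GET,        EX_PERM_READ },
    { EX_CMD_ADD,        EX_PERM_WRITE },
    { EX_CMD_DEL,        EX_PERM_WRITE },
    { EX_CMD_GET_SECRET, EX_PERM_READPRIV },
};

/* Callback exported by the family: map a command to a permission class. */
static enum ex_perm ex_family_perm(int cmd)
{
    size_t n = sizeof(ex_family_table) / sizeof(ex_family_table[0]);

    for (size_t i = 0; i < n; i++)
        if (ex_family_table[i].cmd == cmd)
            return ex_family_table[i].perm;
    return EX_PERM_NONE;
}

typedef enum ex_perm (*ex_perm_fn)(int cmd);

/* Security side: granted is a bitmask of (1u << class) the caller holds. */
static int ex_allowed(ex_perm_fn cb, int cmd, unsigned granted)
{
    enum ex_perm need;

    assert(cb != NULL);
    need = cb(cmd);
    if (need == EX_PERM_NONE)
        return 0;                  /* unknown command: deny */
    return (granted & (1u << need)) != 0;
}
```

Because the table lives with the family, a family that multiplexes several
operations under one command can still answer per operation by parsing its
own message format inside the callback.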

> > My view: If you want to have ACLs against such commands then it becomes 
> > easier to say "can only do ADD but not DEL" for example (We need to 
> > resolve genl_rcv_msg() check on commands to be in sync with SELinux as 
> > was pointed by Thomas)
> 
> This already exists, to some extent, but only for some protocols. You can 
> see examples of existing permission tables managed by SELinux in:
>  security/selinux/nlmsgtab.c
> 
> The hope is to move this out of SELinux and into each nl module, which is much 
> more manageable and scalable.

agreed.

cheers,
jamal




Re: [DOC]: generic netlink

2006-06-20 Thread jamal
On Mon, 2006-19-06 at 11:58 -0400, Shailabh Nagar wrote: 
> jamal wrote:
[..]

> But I'm not too clear about what the advantages are of trying to limit the
> number of commands registered by a given exploiter of genetlink (say TIPC or
> taskstats), other than the conventional usage of netlink.
> 
> e.g. in the taskstats code, userspace needs to GET data on a per-pid and
> per-tgid basis from the kernel and supplies the specific pid or tgid. We
> could have registered two commands (say GET_PID and GET_TGID) and then the
> parsing of the supplied uint32 would be implicit in the command. But we went
> with the model where we have only one GET command and the type of the
> parameter is specified via netlink attributes.

The idea is to allow fine-grained access control (ACLs) over what a user
process can do (as managed by SELinux, not genetlink). As an example, even
in your case, you may want to allow user program "shailab1" to get
information on a groupid but not a pid. We should be able to add that
level of granularity easily since we have flags per command.

> In our case, it didn't matter, and since the type of data returned is very
> similar and so is the parameter supplied (pid/tgid), one GET suffices. But
> I'm wondering if userspace should consciously try and limit the commands or
> would it be better, from a performance standpoint, to permit a reasonably
> larger "fan-out" to happen at the genetlink command level (for each
> exploiter). I guess this introduces more overhead for in-kernel structures
> (the linked list of command structures that needs to be kept around) while
> saving time on doing a second level of parsing within the exploiter-defined
> function that services the GET command.
> 
> The "small" set model looks like a good compromise. Reducing the number of
> commands to one is not a good idea IMHO, for reasons similar to why ioctl
> type syscalls aren't encouraged... since the genetlink layer has code for
> demultiplexing anyway, might as well use it and avoid an extra level of
> indirection.
> 

indeed.
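
For comparison, the "one GET command, typed attribute" model from the quoted
text implies a second level of parsing inside the handler. The sketch below
is a simplified user-space illustration of that step, assuming a hypothetical
attribute header loosely modeled on netlink's length/type layout. The names
demo_attr_hdr, demo_build_get and demo_parse_get are invented for this
example and are not the kernel's attribute API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute types accepted by the single GET command. */
enum demo_attr_type {
	DEMO_ATTR_PID  = 1,
	DEMO_ATTR_TGID = 2,
};

/* Simplified attribute header: total length (header + payload), then type. */
struct demo_attr_hdr {
	uint16_t len;
	uint16_t type;
};

/* Encode one typed 32-bit id into buf (at least 8 bytes); returns bytes written. */
size_t demo_build_get(unsigned char *buf, uint16_t type, uint32_t id)
{
	struct demo_attr_hdr hdr = { (uint16_t)(sizeof(hdr) + sizeof(id)), type };

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + sizeof(hdr), &id, sizeof(id));
	return sizeof(hdr) + sizeof(id);
}

/*
 * Second-level parse done inside the single GET handler: the attribute type,
 * not the command, says whether the id is a pid or a tgid.
 * Returns 0 on success, -1 if the attribute is truncated or of unknown type.
 */
int demo_parse_get(const unsigned char *buf, size_t buflen,
		   uint16_t *type, uint32_t *id)
{
	struct demo_attr_hdr hdr;

	if (buflen < sizeof(hdr) + sizeof(*id))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.len != sizeof(hdr) + sizeof(*id))
		return -1;
	if (hdr.type != DEMO_ATTR_PID && hdr.type != DEMO_ATTR_TGID)
		return -1;
	memcpy(id, buf + sizeof(hdr), sizeof(*id));
	*type = hdr.type;
	return 0;
}
```

Registering two commands instead would move the type check out of the
handler and into the command demultiplexer, which is the trade-off the
quoted text weighs.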

cheers,
jamal



Re: [RFT] pcnet32 NAPI changes

2006-06-20 Thread Jon Mason
On Mon, Jun 19, 2006 at 04:49:33PM -0400, Lennart Sorensen wrote:
> On Mon, Jun 19, 2006 at 03:41:40PM -0500, Jon Mason wrote:
> > I believe it is preferred to be a compile option for non-gigabit
> > drivers, given that it will be eating a lot of cycles for infrequent
> > packets (especially for the 10Mb).  I believe there was a thread about
> > this last year when e100 was having NAPI problems.
> 
> How does NAPI eat cycles?  It goes back to interrupt mode when the queue
> is empty, and only on RX interrupt does it turn on polling again.

The amount of polls per received packet is very low, thus removing the
benefit of NAPI.  A compile time option would allow those users who know
better to DTRT.

> It is certainly possible that there are bugs in a NAPI conversion, which
> I guess could be a reason to have the option to stick with the old
> method, although then again not having the option ensures the bugs get
> found sooner.
> 
> > A general nit.  There are ALOT of magic numbers in the code, most
> > existing prior to this patch.  The driver would benefit from a little
> > clean-up.
> > 
> > Also nothing to do with this patch, but I noticed it when the code was
> > moved.  A comment about why the following is necessary might be nice:
> > lp->rx_ring[i].buf_length = le16_to_cpu(2 - PKT_BUF_SZ);
> 
> I suspect many drivers are in need of some cleanup.

Yup, but the "everyone else is doing it" argument never worked with my
parents. All it takes is one brave soul to determine the reasoning
behind the magic numbers and convert them into #define's.  Shouldn't be
more than one day's work.

> 
> Len Sorensen


Re: [PATCH 2/2] NET: Accurate packet scheduling for ATM/ADSL (userspace)

2006-06-20 Thread jamal

took lartc off the list because it doesn't allow me to post
and i refuse to subscribe.

On Mon, 2006-19-06 at 21:31 +0200, Jesper Dangaard Brouer wrote:
> 
> On Thu, 15 Jun 2006, jamal wrote:
> > It is probably doable by just looking at netdevice->type and figuring
> > the link layer technology. Totally in user space and building the
> > compensated for tables there before telling the kernel (advantage is no
> > kernel changes and therefore it would work with older kernels as well).
> 
> I think you have got the setup all wrong.
> 
> The linux middlebox/router has two ethernet interfaces, one of the 
> ethernet interfaces is connected to the ADSL modem.  Thus, the linux 
> ethernet card cannot determine that it is connected to an ADSL line.
> 

Actually you may be making my point for me.

Here's the standard setup as i understand it (at least in North America, I
know Europeans love their ATM with a little gravy on top):


|Linux| --ethernet-- |Modem| --DSL-- |DSLAM| --ATM-- |BRAS|


What this means is that Linux computes based on ethernet
headers. Somewhere downstream ATM (refer to above) comes in, and that
causes a mismatch between what Linux expects the bandwidth to be and what
you actually get from your service provider, who doesn't account for the
ATM overhead when they sell you "1.5Mbps".
Reminds me of hard disk vendors who define 1K to be 1000 to show
how large their drives are.
Yes, Linux can't tell if your service provider is lying to you.

> 
> The patch is the solution to the classical problem people
> have when trying to configure traffic control on an ADSL link:
>
> Q: The packet scheduling does not work all the time?
> A: Try to decrease the bandwidth.
> 
>
> The issue here is, that ATM does not have fixed overhead (due to alignment 
> and padding).  This means that a fixed reduction of the bandwidth is not 
> the solution.  We could reduce the bandwidth to the worst-case overhead, 
> which is 62%, I do not think that is a good solution...
> 

I dont see it as wrong to be honest with you. Your mileage may vary.
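
The reason the overhead in the quoted text is not fixed can be shown with a
little arithmetic. The sketch below assumes AAL5 framing (an 8-byte trailer,
then padding up to a whole number of 48-byte cell payloads, each sent as a
53-byte cell). The encap_overhead parameter is a hypothetical per-packet
constant standing in for whatever encapsulation the line uses; it is not
taken from the patch under discussion.

```c
#define ATM_CELL_SIZE     53u  /* bytes on the wire per ATM cell */
#define ATM_CELL_PAYLOAD  48u  /* payload bytes carried per cell */
#define AAL5_TRAILER       8u  /* AAL5 trailer added before padding */

/*
 * Bytes actually sent on the ATM link for one packet.
 * pkt_len:        packet length as seen by the scheduler
 * encap_overhead: hypothetical fixed encapsulation bytes per packet
 */
unsigned int atm_wire_len(unsigned int pkt_len, unsigned int encap_overhead)
{
	unsigned int pdu = pkt_len + encap_overhead + AAL5_TRAILER;
	/* Round up to whole cells; the last cell is padded. */
	unsigned int cells = (pdu + ATM_CELL_PAYLOAD - 1) / ATM_CELL_PAYLOAD;

	return cells * ATM_CELL_SIZE;
}
```

With zero encapsulation overhead, a 40-byte packet fits in one cell (53 bytes
on the wire) while a 41-byte packet needs two (106 bytes), so no single
percentage describes the overhead for all packet sizes.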

> With the patch, you can now simply configure HTB to use the rate that was 
> specified by the ISP.
> 


Dont have time to read your doc and dont get me wrong, there is a
"quark" practical problem: As practical as the hard disk manufacturer
who claims that they have 11G drive when it is 10G. It needs to be
resolved - but not in an intrusive way in my opinion.

cheers,
jamal





[PATCH] 2.6.17 missing a call to ieee80211softmac_capabilities from ieee80211softmac_assoc_req

2006-06-20 Thread Larry Finger
In commit ba9b28d19a3251bb1dfe6a6f8cc89b96fb85f683, routine ieee80211softmac_capabilities was added 
to net/ieee80211/softmac/ieee80211softmac_io.c. As denoted by its name, it completes the 
capabilities IE that is needed in the associate and reassociate requests sent to the AP. For at 
least one AP, the Linksys WRT54G V5, the capabilities field must set the 'short preamble' bit or the 
AP refuses to associate. In the commit noted above, there is a call to the new routine from 
ieee80211softmac_reassoc_req, but not from ieee80211softmac_assoc_req. This patch fixes that oversight.


As noted in the subject, v2.6.17 is affected. My bcm43xx card had been unable to associate since I 
was forced to buy a new AP. I finally was able to get a packet dump and traced the problem to the 
capabilities info. Although I had heard that a patch was "floating around", I had not seen it before 
2.6.17 was released. As this bug does not affect security and I seem to have the only AP affected by 
it, there should be no problem in leaving it for 2.6.18.


Signed-Off-By: Larry Finger <[EMAIL PROTECTED]>

index 0954161..8cc8b20 100644
--- a/net/ieee80211/softmac/ieee80211softmac_io.c
+++ b/net/ieee80211/softmac/ieee80211softmac_io.c
@@ -229,6 +229,9 @@ ieee80211softmac_assoc_req(struct ieee8
return 0;
ieee80211softmac_hdr_3addr(mac, &((*pkt)->header), IEEE80211_STYPE_ASSOC_REQ, 
net->bssid, net->bssid);

+   /* Fill in the capabilities */
+   (*pkt)->capability = ieee80211softmac_capabilities(mac, net);
+
/* Fill in Listen Interval (?) */
(*pkt)->listen_interval = cpu_to_le16(10);



Re: rtl8150 usb driver, needs more vendor ids?

2006-06-20 Thread Ben Greear

Petko Manolov wrote:

> Hi Ben,
>
> What you have sent me is a bit of a puzzle.
>
> Looking at the device's details i can see it is not RTL8150 based
> device, but ADMtek's ADM8511.  Both vendor and device IDs have been
> listed in pegasus.c for a long long time.
>
> Using rtl8150.c will not help at all since it talks to a different
> device. I suggest using pegasus.c ...


Ahhh, that would explain it.  The pegasus driver loads straight
away.

Thanks!
Ben

--
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc  http://www.candelatech.com



Re: [NET]: Prevent multiple qdisc runs

2006-06-20 Thread jamal
Herbert,
Thanks for your patience.

On Tue, 2006-20-06 at 08:33 +1000, Herbert Xu wrote:

> First of all you could receive an IRQ in between dropping xmit_lock
> and regaining the queue lock.  

Indeed you could. Sorry, I overlooked that in my earlier email. This
issue has been there forever though - and i dont mean to dilute its
existence by saying the chances of it happening are very very slim. I
claim though that you will be _unable to reproduce this in an
experimental setup_ i.e thats how complex it is. 

> Secondly we now have lockless drivers where this assumption also 
> does not hold.

Ok, forgot about lockless drivers.
The chances are certainly much higher with lockless drivers for a very
simple reason. We used to have a lock ordering that is now changed for
lockless drivers, i.e. we had:

1) grab qlock,
2) dq,
3) grab txlock,
4) release qlock,
5) transmit,
6) release txlock

This has changed to the new sequence #1,#2,#4,#3,#5,#6,
with that same replacement txlock at times also being used in the rx path
to guard the tx DMA.
A possible solution is to alias the tx lock to be dev->txlock
(DaveM had pointed out he didn't like this approach, I can't remember the
details.)
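
The difference between the two orderings can be made concrete with a small
model. This is only an illustration of the six steps listed above, not
kernel code: it replays a sequence of steps and reports whether there is a
point after the dequeue and before the transmit at which neither lock is
held. The enum and function names are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* The six steps from the list above, numbered as in the text. */
enum step {
	GRAB_QLOCK = 1,
	DEQUEUE,
	GRAB_TXLOCK,
	RELEASE_QLOCK,
	TRANSMIT,
	RELEASE_TXLOCK,
};

/* Old ordering: #1,#2,#3,#4,#5,#6 */
const enum step old_order[6] = {
	GRAB_QLOCK, DEQUEUE, GRAB_TXLOCK, RELEASE_QLOCK, TRANSMIT, RELEASE_TXLOCK,
};

/* Ordering described for lockless drivers: #1,#2,#4,#3,#5,#6 */
const enum step lockless_order[6] = {
	GRAB_QLOCK, DEQUEUE, RELEASE_QLOCK, GRAB_TXLOCK, TRANSMIT, RELEASE_TXLOCK,
};

/*
 * Returns true if, between dequeueing a packet and transmitting it, the
 * sequence passes through a state where neither lock is held.
 */
bool has_unlocked_window(const enum step *seq, size_t n)
{
	bool qlock = false, txlock = false, dequeued = false;
	size_t i;

	for (i = 0; i < n; i++) {
		switch (seq[i]) {
		case GRAB_QLOCK:     qlock = true;    break;
		case DEQUEUE:        dequeued = true; break;
		case GRAB_TXLOCK:    txlock = true;   break;
		case RELEASE_QLOCK:  qlock = false;   break;
		case RELEASE_TXLOCK: txlock = false;  break;
		case TRANSMIT:       return false;    /* reached transmit safely */
		}
		if (dequeued && !qlock && !txlock)
			return true;
	}
	return false;
}
```

In this model the old ordering always holds at least one lock between
dequeue and transmit, while the reordered sequence has a gap after step #4
and before step #3.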

Here's where i am coming from (you may have suspected it already):
My concern is that i am not sure what the performance implications of
this change are (yes, there goes that soup^Wperformance nazi again) or what
the impact is on how good the qos granularity is[1].
If it is to make lock-less drivers happy, then someone oughta validate
if this performance benefit that lockless drivers give still exists. I
almost feel like we gained the 5% from lockless driving and lost 10% for
everyone else trying to fix the sins of lockless driving. So i am unsure
of the net gain. 

I apologize for hand-waving with % numbers above and using gut feeling
instead of experimental facts - I dont have time to chase it. I have
CCed Robert who may have time to see if this impacts forwarding
performance for one. I will have more peace of mind to find out there is
no impact.

cheers,
jamal

[1] By having both the forwarding path and tx softirq from multiple CPUs
enter this qdiscrun path, the chances that a packet will be dequeued
successfully and sent out within reasonable time are higher.
The tx_collision vs tx success are a good measure of how lucky you get.
This improves timeliness and granularity of qos for one. What your patch
does is reduce the granularity/possibility that we may enter
that region sooner.



Re: [PATCH 2/2] NET: Accurate packet scheduling for ATM/ADSL (userspace)

2006-06-20 Thread Patrick McHardy
jamal wrote:
> Heres the standard setup as i understand it(at least in north america, I
> know Europeans love their ATM with a little gravy on top):
> 
>
> |Linux| --ethernet-- |Modem| --DSL-- |DSLAM| --ATM-- |BRAS| 
> 
> 
> What this means is that Linux computes based on ethernet
> headers. Somewhere downstream ATM (refer to above) comes in, and that
> causes a mismatch between what Linux expects the bandwidth to be and what
> you actually get from your service provider, who doesn't account for the
> ATM overhead when they sell you "1.5Mbps".

Actually in the PPPoE case Linux doesn't know about ethernet
headers either, since shaping is usually done on the PPP device.
But that doesn't really matter since the ethernet link is not
the bottleneck - although it does add some delay for packetization.

> Yes, Linux cant tell if your service provider is lying to you.

I wouldn't call it lying as long as they don't say "1.5mbps IP
layer throughput". Ethernet doesn't provide 100mbit IP layer
throughput either, and with minimum sized IP packets its actually
well below that.

>>The patch is the solution to the classical problem people
>>have when trying to configure traffic control on an ADSL link:
>>
>>Q: The packet scheduling does not work all the time?
>>A: Try to decrease the bandwidth.
>>
>>
>>The issue here is, that ATM does not have fixed overhead (due to alignment 
>>and padding).  This means that a fixed reduction of the bandwidth is not 
>>the solution.  We could reduce the bandwidth to the worst-case overhead, 
>>which is 62%, I do not think that is a good solution...
>>
> 
> I dont see it as wrong to be honest with you. Your mileage may vary.

Its wasteful, and it can be avoided.

> Dont have time to read your doc and dont get me wrong, there is a
> "quark" practical problem: As practical as the hard disk manufacturer
> who claims that they have 11G drive when it is 10G. It needs to be
> resolved - but not in an intrusive way in my opinion.

Not sure what a "quark" problem is .. but I think you're focusing
too much on the aspect of "somebody is lying, not our fault".
This is a real problem for any medium that adds link-layer headers.
ATM is not even very special, the only thing special about it is
that it has multiple "steps". But maybe I'm misunderstanding you,
it has happened before :)

A non-intrusive way is preferred of course, but I can't really see
one if you want more than just a special-case solution that only
covers qdiscs using rate-tables and even ignores inner qdiscs.
HFSC and SFQ for example both need to calculate the wire length
at runtime.

Handling all qdiscs would mean adding a pointer to a mapping table
to struct net_device and using something like "skb_wire_len(skb, dev)"
instead of skb->len in the queueing layer. That of course doesn't
mean that we can't still provide pre-adjusted ratetables for qdiscs
that use them.
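
A rough sketch of what such a helper could look like, written as stand-alone
C rather than kernel code. Everything here is hypothetical: demo_dev and
demo_skb stand in for struct net_device and struct sk_buff, the table size is
arbitrary, and demo_fill_fixed_overhead() fills the table with a simple
fixed per-packet overhead purely so the lookup can be exercised.

```c
#include <stddef.h>

#define DEMO_MAX_LEN 2048u /* hypothetical table size, one entry per length */

/* Stand-in for the device: an optional length -> wire length mapping table. */
struct demo_dev {
	const unsigned short *wire_len_map; /* NULL means no compensation */
};

/* Stand-in for the packet: only the length matters here. */
struct demo_skb {
	unsigned int len;
};

/* Fill a table that adds a fixed number of link-layer bytes to every packet. */
void demo_fill_fixed_overhead(unsigned short *map, unsigned int overhead)
{
	unsigned int len;

	for (len = 0; len < DEMO_MAX_LEN; len++)
		map[len] = (unsigned short)(len + overhead);
}

/*
 * What the queueing layer would call instead of reading skb->len directly.
 * Falls back to the plain length when the device has no table or the packet
 * is longer than the table covers.
 */
unsigned int demo_skb_wire_len(const struct demo_skb *skb,
			       const struct demo_dev *dev)
{
	if (dev->wire_len_map == NULL || skb->len >= DEMO_MAX_LEN)
		return skb->len;
	return dev->wire_len_map[skb->len];
}
```

Because the lookup happens per packet at runtime, qdiscs that do not use
rate tables could share the same compensation, which is the point made
above about HFSC and SFQ.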



Re: [RFT] pcnet32 NAPI changes

2006-06-20 Thread Lennart Sorensen
On Tue, Jun 20, 2006 at 08:53:55AM -0500, Jon Mason wrote:
> The amount of polls per received packet is very low, thus removing the
> benefit of NAPI.  A compile time option would allow those users who know
> better to DTRT.

Well I know on the slow poke system I run on, with the napi polling, the
system can process packets, and get work done, and not fall over and die
from handling interrupts.  Without it, even 70Mbit of data on a single
port will flood the system with packet overruns to the point the
watchdog times out and the system reboots.  So I don't know if polling
is slightly more inefficient with little traffic, but it is certainly a lot
more efficient and safer when there is suddenly a lot more traffic.
Maybe it should be a module option, so that you can pick what you want.
Heck it could be a per port option even. :)

> Yup, but the "everyone else is doing it" argument never worked with my
> parents. All it takes is one brave soul to determine the reasoning
> behind the magic numbers and convert them into #define's.  Shouldn't be
> more than one day's work.

Is this a magic number in your opinion?

lp->a.write_csr(ioaddr, 0, 0x0002);  /* Set STRT bit */

I guess one could do
#define CSR0_RST 0x0001
#define CSR0_STRT 0x0002
#define CSR0_STOP 0x0004
etc...

and then
lp->a.write_csr(ioaddr, 0, CSR0_STRT); /* Set STRT bit */

Does that help?  I am not sure.  I think the comment behind it is
plenty.

Len Sorensen


[GIT PATCH] TIPC updates

2006-06-20 Thread Per Liden
Hi Dave,

Here are the latest TIPC updates.

Please pull from:

git://tipc.cslab.ericsson.net/pub/git/tipc.git

Thanks
/Per

 include/net/tipc/tipc_bearer.h |   12 ++
 net/tipc/bcast.c               |   79 ---
 net/tipc/bcast.h               |    2
 net/tipc/bearer.c              |   70 +++--
 net/tipc/cluster.c             |   22 ++--
 net/tipc/config.c              |   85 +++-
 net/tipc/core.c                |    7 +
 net/tipc/core.h                |   21 +++-
 net/tipc/discover.c            |    7 -
 net/tipc/eth_media.c           |    9 +-
 net/tipc/link.c                |  217 +++-
 net/tipc/name_distr.c          |   30 --
 net/tipc/name_table.c          |  203 -
 net/tipc/node.c                |   78 --
 net/tipc/node.h                |    2
 net/tipc/node_subscr.c         |   15 +--
 net/tipc/port.c                |   41
 net/tipc/ref.c                 |   31 +-
 net/tipc/socket.c              |  100 +++---
 net/tipc/subscr.c              |   18 ++-
 net/tipc/zone.c                |   19 ++--
 21 files changed, 647 insertions(+), 421 deletions(-)

Allan Stephens:
  [TIPC] Prevent name table corruption if no room for new publication
  [TIPC] Use correct upper bound when validating network zone number.
  [TIPC] Corrected potential misuse of tipc_media_addr structure.
  [TIPC] Allow ports to receive multicast messages through native API.
  [TIPC] Links now validate destination node specified by incoming messages.
  [TIPC] Multicast link failure now resets all links to "nacking" node.
  [TIPC] Allow compilation when CONFIG_TIPC_DEBUG is not set.
  [TIPC] Fixed privilege checking typo in dest_name_check().
  [TIPC] Fix misleading comment in buf_discard() routine.
  [TIPC] Added support for MODULE_VERSION capability.
  [TIPC] Validate entire interface name when locating bearer to enable.
  [TIPC] Non-operation-affecting corrections to comments & function definitions.
  [TIPC] Fixed connect() to detect a dest address that is missing or too short.
  [TIPC] Implied connect now saves dest name for retrieval as ancillary data.
  [TIPC] Can now return destination name of form {0,x,y} via ancillary data.
  [TIPC] Connected send now checks socket state when retrying congested send.
  [TIPC] Stream socket send indicates partial success if data partially sent.
  [TIPC] Improved performance of error checking during socket creation.
  [TIPC] recvmsg() now returns TIPC ancillary data using correct level (SOL_TIPC)
  [TIPC] Simplify code for returning partial success of stream send request.
  [TIPC] Optimized argument validation done by connect().
  [TIPC] Withdrawing all names from nameless port now returns success, not error
  [TIPC] Added missing warning for out-of-memory condition
  [TIPC] Fixed memory leak in tipc_link_send() when destination is unreachable
  [TIPC] Disallow config operations that aren't supported in certain modes.
  [TIPC] First phase of assert() cleanup
  [TIPC] Enhanced & cleaned up system messages; fixed 2 obscure memory leaks.
  [TIPC] Fixed link switchover bugs
  [TIPC] Get rid of dynamically allocated arrays in broadcast code.

Eric Sesterhenn:
  [TIPC] Fix for NULL pointer dereference

Per Liden:
  [TIPC] Fixed incorrect access permissions



Re: [DOC]: generic netlink

2006-06-20 Thread jamal
On Mon, 2006-19-06 at 18:37 -0400, Shailabh Nagar wrote:

> Completing the documentation on generic netlink usage will definitely be
> useful. I'd be happy to help out with this since I've recently gone through
> trying to understand and use genetlink for the taskstats interface. Hopefully
> this will help other users like me who aren't netlink experts to begin with !
> 

Thanks - I really appreciate it. 

> I've sent you a patch to the document that attempts to cover the following
> TODOS (didn't see any point sending it to the whole list since its harder to
> read patches to documentation). Pls use as you see fit.
> 

Ive received it and will respond to you privately.

> > TODO:
> > a) Add a more complete compiling kernel module with events.
> > Have Thomas put his Mashimaro example and point to it.
> (not the Mashimaro example, nor a completely compiled module but snippets
> of pseudo code taken from the user space program used in taskstats
> development, modified to the foobar example you've used)

Thomas had a more complete piece of code which exercised more paths.
The document just has to point to where that code is.

> > b) Describe some details on how user space -> kernel works
> > probably using libnl??
> > c) Describe discovery using the controller..
> 
> I'll provide another patch that will cover d) and e) in the set below, again
> in the context of the foobar example, which might need to be modified a bit.
> 

no problem. go nuts.

> > d) talk about policies etc
> > e) talk about how something coming from user space eventually
> > gets to you.
> > f) Talk about the TLV manipulation stuff from Thomas.
> > g) submit controller patch to iproute2
> 
> One point...does d), f) etc. belong in a separate doc describing usage
> of netlink attributes ? Its useful here too but not directly related to
> genetlink perhaps.
> 

My thought was to provide a one-stop shop; however,
it may be a separate doc or incorporated in this and referenced by it.

> > PS:- I dont have a good place to put this doc and point to, hence the
> > 17K attachment
> >
> 
> http://www.kernel.org/pub/linux/kernel/people/hadi/ ?
> 
> (unless your permissions have been revoked for lack of use ! :-)
> 

I am only allowed to put kernel patches there by the powers that be. So
this wont fit the criteria. It is hard to believe in these
times my ISP charges me $1/M/month every time i exceed my allocated 5M
quota. I have been with this ISP for > 10 years, hence migration gets
harder - and given that many years on the same account, even my .bashrc
approaches 5M ;->

cheers,
jamal





Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL

2006-06-20 Thread jamal
On Tue, 2006-20-06 at 02:54 +0200, Patrick McHardy wrote:
> jamal wrote:
> > - For further reflection: Have you considered the case where the rate
> > table has already been considered on some link speed in user space and
> > then somewhere post-config the physical link speed changes? This would
> > happen in the case where ethernet AN is involved and the partner makes
> > some changes (use ethtool). 
> > 
[..]
> I've thought about this a couple of times, scaling the virtual clock
> rate should be enough for "simple" qdiscs like TBF or HTB, which have
> a linear relation between time and bandwidth. I haven't really thought
> about the effects on HFSC yet, on a small scale the relation is
> non-linear. 

Does HFSC not depend on bandwidth? How is rate control achieved?

> But this is a different problem from trying to accommodate
> for link-layer overhead.
>

Yes, it is a different issue.

cheers,
jamal



[PATCH 3/3] FS_ENET: phydev pointer may be dereferenced without NULL check

2006-06-20 Thread Vitaly Bordug

When the interface is down, the phy is "disconnected" from the bus and phydev
is NULL. But ethtool may try to get/set phy regs even at that time, which
results in a NULL pointer dereference and an oops.

Signed-off-by: Vitaly Bordug <[EMAIL PROTECTED]>
---

 drivers/net/fs_enet/fs_enet-main.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
index 302ecaa..e475e22 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -882,12 +882,16 @@ static void fs_get_regs(struct net_devic
 static int fs_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 {
 	struct fs_enet_private *fep = netdev_priv(dev);
+	if (!fep->phydev)
+		return -EINVAL;
 	return phy_ethtool_gset(fep->phydev, cmd);
 }
 
 static int fs_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 {
 	struct fs_enet_private *fep = netdev_priv(dev);
+	if (!fep->phydev)
+		return -EINVAL;
 	phy_ethtool_sset(fep->phydev, cmd);
 	return 0;
 }



Re: [DOC]: generic netlink

2006-06-20 Thread jamal
On Tue, 2006-20-06 at 10:02 +0200, Thomas Graf wrote:
> * jamal <[EMAIL PROTECTED]> 2006-06-19 09:41

> One important point about attributes in generic netlink is that
> their scope is per command instead of per family as in netlink.
> It's not forbidden to use the same set of attribute identifiers
> for two separate commands but it should be avoided to have a
> single large list of attributes and have every command pick out
> the attributes it needs.
> 

Thanks - I will add this to the doc. Additionally the commands are
scoped per registered family (as opposed to needing them to be
encapsulated in the nlmsg_type).

> 
> > TODO:
> > a) Add a more complete compiling kernel module with events.
> > Have Thomas put his Mashimaro example and point to it.
> 
> I guess we have a legal issue here ;)
> 

change the name ;->

> > b) Describe some details on how user space -> kernel works
> > probably using libnl??
> 
> I'll take care of that.

Whats the plan? To add to this doc or separate doc?

cheers,
jamal



Re: [Bugme-new] [Bug 6682] New: BUG: soft lockup detected on CPU#0! / ksoftirqd takse 100% CPU

2006-06-20 Thread Robert Olsson

Hello!

Yes, it seems the system is very loaded for some reason

> > sometimes a day) we get 100% usage on ksoftirqd/0 and following messages
> > in logs:

as all softirqs are run via ksoftirqd. That's still OK, but why doesn't the
watchdog get any CPU share at all? Mismatch in priorities?

Herbert Xu writes:
> On Mon, Jun 19, 2006 at 10:20:10PM +, Andrew Morton wrote:
 > >
 > > >  [] dev_queue_xmit+0xe0/0x203
 > > >  [] ip_output+0x1e1/0x237
 > > >  [] ip_forward+0x181/0x1df
 > > >  [] ip_rcv+0x40c/0x485
 > > >  [] netif_receive_skb+0x12f/0x165
 > > >  [] e1000_clean_rx_irq+0x389/0x410 [e1000]
 > > >  [] e1000_clean+0x94/0x12f [e1000]
 > > >  [] net_rx_action+0x69/0xf0
 > > >  [] __do_softirq+0x55/0xbd
 > > >  [] do_softirq+0x2d/0x31
 > > >  [] local_bh_enable+0x5a/0x65
 > > >  [] rt_run_flush+0x5f/0x80

Normal for a router...

 > Could you tell us the frequency of route updates on this machine?
 > Route updates are pretty expensive especially when a large number
 > of flows hits your machine right afterwards.

Yes, flush is costly and unfortunately hard to avoid. We discussed this a
bit before...

 > You can monitor this by running ip mon.  You might just be getting
 > bogus route updates causing unnecessary flushes to the routing cache.

Just sampled 10 min on one of the routers with full 2 * (Full BGP). Well,
remember Zebra/Quagga has just one set in the kernel. Anyway, during the
10 minutes I looked I got 4 (insertions/deletions)/second on average.

Cheers.
--ro

 
 


[PATCH] Fix recommended permissions for /dev/net/tun

2006-06-20 Thread David Woodhouse
There's no reason to restrict unprivileged users from opening
the /dev/net/tun device node -- to do anything exciting requires
CAP_NET_ADMIN or a persistent device which is owned by the user in
question anyway. And if it _isn't_ openable by unprivileged users, then
giving ownership of devices to those users is a fairly pointless
exercise.

Signed-Off-By: David Woodhouse <[EMAIL PROTECTED]>

diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt
index 76750fb..9d696f2 100644
--- a/Documentation/networking/tuntap.txt
+++ b/Documentation/networking/tuntap.txt
@@ -39,10 +39,13 @@ Copyright (C) 1999-2000 Maxim Krasnyansk
  mknod /dev/net/tun c 10 200
   
   Set permissions:
- e.g. chmod 0700 /dev/net/tun
- if you want the device only accessible by root. Giving regular users the
- right to assign network devices is NOT a good idea. Users could assign
- bogus network interfaces to trick firewalls or administrators.
+ e.g. chmod 0666 /dev/net/tun
+ There's no harm in allowing the device to be accessible by non-root users,
+ since CAP_NET_ADMIN is required for creating network devices or for 
+ connecting to network devices which aren't owned by the user in question.
+ If you want to create persistent devices and give ownership of them to 
+ unprivileged users, then you need the /dev/net/tun device to be usable by
+ those users.
 
   Driver module autoloading
 


-- 
dwmw2



Re: [PATCH 2/2] NET: Accurate packet scheduling for ATM/ADSL (userspace)

2006-06-20 Thread jamal
On Tue, 2006-20-06 at 16:45 +0200, Patrick McHardy wrote:
> jamal wrote:

[..]

> Actually in the PPPoE case Linux doesn't know about ethernet
> headers either, since shaping is usually done on the PPP device.
> But that doesn't really matter since the ethernet link is not
> the bottleneck - although it does add some delay for packetization.

good point. But one could argue that is within linux (local) as opposed
to something downstream at the ISP i.e. i have knowledge of it and i
could do clever things. The other is: I have to know that the ISP is
using pigeons as the link layer downstream and compensate for it.

The issue really is whether Linux should be interested in the
throughput it is told about or the goodput (also known as effective
throughput) the service provider offers. Two different issues by
definition.

> > Yes, Linux cant tell if your service provider is lying to you.
> 
> I wouldn't call it lying as long as they don't say "1.5mbps IP
> layer throughput". 

It is a scam for sure.
By definition of what throughput is - you are telling the truth; just
not the whole truth. Most users think in terms of goodput and not
throughput. 
i.e. you are not telling the whole truth by not saying "it is 1.5Mbps ATM
throughput". Typically not an issue until somebody finds that by leaving
out "ATM" you meant throughput and not goodput.

> Ethernet doesn't provide 100mbit IP layer
> throughput either, and with minimum sized IP packets its actually
> well below that.
> 

OTOH, nobody has ethernet MTUs of 64 bytes.

> >>The issue here is, that ATM does not have fixed overhead (due to alignment 
> >>and padding).  This means that a fixed reduction of the bandwidth is not 
> >>the solution.  We could reduce the bandwidth to the worst-case overhead, 
> >>which is 62%, I do not think that is a good solution...
> >>
> > 
> > I dont see it as wrong to be honest with you. Your mileage may vary.
> 
> Its wasteful, and it can be avoided.
> 

If it can be avoided by being generic and without being intrusive, then
by all means.

> > Dont have time to read your doc and dont get me wrong, there is a
> > "quark" practical problem: As practical as the hard disk manufacturer
> > who claims that they have 11G drive when it is 10G. It needs to be
> > resolved - but not in an intrusive way in my opinion.
> 
> Not sure what a "quark" problem is .. but I think you're focusing
> too much on the aspect of "somebody is lying, not our fault".

No no - that is not my intent; sorry if it comes out that way. 
I am saying there is a practical "problem". The problem being someone is
equating throughput to effective throughput (also known as goodput).

To be academic and pedantic: The schedulers should be focusing on
throughput and not goodput.
Look at it from another angle related to the nature of the link layer
used:
If i buy a 1.5 Mbps 802.11JHS (such a link layer technology doesnt
exist, but assume for the sake of argument it does) from a wireless
service provider, ethernet headers etc - but in this case the link is so
bad (because of the link layer technology) i have to retransmit so much
that 0.5 Mbps is wasted on retransmits, the question becomes: 
1)Do i fix the scheduler to compensate for this link layer retransmit?
or
2)Do i find some other creative way to tell the scheduler that
without making any changes to it that my ftp (despite the retransmits)
should only chew 100Kbps?

I am saying that #2 is the choice to go with hence my assertion earlier,
it should be fine to tell the scheduler all it has is 1Mbps and nobody
gets hurt. #1 if i could do it with minimal intrusion and still get to
use it when i have 802.11g. 

Not sure i made sense.

> This is a real problem for any medium that adds link-layer headers.
> ATM is not even very special, the only thing special about it is
> that it has multiple "steps". But maybe I'm misunderstanding you,
> it has happened before :)
> 

I am not sure if i am making more sense now ;->

> A non intrusive way is preferred of course, but I can't really see
> one if you want more than just a special-case solution that only
> covers qdiscs using rate-tables and even ignores inner qdiscs.
> HFSC and SFQ for example both need to calculate the wire length
> at runtime.
> 

Agreed. That would be equivalent to #1 above.

> Handling all qdiscs would mean adding a pointer to a mapping table
> to struct net_device and using something like "skb_wire_len(skb, dev)"
> instead of skb->len in the queueing layer. 

That does seem sensible and simpler. I would suspect then that you will
do this one time with something like
ip dev add compensate_header 100 bytes

> That of course doesn't
> mean that we can't still provide pre-adjusted ratetables for qdiscs
> that use them.
> 

But what would the point be then if you can compensate as you did above?

Anyways, I have to go and meet The Man and i feel like i have hijacked
netdev this morning. So ttl.

cheers,
jamal


[PATCH 1/3] PAL: Support of the fixed PHY

2006-06-20 Thread Vitaly Bordug

This makes it possible for HW PHY-less boards to utilize PAL goodies.
Generic routines to connect to a fixed PHY are provided, as well as the
ability to specify a software callback that fills in link, speed, etc.
information in the PHY descriptor (the latter feature is not tested so far).

Signed-off-by: Vitaly Bordug <[EMAIL PROTECTED]>
---

 drivers/net/phy/Kconfig  |   17 ++
 drivers/net/phy/fixed.c  |  385 ++
 drivers/net/phy/phy_device.c |   51 +++---
 include/linux/phy.h  |1 
 4 files changed, 433 insertions(+), 21 deletions(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index cda3e53..425be84 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -51,5 +51,22 @@ config SMSC_PHY
---help---
  Currently supports the LAN83C185 PHY
 
+config FIXED_PHY
+   tristate "Drivers for PHY emulation on fixed speed/link"
+   depends on PHYLIB
+   ---help---
+ Adds the driver to PHY layer to cover the boards that do not have any PHY bound,
+ but with the ability to manipulate speed/link in software. The relevant MII
+ speed/duplex parameters could be effectively handled in a user-specified function.
+ Currently tested with mpc866ads.
+
+config FIXED_MII_10_FDX
+   bool "Emulation for 10M Fdx fixed PHY behavior"
+   depends on FIXED_PHY
+
+config FIXED_MII_100_FDX
+   bool "Emulation for 100M Fdx fixed PHY behavior"
+   depends on FIXED_PHY
+
 endmenu
 
diff --git a/drivers/net/phy/fixed.c b/drivers/net/phy/fixed.c
new file mode 100644
index 000..0360f65
--- /dev/null
+++ b/drivers/net/phy/fixed.c
@@ -0,0 +1,385 @@
+/*
+ * drivers/net/phy/fixed.c
+ *
+ * Driver for fixed PHYs, when transceiver is able to operate in one fixed mode.
+ *
+ * Author: Vitaly Bordug
+ *
+ * Copyright (c) 2006 MontaVista Software, Inc.
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#define MII_REGS_NUM   7
+
+/*
+The idea is to emulate normal phy behavior by responding with
+pre-defined values to mii BMCR read, so that read_status hook could
+take all the needed info.
+*/
+
+struct fixed_phy_status {
+   u8  link;
+   u16 speed;
+   u8  duplex;
+};
+
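+/*-----------------------------------------------------------------------------
+ *  Private information holder for mii_bus
+ *-----------------------------------------------------------------------------*/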
+struct fixed_info {
+   u16 *regs;
+   u8 regs_num;
+   struct fixed_phy_status phy_status;
+   struct phy_device *phydev; /* pointer to the container */
+   /* link & speed cb */
+   int(*link_update)(struct net_device*, struct fixed_phy_status*);
+
+};
+
+/*
+This is made global to free all the allocations on _exit call.
+Looks a bit odd, seems the only way.
+*/
+static struct fixed_info *fixed_ptr;
+
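+/*-----------------------------------------------------------------------------
+ *  If something weird is required to be done with link/speed,
+ * network driver is able to assign a function to implement this.
+ * May be useful for PHY's that need to be software-driven.
+ *-----------------------------------------------------------------------------*/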
+int fixed_mdio_set_link_update(struct phy_device* phydev,
+   int(*link_update)(struct net_device*, struct fixed_phy_status*))
+{
+   struct fixed_info *fixed;
+
+   if(link_update == NULL)
+   return -EINVAL;
+
+   if(phydev) {
+   if(phydev->bus) {
+   fixed = phydev->bus->priv;
+   fixed->link_update = link_update;
+   return 0;
+   }
+   }
+   return -EINVAL;
+}
+EXPORT_SYMBOL(fixed_mdio_set_link_update);
+
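+/*-----------------------------------------------------------------------------
+ *  This is used for updating internal mii regs from the status
+ *-----------------------------------------------------------------------------*/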
+static int fixed_mdio_update_regs(struct fixed_info *fixed)
+{
+   u16 *regs = fixed->regs;
+   u16 bmsr = 0;
+   u16 bmcr = 0;
+
+   if(!regs) {
+   printk(KERN_ERR "%s: regs not set up", __FUNCTION__);
+   return -1;
+   }
+
+   if(fixed->phy_status.link)
+   bmsr |= BMSR_LSTATUS;
+
+   if(fixed->phy_status.duplex) {
+   bmcr |= BMCR_FULLDPLX;
+
+   switch ( fixed->phy_status.speed ) {
+ 

Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL

2006-06-20 Thread Patrick McHardy
jamal wrote:
> On Tue, 2006-20-06 at 02:54 +0200, Patrick McHardy wrote:
> 
>>jamal wrote:
>>
>>>- For further reflection: Have you considered the case where the rate
>>>table has already been considered on some link speed in user space and
>>>then somewhere post-config the physical link speed changes? This would
>>>happen in the case where ethernet AN is involved and the partner makes
>>>some changes (use ethtool). 
>>>
> 
> [..]
> 
>>I've thought about this a couple of times, scaling the virtual clock
>>rate should be enough for "simple" qdiscs like TBF or HTB, which have
>>a linear relation between time and bandwidth. I haven't really thought
>>about the effects on HFSC yet, on a small scale the relation is
>>non-linear. 
> 
> 
> Does HFSC not depend on bandwidth? How is rate control achieved?

"Depend on bandwidth" is not the right term. All of TBF, HTB and HFSC
provide bandwidth per time, but with TBF and HTB the relation between
the amount of bandwidth and the amount of time is linear, while with
HFSC it is only linear on a larger scale since it uses service curves,
which are represented as two linear pieces. So you have bandwidth b1
for time t1, bandwidth b2 after that until eternity. By scaling the
clock rate you alter after how much time b2 kicks in, which affects
the guaranteed delays. The end result should be that both bandwidth
and delay scale up or down proportionally, but I'm not sure that this
is what HFSC would do in all cases (on small scale). But it should
be easy to answer with a bit more time for visualizing it.
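
To make the clock-scaling effect concrete, here is a minimal user-space
sketch of a two-piece linear service curve. It is not HFSC's actual code:
the struct, the function names and the num/den clock factor are made up
for illustration, and "scaling the clock" is modeled simply as evaluating
the same curve against a slower or faster clock.

```c
#include <stdint.h>

/* Hypothetical two-piece linear service curve: rate m1 for the first
 * d_us microseconds, rate m2 afterwards.  Rates are in bytes/second. */
struct two_piece_curve {
	uint64_t m1;
	uint64_t d_us;
	uint64_t m2;
};

/* Bytes of service the curve promises after t_us microseconds. */
static uint64_t curve_bytes(const struct two_piece_curve *c, uint64_t t_us)
{
	if (t_us <= c->d_us)
		return c->m1 * t_us / 1000000;
	return c->m1 * c->d_us / 1000000 +
	       c->m2 * (t_us - c->d_us) / 1000000;
}

/* Same curve, evaluated against a clock running at num/den of real
 * time (an assumed model of "scaling the clock rate"). */
static uint64_t curve_bytes_scaled(const struct two_piece_curve *c,
				   uint64_t t_us, uint64_t num, uint64_t den)
{
	return curve_bytes(c, t_us * num / den);
}
```

With m1 = 200000, d_us = 10000 and m2 = 100000, the unscaled curve reaches
its knee after 10 ms. With the clock at half speed (num = 1, den = 2) the
knee is only reached after 20 ms of real time, and both segments deliver
half the rate, which is the proportional scaling of bandwidth and delay
described above.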

The thing I'm not sure about is whether this wouldn't be handled better
by userspace. If the link layer speed changes you might not want
proportional scaling but prefer to still give a fixed amount of that
bandwidth to some class, for example VoIP traffic. Do we have netlink
notifications for link speed changes?



Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL

2006-06-20 Thread jamal
On Tue, 2006-20-06 at 03:04 +0200, Patrick McHardy wrote:
> jamal wrote:
> > You are still speaking ATM (and the above may still be valid), but: 
> > Could you for example look at the netdevice->type and from that figure
> > out the link layer overhead and compensate for it.
> > Obviously a lot more useful if such activity is doable in user space
> > without any knowledge of the kernel? and therefore zero change to the
> > kernel and everything then becomes forward and backward compatible.
> 
> It would be nice to have support for HFSC as well, which unfortunately
> needs to be done in the kernel since it doesn't use rate tables.
> What about qdiscs like SFQ (which uses the packet size in quantum
> calculations)? I guess it would make sense to use the wire-length
> there as well.

Didnt even think of that ;-> 
Is it getting too complicated? 

BTW, one thing I forgot to mention on the bandwidth issue: we could
send netlink events on link speed changes too; some listener
somewhere would then do the adjustment.

cheers,
jamal



Re: [RFT] pcnet32 NAPI changes

2006-06-20 Thread Jon Mason
On Tue, Jun 20, 2006 at 10:48:07AM -0400, Lennart Sorensen wrote:
> On Tue, Jun 20, 2006 at 08:53:55AM -0500, Jon Mason wrote:
> > The amount of polls per received packet is very low, thus removing the
> > benefit of NAPI.  A compile time option would allow those users who know
> > better to DTRT.
> 
> Well I know on the slow poke system I run on, with the napi polling, the
> system can process packets, and get work done, and not fall over and die
> from handling interrupts.  Without it, even 70Mbit of data on a single
> port will flood the system with packet overruns to the point the
> watchdog times out and the system reboots.  So I don't know if polling
> is slightly more inefficient with little traffic, it is certainly a lot
> more efficient and safer when there is suddenly a lot more traffic.
> Maybe it should be a module option, so that you can pick what you want.
> Heck it could be a per port option even. :)

The point of my comment was CPU utilization.

It appears that a bug is trying to be fixed by adding NAPI. This
sounds a bit hackish to me, and could hide the root cause of the
problem. So I'm not sure that is the best idea, but I will defer to
the maintainer.

> 
> > Yup, but the "everyone else is doing it" argument never worked with my
> > parents. All it takes is one brave soul to determine the reasoning
> > behind the magic numbers and convert them into #define's.  Shouldn't be
> > more than one day's work.
> 
> Is this a magic number in your opinion?
> 
> lp->a.write_csr(ioaddr, 0, 0x0002);  /* Set STRT bit */
> 
> I guess one could do
> #define CSR0_RST 0x0001
> #define CSR0_STRT 0x0002
> #define CSR0_STOP 0x0004
> etc...
> 
> and then
> lp->a.write_csr(ioaddr, 0, CSR0_STRT); /* Set STRT bit */
> 
> Does that help?  I am not sure.  I think the comment behind it is
> plenty.

But your example is just one instance.  Here is one without a comment:

lp->a.write_csr(ioaddr, 4, 0x0915);

What is it doing?  Is it still needed?  Can it be done anywhere else?  
Who knows, because it is magic.  The 4 can be defined as CSR0_STOP, per
your example above, but what does value 0x0915 do?

My point was that there are certain parts of the code which are
non-intuitive and should be commented and there are others for which a
good descriptive value would be nice.

> 
> Len Sorensen


Re: [PATCH] Fix recommended permissions for /dev/net/tun

2006-06-20 Thread David Woodhouse
On Tue, 2006-06-20 at 16:35 +0100, David Woodhouse wrote:
> There's no reason to restrict unprivileged users from opening
> the /dev/net/tun device node -- to do anything exciting requires
> CAP_NET_ADMIN or a persistent device which is owned by the user in
> question anyway. 

Hm, I lie. Let us alter reality to match my previous perception of it...

[PATCH] Require CAP_SYS_ADMIN to create tuntap devices.

The tuntap driver allows an admin to create persistent devices and
assign ownership of them to individual users. Unfortunately, relaxing
the permissions on the /dev/net/tun device node _also_ allows those
users to create arbitrary new devices of their own. This patch corrects
that, and adjusts the recommended permissions for the device node
accordingly.

Signed-Off-By: David Woodhouse <[EMAIL PROTECTED]>

diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt
index 76750fb..839cbb7 100644
--- a/Documentation/networking/tuntap.txt
+++ b/Documentation/networking/tuntap.txt
@@ -39,10 +39,13 @@ Copyright (C) 1999-2000 Maxim Krasnyansk
  mknod /dev/net/tun c 10 200
   
   Set permissions:
- e.g. chmod 0700 /dev/net/tun
- if you want the device only accessible by root. Giving regular users the
- right to assign network devices is NOT a good idea. Users could assign
- bogus network interfaces to trick firewalls or administrators.
+ e.g. chmod 0666 /dev/net/tun
+ There's no harm in allowing the device to be accessible by non-root users,
+ since CAP_NET_ADMIN is required for creating network devices or for 
+ connecting to network devices which aren't owned by the user in question.
+ If you want to create persistent devices and give ownership of them to 
+ unprivileged users, then you need the /dev/net/tun device to be usable by
+ those users.
 
   Driver module autoloading
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1ed2d9..6c62d5c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -490,6 +490,9 @@ static int tun_set_iff(struct file *file
 
err = -EINVAL;
 
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
/* Set dev type */
if (ifr->ifr_flags & IFF_TUN) {
/* TUN device */


-- 
dwmw2



Re: [0/5] GSO: Generic Segmentation Offload

2006-06-20 Thread Rick Jones

$ sudo ./ethtool -K lo gso on
$ sudo ifconfig lo mtu 1500
$ netperf -t TCP_STREAM
TCP STREAM TEST to localhost
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3598.17


Would it really mess people up if netperf started doing CPU utilization 
measurements by default on those platforms where it did not require 
prior calibration?  I think that might make it more likely that when 
folks run tests, even over loopback (esp on MP), we'll get the service 
demand figures that help show the change in stack efficiency.


rick jones

BTW, the style of the netperf test banner tells me you might want to 
upgrade to a newer version of netperf :)



Re: [PATCH] Fix recommended permissions for /dev/net/tun

2006-06-20 Thread Chase Venters

On Tue, 20 Jun 2006, David Woodhouse wrote:


> On Tue, 2006-06-20 at 16:35 +0100, David Woodhouse wrote:
>> There's no reason to restrict unprivileged users from opening
>> the /dev/net/tun device node -- to do anything exciting requires
>> CAP_NET_ADMIN or a persistent device which is owned by the user in
>> question anyway.
>
> Hm, I lie. Let us alter reality to match my previous perception of it...


Perhaps you lie again :)

Are you sure you're adding a capable(CAP_SYS_ADMIN)? :P


> [PATCH] Require CAP_SYS_ADMIN to create tuntap devices.
>
> The tuntap driver allows an admin to create persistent devices and
> assign ownership of them to individual users. Unfortunately, relaxing
> the permissions on the /dev/net/tun device node _also_ allows those
> users to create arbitrary new devices of their own. This patch corrects
> that, and adjusts the recommended permissions for the device node
> accordingly.
>
> Signed-Off-By: David Woodhouse <[EMAIL PROTECTED]>
>
> diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt
> index 76750fb..839cbb7 100644
> --- a/Documentation/networking/tuntap.txt
> +++ b/Documentation/networking/tuntap.txt
> @@ -39,10 +39,13 @@ Copyright (C) 1999-2000 Maxim Krasnyansk
>  mknod /dev/net/tun c 10 200
>
>   Set permissions:
> - e.g. chmod 0700 /dev/net/tun
> - if you want the device only accessible by root. Giving regular users the
> - right to assign network devices is NOT a good idea. Users could assign
> - bogus network interfaces to trick firewalls or administrators.
> + e.g. chmod 0666 /dev/net/tun
> + There's no harm in allowing the device to be accessible by non-root users,
> + since CAP_NET_ADMIN is required for creating network devices or for
> + connecting to network devices which aren't owned by the user in question.
> + If you want to create persistent devices and give ownership of them to
> + unprivileged users, then you need the /dev/net/tun device to be usable by
> + those users.
>
>   Driver module autoloading
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index a1ed2d9..6c62d5c 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -490,6 +490,9 @@ static int tun_set_iff(struct file *file
>
> err = -EINVAL;
>
> +   if (!capable(CAP_NET_ADMIN))
> +   return -EPERM;
> +
> /* Set dev type */
> if (ifr->ifr_flags & IFF_TUN) {
> /* TUN device */

Thanks,
Chase


Re: [PATCH 2/2] NET: Accurate packet scheduling for ATM/ADSL (userspace)

2006-06-20 Thread Patrick McHardy
jamal wrote:
> On Tue, 2006-20-06 at 16:45 +0200, Patrick McHardy wrote:
> 
>>Actually in the PPPoE case Linux doesn't know about ethernet
>>headers either, since shaping is usually done on the PPP device.
>>But that doesn't really matter since the ethernet link is not
>>the bottleneck - although it does add some delay for packetization.
> 
> 
> good point. But one could argue that is within linux (local) as opposed
> to something downstream at the ISP i.e. i have knowledge of it and i
> could do clever things. The other is: I have to know that the ISP is
> using pigeons as the link layer downstream and compensate for it.
> 
> The issue really is whether Linux should be interested in the
> throughput it is told about or the goodput (also known as effective
> throughput) the service provider offers. Two different issues by
> definition. 


In the case of PPPoE non-work-conserving qdiscs are already used
to manage a link that is non-local with knowledge of its
bandwidth, contrary to a local link that would be best managed
in work-conserving mode. And I think for better accuracy it is
necessary to manage effective throughput, especially if you're
interested in guaranteed delays.

>>>Yes, Linux cant tell if your service provider is lying to you.
>>
>>I wouldn't call it lying as long as they don't say "1.5mbps IP
>>layer throughput". 
> 
> 
> It is a scam for sure.
> By definition of what throughput is - you are telling the truth; just
> not the whole truth. Most users think in terms of goodput and not
> throughput. 
> i.e. you are not telling the whole truth by not saying "it is 1.5Mbps ATM
> throughput". Typically not an issue until somebody finds that by leaving
> out "ATM" you meant throughput and not goodput. 


I think that point can be used to argue that Linux should be able to
manage effective throughput :)

>>Ethernet doesn't provide 100mbit IP layer
>>throughput either, and with minimum sized IP packets its actually
>>well below that.
>
> 
> OTOH, nobody has ethernet MTUs of 64 bytes.


Sure, but I might not want my HFSC class with guaranteed delay of 140us
to be disturbed by someone sending small packets that need more time
on the wire than HFSC thinks.

> To be academic and pedantic: The schedulers should be focusing on
> throughput and not goodput.
> Look at it from another angle related to the nature of the link layer
> used:
> If i buy a 1.5 Mbps 802.11JHS (such a link layer technology doesnt
> exist, but assume for the sake of argument it does) from a wireless
> service provider, ethernet headers etc - but in this case the link is so
> bad (because of the link layer technology) i have to retransmit so much
> that 0.5 Mbps is wasted on retransmits, the question becomes: 
> 1)Do i fix the scheduler to compensate for this link layer retransmit?
> or
> 2)Do i find some other creative way to tell the scheduler that
> without making any changes to it that my ftp (despite the retransmits)
> should only chew 100Kbps?
> 
> I am saying that #2 is the choice to go with hence my assertion earlier,
> it should be fine to tell the scheduler all it has is 1Mbps and nobody
> gets hurt. #1 if i could do it with minimal intrusion and still get to
> use it when i have 802.11g. 
> 
> Not sure i made sense.

HFSC is actually capable of handling this quite well. If you use it
in work-conserving mode (and the card doesn't do (much) internal
queueing) it will get clocked by successful transmissions. Using
link-sharing classes you can define proportions for use of available
bandwidth, possibly with upper limits. No hacks required :)

Anyway, this again goes more in the direction of handling link speed
changes.

>>A non intrusive way is preferred of course, but I can't really see
>>one if you want more than just a special-case solution that only
>>covers qdiscs using rate-tables and even ignores inner qdiscs.
>>HFSC and SFQ for example both need to calculate the wire length
>>at runtime.
>>
> 
> Agreed. That would be equivalent to #1 above.
> 
> 
>>Handling all qdiscs would mean adding a pointer to a mapping table
>>to struct net_device and using something like "skb_wire_len(skb, dev)"
>>instead of skb->len in the queueing layer. 
> 
> 
> That does seem sensible and simpler. I would suspect then that you will
> do this one time with something like
> ip dev add compensate_header 100 bytes

Something like that, but it's a bit more complicated.
For ATM we need some mapping:
[0-48]  -> 53
[49-96] -> 106
...

for Ethernet we need:
[0-60] -> 64
[60-n] -> n + 4
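
As a rough illustration of the two mappings above, here is a small
user-space sketch that computes the wire length directly instead of
looking it up. The function names are made up, and it only models the
mapping as stated in this mail (48 payload bytes per 53-byte ATM cell,
a 64-byte minimum plus 4 bytes for Ethernet); no further encapsulation
overhead is included.

```c
/* ATM: every started 48-byte chunk costs one 53-byte cell, and even
 * an empty packet is charged one cell, matching "[0-48] -> 53". */
static unsigned int atm_wire_len(unsigned int len)
{
	unsigned int cells = (len + 47) / 48;

	if (cells == 0)
		cells = 1;
	return cells * 53;
}

/* Ethernet: lengths up to 60 are padded to 64 on the wire, anything
 * larger gets a fixed 4 bytes added, matching "[60-n] -> n + 4". */
static unsigned int eth_wire_len(unsigned int len)
{
	return len <= 60 ? 64 : len + 4;
}
```

For a 1500 byte packet this gives 32 cells (1696 bytes) on ATM and 1504
bytes on Ethernet.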

We could do something like this (feel free to imagine nicer names):

ATM:
table = {
    .step = 53,
    .map = {
        [0..48] = 53,
        [49..96] = 106,
        ...
    }
};

Requiring a table of size 32 for typical MTUs.

Ethernet:

table = {
    .step = 60,
    .map = {
        [0..60] = 60,
        [...] = 0,
    },
    .fixed_overhead = 4,
};

static inline unsigned int
sk

Re: [PATCH] Fix recommended permissions for /dev/net/tun

2006-06-20 Thread David Woodhouse
On Tue, 2006-06-20 at 11:46 -0500, Chase Venters wrote:
> Perhaps you lie again :)
> 
> Are you sure you're adding a capable(CAP_SYS_ADMIN)? :P 

I'm going to go home now. G'night.

-- 
dwmw2



Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL

2006-06-20 Thread Patrick McHardy
jamal wrote:
> On Tue, 2006-20-06 at 03:04 +0200, Patrick McHardy wrote:
> 
>>It would be nice to have support for HFSC as well, which unfortunately
>>needs to be done in the kernel since it doesn't use rate tables.
>>What about qdiscs like SFQ (which uses the packet size in quantum
>>calculations)? I guess it would make sense to use the wire-length
>>there as well.
> 
> 
> Didnt even think of that ;-> 
> Is it getting too complicated? 

The code wouldn't be very complicated, it just adds some overhead. If
you do something like I described in my previous mail the overhead for
people not using it would be an additional pointer test before reading
skb->len. I guess we could also make it a compile time option.
I personally think this is something that really improves our quality
of implementation, after all, it's "wire" resources qdiscs are meant
to manage.
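
The "additional pointer test" could look roughly like the following
user-space sketch. All types here are hypothetical stand-ins (they are
not the real struct net_device or struct sk_buff), and the cell-based
map is just one possible shape for the per-device mapping.

```c
#include <stddef.h>

/* Hypothetical per-device mapping; only a cell-based link layer is
 * modeled here. */
struct wire_map {
	unsigned int cell_payload;	/* payload bytes per cell, e.g. 48 */
	unsigned int cell_size;		/* bytes on the wire per cell, e.g. 53 */
};

/* Stand-ins for the kernel structures, reduced to the fields used. */
struct fake_net_device {
	const struct wire_map *wire_map;	/* NULL: nothing configured */
};

struct fake_sk_buff {
	unsigned int len;
};

static unsigned int skb_wire_len(const struct fake_sk_buff *skb,
				 const struct fake_net_device *dev)
{
	const struct wire_map *map = dev->wire_map;
	unsigned int cells;

	if (map == NULL)
		return skb->len;	/* unconfigured: one pointer test */

	cells = (skb->len + map->cell_payload - 1) / map->cell_payload;
	if (cells == 0)
		cells = 1;
	return cells * map->cell_size;
}
```

Devices without a mapping keep returning skb->len unchanged, so the
cost for people not using the feature is the NULL check only.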

> BTW, I forgot to mention one thing on the bandwidth issue is we could do
> is send netlink events on link speed changes too; some listener
> somewhere would then do the adjustment.

See the mail I just wrote :)


Re: [2/5] [NET]: Add generic segmentation offload

2006-06-20 Thread Michael Chan
On Tue, 2006-06-20 at 19:28 +1000, Herbert Xu wrote:

> [NET]: Add generic segmentation offload
> 
> +static int dev_gso_segment(struct sk_buff *skb)
> +{
> +   struct sk_buff *segs;
> +
> +   segs = skb_gso_segment(skb, skb->dev->features & NETIF_F_SG &&
> +   !illegal_highdma(dev, skb));

I think you need !illegal_highdma(skb->dev, skb)



Re: [RFT] pcnet32 NAPI changes

2006-06-20 Thread Lennart Sorensen
On Tue, Jun 20, 2006 at 11:05:04AM -0500, Jon Mason wrote:
> The point of my comment was CPU utilization.
> 
> It appears that a bug is trying to be fixed by adding NAPI. This
> sounds a bit hackish to me, and could hide the root cause of the
> problem. So I'm not sure that is the best idea, but I will defer to
> the maintainer.

No it isn't a bug.  If the hardware generates enough interrupts to keep
the cpu at 100% handling them, starving user space (since interrupts
have high priority compared to just running user code of course), then
the watchdog daemon which of course runs in user space will never run
and hence the watchdog hardware times out and resets the system, as it
is designed to do.  There is no bug, just a problem of too many
interrupts generated by the network hardware.  NAPI eliminates the
receive interrupts when the system is busy, solving the problem at its
root cause.

> But your example is just one instance.  Here is one without a comment:
> 
> lp->a.write_csr(ioaddr, 4, 0x0915);

Hmm.  0x0915 = 0000 1001 0001 0101 =>
*Auto Pad Transmit (bit 11).  Enables auto padding of packets.
*Missed Frame Counter Overflow Mask (bit 8):  Masks out interrupts on
 overflow of the missed frame counter.
*Receive Collision Counter Overflow Mask (bit 4):  Masks out interrupts on
 overflow of the receive collision counter.
*Transmit Start Mask (bit 2):  Masks out interrupts on start of
 transmit.

So every CSR has a different meaning for all its bits.  Defining each
one, and combining all of them could make a lot of the code really
messy.  Perhaps more comments on those places would be clearer.
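
For what it is worth, a sketch of what such defines might look like for
the CSR4 value above. The macro and function names are invented for this
example (they are not taken from the driver), and the bit meanings are
only those listed in this mail.

```c
#include <stdint.h>

/* Hypothetical names for the CSR4 bits decoded above. */
#define CSR4_APAD_XMT	(1u << 11)	/* auto pad transmit */
#define CSR4_MFCOM	(1u << 8)	/* missed frame counter overflow mask */
#define CSR4_RCVCCOM	(1u << 4)	/* receive collision counter overflow mask */
#define CSR4_TXSTRTM	(1u << 2)	/* transmit start mask */
#define CSR4_RESERVED	0x0003u		/* bits 0 and 1, per the data sheet quoted here */

#define CSR4_DECODED	(CSR4_APAD_XMT | CSR4_MFCOM | CSR4_RCVCCOM | CSR4_TXSTRTM)

/* Bits in a CSR4 value that the four names above do not account for. */
static uint16_t csr4_unexplained_bits(uint16_t value)
{
	return (uint16_t)(value & ~CSR4_DECODED);
}

/* Bits in a CSR4 value that fall into the reserved range. */
static uint16_t csr4_reserved_bits(uint16_t value)
{
	return (uint16_t)(value & CSR4_RESERVED);
}
```

The four named bits add up to 0x0914, so writing 0x0915 leaves bit 0
unexplained, and that bit sits in the reserved range.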

> What is it doing?  Is it still needed?  Can it be done anywhere else?  
> Who knows, because it is magic.  The 4 can be defined as CSR0_STOP, per
> your example above, but what does value 0x0915 do?

No, the 4 has a different meaning in CSR4.  It means stop in CSR0.  In
CSR4 it means Transmit Start Mask.  It masks interrupts on transmit
start.  I think the value is wrong, since my data sheet says bit 0 and 1
are reserved and should be written as 0.  0x0915 would write bit 0 as a
1 which violates the data sheet of the 972 at least.

> My point was that there are certain parts of the code which are
> non-intuitive and should be commented and there are others for which a
> good descriptive value would be nice.

Well I agree the code could get a bit better.  I did think overall that
the code was rather nice actually.

Len Sorensen


[PATCH 06/06] MLSXFRM: Add security context to acquire messages using PF_KEY

2006-06-20 Thread Venkat Yekkirala

This includes the security context of a security association created for
use by IKE in the acquire messages sent to IKE daemons using PF_KEY. This
would allow the daemons to include the security context in the negotiation,
so that the resultant association is unique to that security context.

Signed-off-by: Venkat Yekkirala <[EMAIL PROTECTED]>
---

net/key/af_key.c |   22 ++
1 file changed, 22 insertions(+)

--- linux-2.6.16.vanilla/net/key/af_key.c   2006-06-12 17:49:42.0 -0500
+++ linux-2.6.16/net/key/af_key.c   2006-06-19 19:48:24.0 -0500
@@ -2709,6 +2709,9 @@ static int pfkey_send_acquire(struct xfr
#endif
int sockaddr_size;
int size;
+   struct sadb_x_sec_ctx *sec_ctx;
+   struct xfrm_sec_ctx *xfrm_ctx;
+   int ctx_size = 0;

sockaddr_size = pfkey_sockaddr_size(x->props.family);
if (!sockaddr_size)
@@ -2724,6 +2727,11 @@ static int pfkey_send_acquire(struct xfr
else if (x->id.proto == IPPROTO_ESP)
size += count_esp_combs(t);

+   if ((xfrm_ctx = x->security)) {
+   ctx_size = PFKEY_ALIGN8(xfrm_ctx->ctx_len);
+   size +=  sizeof(struct sadb_x_sec_ctx) + ctx_size;
+   }
+
skb =  alloc_skb(size + 16, GFP_ATOMIC);
if (skb == NULL)
return -ENOMEM;
@@ -2819,6 +2827,20 @@ static int pfkey_send_acquire(struct xfr
else if (x->id.proto == IPPROTO_ESP)
dump_esp_combs(skb, t);

+   /* security context */
+   if (xfrm_ctx) {
+   sec_ctx = (struct sadb_x_sec_ctx *) skb_put(skb,
+   sizeof(struct sadb_x_sec_ctx) + ctx_size);
+   sec_ctx->sadb_x_sec_len =
+ (sizeof(struct sadb_x_sec_ctx) + ctx_size) / sizeof(uint64_t);
+   sec_ctx->sadb_x_sec_exttype = SADB_X_EXT_SEC_CTX;
+   sec_ctx->sadb_x_ctx_doi = xfrm_ctx->ctx_doi;
+   sec_ctx->sadb_x_ctx_alg = xfrm_ctx->ctx_alg;
+   sec_ctx->sadb_x_ctx_len = xfrm_ctx->ctx_len;
+   memcpy(sec_ctx + 1, xfrm_ctx->ctx_str,
+  xfrm_ctx->ctx_len);
+   }
+
return pfkey_broadcast(skb, GFP_ATOMIC, BROADCAST_REGISTERED, NULL);
}



FOR REFERENCE ONLY: MLSXFRM: Add support to serefpolicy

2006-06-20 Thread Venkat Yekkirala

This patch has been included here just for reference. It will be submitted
to the serefpolicy list later.

This patch adds a polmatch avperm to arbitrate flow/state's access to
a xfrm policy. It also defines MLS policy for association { sendto,
recvfrom, polmatch }.

NOTE: When an inbound packet is not using an IPSec SA, a check is performed
between the socket label and the unlabeled sid (SYSTEM_HIGH MLS label). For
MLS purposes however, the target of the check should be the MLS label taken
from the node sid (or secmark in the new secmark world). This would present
a severe performance overhead (to make a new sid based on the unlabeled sid
with the MLS taken from the node sid or secmark and then using this sid as
the target). While discussions are ongoing on fine tuning the networking
design in the context of secmark, IPSec, netlabel, etc., I have chosen to
currently make an exception for unlabeled_t SAs if TE policy allowed it. A
similar problem exists for the outbound case and it has been similarly
handled in the policy below (by making an exception for unlabeled_t).



--- serefpolicy-2.2.34/policy/mls   2006-04-20 07:18:44.0 -0500
+++ serefpolicy-2.2.34.ipsec/policy/mls 2006-05-11 10:04:29.0 -0500
@@ -671,4 +671,18 @@
# these access vectors have no MLS restrictions
# association *

+mlsconstrain association { recvfrom }
+((( l1 dom l2 ) and ( l1 domby h2 )) or
+ (( t1 == mlsnetreadtoclr ) and ( h1 dom l2 )) or
+ ( t1 == mlsnetread ) or
+ ( t2 == unlabeled_t ));
+
+mlsconstrain association { sendto }
+((( l1 dom l2 ) and ( l1 domby h2 )) or
+ ( t2 == unlabeled_t ));
+
+mlsconstrain association { polmatch }
+ (( l1 dom l2 ) and ( h1 domby h2 ));
+
+
') dnl end enable_mls
--- serefpolicy-2.2.34/policy/flask/access_vectors  2006-04-20 07:18:44.0 -0500
+++ serefpolicy-2.2.34.ipsec/policy/flask/access_vectors 2006-04-27 10:34:44.0 -0500
@@ -602,6 +602,7 @@
   sendto
   recvfrom
   setcontext
+   polmatch
}

# Updated Netlink class for KOBJECT_UEVENT family.


[PATCH 05/06] MLSXFRM: Add security context to acquire messages using netlink

2006-06-20 Thread Venkat Yekkirala

From: Serge Hallyn <[EMAIL PROTECTED]>

This includes the security context of a security association created for use
by IKE in the acquire messages sent to IKE daemons using netlink/xfrm_user.
This would allow the daemons to include the security context in the
negotiation, so that the resultant association is unique to that security
context.

Signed-off-by:  <[EMAIL PROTECTED]>
---

net/xfrm/xfrm_user.c |   45 ++---
1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index bac2dab..9c17261 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -909,27 +909,40 @@ rtattr_failure:
return -1;
}

-static int copy_to_user_sec_ctx(struct xfrm_policy *xp, struct sk_buff *skb)
+static int copy_sec_ctx(struct xfrm_sec_ctx *s, struct sk_buff *skb)
{
-   if (xp->security) {
-   int ctx_size = sizeof(struct xfrm_sec_ctx) +
-   xp->security->ctx_len;
-   struct rtattr *rt = __RTA_PUT(skb, XFRMA_SEC_CTX, ctx_size);
-   struct xfrm_user_sec_ctx *uctx = RTA_DATA(rt);
-
-   uctx->exttype = XFRMA_SEC_CTX;
-   uctx->len = ctx_size;
-   uctx->ctx_doi = xp->security->ctx_doi;
-   uctx->ctx_alg = xp->security->ctx_alg;
-   uctx->ctx_len = xp->security->ctx_len;
-   memcpy(uctx + 1, xp->security->ctx_str, xp->security->ctx_len);
-   }
-   return 0;
+   int ctx_size = sizeof(struct xfrm_sec_ctx) + s->ctx_len;
+   struct rtattr *rt = __RTA_PUT(skb, XFRMA_SEC_CTX, ctx_size);
+   struct xfrm_user_sec_ctx *uctx = RTA_DATA(rt);
+
+   uctx->exttype = XFRMA_SEC_CTX;
+   uctx->len = ctx_size;
+   uctx->ctx_doi = s->ctx_doi;
+   uctx->ctx_alg = s->ctx_alg;
+   uctx->ctx_len = s->ctx_len;
+   memcpy(uctx + 1, s->ctx_str, s->ctx_len);
+   return 0;

 rtattr_failure:
return -1;
}

+static inline int copy_to_user_state_sec_ctx(struct xfrm_state *x, struct sk_buff *skb)
+{
+   if (x->security) {
+   return copy_sec_ctx(x->security, skb);
+   }
+   return 0;
+}
+
+static inline int copy_to_user_sec_ctx(struct xfrm_policy *xp, struct sk_buff *skb)
+{
+   if (xp->security) {
+   return copy_sec_ctx(xp->security, skb);
+   }
+   return 0;
+}
+
static int dump_one_policy(struct xfrm_policy *xp, int dir, int count, void *ptr)
{
struct xfrm_dump_info *sp = ptr;
@@ -1708,7 +1721,7 @@ static int build_acquire(struct sk_buff 


if (copy_to_user_tmpl(xp, skb) < 0)
goto nlmsg_failure;
-   if (copy_to_user_sec_ctx(xp, skb))
+   if (copy_to_user_state_sec_ctx(x, skb))
goto nlmsg_failure;

nlh->nlmsg_len = skb->tail - b;


[PATCH 01/06] MLSXFRM: Granular IPSec associations for use in MLS environments

2006-06-20 Thread Venkat Yekkirala

The current approach to labeling Security Associations for SELinux purposes
uses a one-to-one mapping between xfrm policy rules and security associations.
This doesn't address the needs of real-world MLS (Multi-Level Security,
traditional Bell-LaPadula) environments, where a single xfrm policy rule
(pertaining to a range, classified to secret for example) might need to map to
multiple Security Associations (one each for classified, secret, top secret
and all the compartments applicable to these security levels).

This patch set addresses the above problem by allowing a single xfrm policy
rule to map to multiple security associations, with each association used in
the security context it is defined for. It also includes the security context
to be used in IKE negotiation in the acquire messages sent to the IKE daemon,
so that a unique SA can be negotiated for each unique security context. A
couple of bug fixes are also included: checks to make sure the SAs used by a
packet match policy (security context-wise) on the inbound, and that the
bundle used for the outbound matches the security context of the flow. This
patch set also makes the use of the SELinux sid in flow cache lookups seamless
by including the sid in the flow key itself.

Description of changes:

A "sid" member has been added to the flow cache key, making the sid available
at all needed locations and letting flow cache lookups use it automatically.
The flow sid is derived from the socket on the outbound and from the SAs
(unlabeled where an SA was not used) on the inbound.

Outbound case:
1. Find policy for the socket.

2. OLD: Find an SA that matches the policy.
   NEW: Find an SA that matches BOTH the policy and the flow/socket.
   This is necessary since not every SA that matches the policy
   can be used for the flow/socket. Consider policy range Secret-TS,
   and SAs each for Secret and TS. We don't want a TS socket to
   use the Secret SA. Hence the additional check of the SA vs. flow/socket.

3. NEW: When looking through bundles for a policy, make sure the flow/socket
   can use the bundle. If a bundle is not found, create one, calling for IKE
   if necessary. If using IKE, include the security context in the acquire
   message to the IKE daemon.
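
The difference between the OLD and NEW behaviour in step 2 can be sketched in
a few lines of userspace C. This is only an illustration under loud
assumptions: labels are reduced to a single integer level, the "flow may use
this SA" check is modelled as plain equality (the real check is an SELinux
access decision on sids), and every name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a label is just one ordered level. */
enum level { UNCLASSIFIED, CONFIDENTIAL, SECRET, TOP_SECRET };

struct policy { enum level low, high; };       /* xfrm policy with an MLS range */
struct sa     { enum level level; int spi; };  /* one security association */

/* Example data from the text: policy range Secret-TS, one SA per level. */
const struct policy example_policy = { SECRET, TOP_SECRET };
const struct sa example_sad[] = { { SECRET, 0x100 }, { TOP_SECRET, 0x200 } };

static int sa_in_policy_range(const struct sa *x, const struct policy *xp)
{
    return x->level >= xp->low && x->level <= xp->high;
}

/* OLD: the first SA that matches the policy wins, whatever the flow is. */
const struct sa *find_sa_old(const struct sa *sad, size_t n,
                             const struct policy *xp)
{
    for (size_t i = 0; i < n; i++)
        if (sa_in_policy_range(&sad[i], xp))
            return &sad[i];
    return NULL;
}

/* NEW: the SA must match the policy AND be usable by the flow. */
const struct sa *find_sa_new(const struct sa *sad, size_t n,
                             const struct policy *xp, enum level flow)
{
    assert(xp->low <= xp->high);
    for (size_t i = 0; i < n; i++)
        if (sa_in_policy_range(&sad[i], xp) && sad[i].level == flow)
            return &sad[i];
    return NULL;  /* no usable SA: the caller would ask IKE for one */
}
```

In this model find_sa_old() hands the Secret SA to any flow, including a TS
one, while find_sa_new() returns the TS SA for a TS flow and NULL when no SA
exists at the flow's level.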

Inbound case:
1. OLD: Find policy for the socket.
   NEW: Find policy for the incoming packet based on the sid of the SA(s) it
   used, or the unlabeled sid if no SAs were used. (Consider a case where a
   socket is "authorized" for two policies (unclassified-confidential,
   secret-top_secret). If the packet has come in using a secret SA, we really
   ought to be using the latter policy (secret-top_secret).)

2. OLD: BUG: No check to see if the SAs used by the packet agree with the
   policy sec_ctx-wise. (It was indicated in selinux_xfrm_sock_rcv_skb() that
   this was being accomplished by
   (x->id.spi == tmpl->id.spi || !tmpl->id.spi) in xfrm_state_ok, but it turns
   out tmpl->id.spi would normally be zero (unless xfrm policy rules specify
   one at the template level, which they usually don't).)
   NEW: The socket is checked for access to the SAs used (based on the sid of
   the SAs) in selinux_xfrm_sock_rcv_skb().

Forward case:
  This would be Step 1 from the Inbound case, followed by Steps 2 and 3 from
  the Outbound case.

Outstanding items/issues:
- Timewait acknowledgements and such are generated in the current/upstream
  implementation using a NULL socket, resulting in the any_socket sid
  (SYSTEM_HIGH) being used. This problem is not addressed by this patch set.

This patch: Add new flask definitions to SELinux

Adds a new avperm "polmatch" to arbitrate flow/state access to a xfrm policy
rule.

Signed-off-by: Venkat Yekkirala <[EMAIL PROTECTED]>
---

The patch set is relative to 2.6.17-rc6-mm2. A policy patch is also included
for reference. A patch to ipsec-tools/racoon will follow later on the
ipsectools-devel list. ipsec-tools 0.6.5 src in FC rawhide already has the
setkey changes needed to work with this.

FUNCTIONAL DESCRIPTION:

The basic idea is to have the IPSec policy specify an MLS range and have unique
SAs generated/used for each of the levels that fall in the range. SAs for
different levels can either be manually loaded (using setkey and such) or
negotiated using IKE (racoon, etc.).

Example:

Let's say we have the following in the SPD (Security Policy Database):

spdadd 9.2.9.15 9.2.9.17 any -ctx 1 1 "system_u:object_r:zzyzx_t:s0-s9:c0-c127"
-P in ipsec esp/transport//require ;
spdadd 9.2.9.17 9.2.9.15 any -ctx 1 1 "system_u:object_r:zzyzx_t:s0-s9:c0-c127"
-P out ipsec esp/transport//require ;

with nothing in the SAD (Security Association Database) initially. When the
kernel runs into the first packet with the label s2:c4 destined for 9.2.9.17,
it will see that there's no SA available to encrypt it with. So, it will call
upon racoon/IKE to generate an SA. Racoon will obtain the label (s2:c4) from
the kernel, do the negotiation with its

[PATCH 02/06] MLSXFRM: Define new SELinux service routine

2006-06-20 Thread Venkat Yekkirala

This defines a routine that combines the Type Enforcement portion of one sid
with the MLS portion from the other sid to arrive at a new sid. This is
currently used to define a sid for a security association that is to be
negotiated by IKE.
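
As a rough userspace illustration of the combination described above, the
sketch below works on context strings of the form user:role:type:mls instead
of sids. The string representation and the names mls_part() and
context_mls_copy() are assumptions for the example only; the kernel routine
operates on struct context and the sidtab.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return a pointer just past the third ':' of "user:role:type:mls", or NULL. */
static const char *mls_part(const char *ctx)
{
    for (int i = 0; i < 3; i++) {
        ctx = strchr(ctx, ':');
        if (!ctx)
            return NULL;
        ctx++;  /* step over the ':' */
    }
    return ctx;
}

/*
 * Write into `out` the user:role:type of `base` followed by the MLS portion
 * of `mls_src`. Returns `out`, or NULL on malformed input or short buffer.
 */
const char *context_mls_copy(const char *base, const char *mls_src,
                             char *out, size_t len)
{
    const char *base_mls, *new_mls;
    int n;

    assert(base && mls_src && out);
    base_mls = mls_part(base);
    new_mls = mls_part(mls_src);
    if (!base_mls || !new_mls)
        return NULL;  /* comparable to the unrecognized-SID error path */

    /* "%.*s" copies "user:role:type:" including the trailing colon. */
    n = snprintf(out, len, "%.*s%s", (int)(base_mls - base), base, new_mls);
    if (n < 0 || (size_t)n >= len)
        return NULL;
    return out;
}
```

For instance, combining a policy context of
system_u:object_r:zzyzx_t:s0-s9:c0-c127 with a flow labeled s2:c4 yields
system_u:object_r:zzyzx_t:s2:c4, which is the kind of label an SA negotiated
for that flow would carry.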

Signed-off-by: Venkat Yekkirala <[EMAIL PROTECTED]>
---

security/selinux/include/security.h |2 +
security/selinux/ss/mls.c   |   20 --
security/selinux/ss/mls.h   |   20 ++
security/selinux/ss/services.c  |   48 ++
4 files changed, 70 insertions(+), 20 deletions(-)


--- linux-2.6.16.vanilla/security/selinux/ss/mls.c  2006-06-12 17:38:25.0 -0500
+++ linux-2.6.16/security/selinux/ss/mls.c  2006-06-19 19:48:24.0 -0500
@@ -212,26 +212,6 @@ int mls_context_isvalid(struct policydb 
}


/*
- * Copies the MLS range from `src' into `dst'.
- */
-static inline int mls_copy_context(struct context *dst,
-  struct context *src)
-{
-   int l, rc = 0;
-
-   /* Copy the MLS range from the source context */
-   for (l = 0; l < 2; l++) {
-   dst->range.level[l].sens = src->range.level[l].sens;
-   rc = ebitmap_cpy(&dst->range.level[l].cat,
-&src->range.level[l].cat);
-   if (rc)
-   break;
-   }
-
-   return rc;
-}
-
-/*
 * Set the MLS fields in the security context structure
 * `context' based on the string representation in
 * the string `*scontext'.  Update `*scontext' to
--- linux-2.6.16.vanilla/security/selinux/ss/mls.h  2006-06-12 17:38:25.0 -0500
+++ linux-2.6.16/security/selinux/ss/mls.h  2006-06-19 19:48:24.0 -0500
@@ -17,6 +17,26 @@
#include "context.h"
#include "policydb.h"

+/*
+ * Copies the MLS range from `src' into `dst'.
+ */
+static inline int mls_copy_context(struct context *dst,
+  struct context *src)
+{
+   int l, rc = 0;
+
+   /* Copy the MLS range from the source context */
+   for (l = 0; l < 2; l++) {
+   dst->range.level[l].sens = src->range.level[l].sens;
+   rc = ebitmap_cpy(&dst->range.level[l].cat,
+&src->range.level[l].cat);
+   if (rc)
+   break;
+   }
+
+   return rc;
+}
+
int mls_compute_context_len(struct context *context);
void mls_sid_to_context(struct context *context, char **scontext);
int mls_context_isvalid(struct policydb *p, struct context *c);
--- linux-2.6.16.vanilla/security/selinux/ss/services.c 2006-06-12 17:49:44.0 -0500
+++ linux-2.6.16/security/selinux/ss/services.c 2006-06-19 19:48:24.0 -0500
@@ -1817,6 +1817,54 @@ out:
return rc;
}

+/*
+ * security_sid_mls_copy() - computes a new sid based on the given
+ * sid and the mls portion of mls_sid.
+ */
+int security_sid_mls_copy(u32 sid, u32 mls_sid, u32 *new_sid)
+{
+   struct context *context1 = NULL;
+   struct context *context2 = NULL;
+   struct context newcon;
+   int rc = 0;
+
+   if (!ss_initialized) {
+   *new_sid = sid;
+   goto out;
+   }
+
+   POLICY_RDLOCK;
+   context1 = sidtab_search(&sidtab, sid);
+   if (!context1) {
+   printk(KERN_ERR "security_sid_mls_copy:  unrecognized SID "
+  "%d\n", sid);
+   rc = -EINVAL;
+   goto out_unlock;
+   }
+
+   context2 = sidtab_search(&sidtab, mls_sid);
+   if (!context2) {
+   printk(KERN_ERR "security_sid_mls_copy:  unrecognized SID "
+  "%d\n", mls_sid);
+   rc = -EINVAL;
+   goto out_unlock;
+   }
+
+   newcon.user = context1->user;
+   newcon.role = context1->role;
+   newcon.type = context1->type;
+   rc = mls_copy_context(&newcon, context2);
+   if (rc)
+   goto out_unlock;
+
+   rc = sidtab_context_to_sid(&sidtab, &newcon, new_sid);
+
+out_unlock:
+   POLICY_RDUNLOCK;
+out:
+   return rc;
+}
+
struct selinux_audit_rule {
u32 au_seqno;
struct context au_ctxt;
--- linux-2.6.16.vanilla/security/selinux/include/security.h 2006-06-12 17:38:25.0 -0500
+++ linux-2.6.16/security/selinux/include/security.h 2006-06-19 19:48:24.0 -0500
@@ -78,6 +78,8 @@ int security_node_sid(u16 domain, void *
int security_validate_transition(u32 oldsid, u32 newsid, u32 tasksid,
 u16 tclass);

+int security_sid_mls_copy(u32 sid, u32 mls_sid, u32 *new_sid);
+
#define SECURITY_FS_USE_XATTR   1 /* use xattr */
#define SECURITY_FS_USE_TRANS   2 /* use transition SIDs, e.g. devpts/tmpfs */
#define SECURITY_FS_USE_TASK    3 /* use task SIDs, e.g. pipefs/sockfs */

[PATCH 03/06] MLSXFRM: Add security sid to sock

2006-06-20 Thread Venkat Yekkirala

This adds security for IP sockets at the sock level. Security at the
sock level is needed to enforce the SELinux security policy for security
associations even when a sock is orphaned (such as in the TCP LAST_ACK state).
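
The cloning problem this patch has to handle can be shown with a small
userspace sketch: a raw memcpy of the parent object would also copy the
parent's security pointer, so the copy has to keep its own blob and only
inherit the label. The types struct obj and struct sec_blob and the function
obj_copy() are stand-ins invented for this example, and copying the sid is an
assumption about what the clone hook does.

```c
#include <assert.h>
#include <string.h>

struct sec_blob { unsigned int sid; };  /* stand-in for the per-sock security data */

struct obj {                            /* stand-in for struct sock */
    int state;
    struct sec_blob *security;
};

/* Copy src over dst, keep dst's own security blob, then clone the label. */
void obj_copy(struct obj *dst, const struct obj *src)
{
    struct sec_blob *keep = dst->security;  /* allocated for dst beforehand */

    assert(keep && src->security);
    memcpy(dst, src, sizeof(*dst));         /* overwrites dst->security too */
    dst->security = keep;                   /* restore dst's private blob */
    dst->security->sid = src->security->sid; /* "clone" step: inherit the label */
}
```

Without restoring the pointer, parent and child would share one blob, and
freeing either object would leave the other with a dangling pointer.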

Signed-off-by: Venkat Yekkirala <[EMAIL PROTECTED]>
---

include/linux/security.h  |   12 
include/net/sock.h|9 +
net/core/sock.c   |2 +-
security/dummy.c  |5 +
security/selinux/hooks.c  |   26 --
security/selinux/include/objsec.h |1 +
6 files changed, 48 insertions(+), 7 deletions(-)

--- linux-2.6.16.vanilla/include/linux/security.h   2006-06-12 
17:49:31.0 -0500
+++ linux-2.6.16/include/linux/security.h   2006-06-19 19:48:24.0 
-0500
@@ -799,6 +800,8 @@ struct swap_info_struct;
 *  which is used to copy security attributes between local stream sockets.
 * @sk_free_security:
 *  Deallocate security structure.
+ * @sk_clone_security:
+ * Clone/copy security structure.
 * @sk_getsid:
 *  Retrieve the LSM-specific sid for the sock to enable caching of network
 *  authorizations.
@@ -1303,6 +1323,7 @@ struct security_operations {
int (*socket_getpeersec_dgram) (struct sk_buff *skb, char **secdata, 
u32 *seclen);
int (*sk_alloc_security) (struct sock *sk, int family, gfp_t priority);
void (*sk_free_security) (struct sock *sk);
+   void (*sk_clone_security) (const struct sock *sk, struct sock *newsk);
unsigned int (*sk_getsid) (struct sock *sk, struct flowi *fl, u8 dir);
#endif  /* CONFIG_SECURITY_NETWORK */

@@ -2809,6 +2836,11 @@ static inline void security_sk_free(stru
return security_ops->sk_free_security(sk);
}

+static inline void security_sk_clone(const struct sock *sk, struct sock *newsk)
+{
+   return security_ops->sk_clone_security(sk, newsk);
+}
+
static inline unsigned int security_sk_sid(struct sock *sk, struct flowi *fl, 
u8 dir)
{
return security_ops->sk_getsid(sk, fl, dir);
@@ -2936,6 +2968,10 @@ static inline void security_sk_free(stru
{
}

+static inline void security_sk_clone(const struct sock *sk, struct sock *newsk)
+{
+}
+
static inline unsigned int security_sk_sid(struct sock *sk, struct flowi *fl, 
u8 dir)
{
return 0;
--- linux-2.6.16.vanilla/include/net/sock.h 2006-06-19 17:02:23.0 
-0500
+++ linux-2.6.16/include/net/sock.h 2006-06-19 19:48:24.0 -0500
@@ -964,6 +964,15 @@ static inline void sock_graft(struct soc
write_unlock_bh(&sk->sk_callback_lock);
}

+static inline void sock_copy(struct sock *nsk, const struct sock *osk)
+{
+   void *sptr = nsk->sk_security;
+
+   memcpy(nsk, osk, osk->sk_prot->obj_size);
+   nsk->sk_security = sptr;
+   security_sk_clone(osk, nsk);
+}
+
extern int sock_i_uid(struct sock *sk);
extern unsigned long sock_i_ino(struct sock *sk);

--- linux-2.6.16.vanilla/net/core/sock.c2006-06-12 17:49:39.0 
-0500
+++ linux-2.6.16/net/core/sock.c2006-06-19 19:48:24.0 -0500
@@ -841,7 +841,7 @@ struct sock *sk_clone(const struct sock 
	if (newsk != NULL) {

struct sk_filter *filter;

-   memcpy(newsk, sk, sk->sk_prot->obj_size);
+   sock_copy(newsk, sk);

/* SANITY */
sk_node_init(&newsk->sk_node);
--- linux-2.6.16.vanilla/security/dummy.c   2006-06-12 17:49:44.0 
-0500
+++ linux-2.6.16/security/dummy.c   2006-06-19 19:50:12.0 -0500
@@ -794,6 +794,10 @@ static inline void dummy_sk_free_securit
{
}

+static inline void dummy_sk_clone_security (const struct sock *sk, struct sock 
*newsk)
+{
+}
+
static unsigned int dummy_sk_getsid(struct sock *sk, struct flowi *fl, u8 dir)
{
return 0;
@@ -1034,6 +1056,7 @@ void security_fixup_ops (struct security
set_to_dummy_if_null(ops, socket_getpeersec_dgram);
set_to_dummy_if_null(ops, sk_alloc_security);
set_to_dummy_if_null(ops, sk_free_security);
+   set_to_dummy_if_null(ops, sk_clone_security);
set_to_dummy_if_null(ops, sk_getsid);
 #endif /* CONFIG_SECURITY_NETWORK */
#ifdef  CONFIG_SECURITY_NETWORK_XFRM
--- linux-2.6.16.vanilla/security/selinux/hooks.c   2006-06-12 
17:49:44.0 -0500
+++ linux-2.6.16/security/selinux/hooks.c   2006-06-19 19:48:24.0 
-0500
@@ -268,15 +268,13 @@ static int sk_alloc_security(struct sock
{
struct sk_security_struct *ssec;

-   if (family != PF_UNIX)
-   return 0;
-
ssec = kzalloc(sizeof(*ssec), priority);
if (!ssec)
return -ENOMEM;

ssec->sk = sk;
ssec->peer_sid = SECINITSID_UNLABELED;
+   ssec->sid = SECINITSID_UNLABELED;
sk->sk_security = ssec;

return 0;
@@ -286,9 +284,6 @@ static void sk_free_security(struct sock
{
struct sk_security_struct *ssec = sk->sk_security;

-   if (sk->sk_family != PF_UNIX)
-

[PATCH 04/06] MLSXFRM: Flow based matching of xfrm policy and state

2006-06-20 Thread Venkat Yekkirala

This makes the security sid a part of the flow key and implements a seamless
mechanism for xfrm policy selection and state matching based on the flow sid.
This also includes the necessary SELinux enforcement pieces.

Signed-off-by: Venkat Yekkirala <[EMAIL PROTECTED]>
---


include/linux/security.h|  104 +--
include/net/flow.h  |5 
net/core/flow.c |7 -

net/xfrm/xfrm_policy.c  |   28 ++--
net/xfrm/xfrm_state.c   |   12 +
security/dummy.c|   23 +++
security/selinux/hooks.c|7 -
security/selinux/include/xfrm.h |   22 ++-
security/selinux/xfrm.c |  200 +-
9 files changed, 329 insertions(+), 79 deletions(-)

--- linux-2.6.16.test1/include/linux/security.h 2006-06-20 11:10:29.0 
-0500
+++ linux-2.6.16/include/linux/security.h   2006-06-20 11:42:27.0 
-0500
@@ -31,6 +31,7 @@
#include 
#include 
#include 
+#include 

struct ctl_table;

@@ -812,9 +813,8 @@ struct swap_info_struct;
 *  used by the XFRM system.
 *  @sec_ctx contains the security context information being provided by
 *  the user-level policy update program (e.g., setkey).
- * Allocate a security structure to the xp->security field.
- * The security field is initialized to NULL when the xfrm_policy is
- * allocated.
+ * Allocate a security structure to the xp->security field; the security
+ * field is initialized to NULL when the xfrm_policy is allocated.
 *  Return 0 if operation was successful (memory to allocate, legal context)
 * @xfrm_policy_clone_security:
 *  @old contains an existing xfrm_policy in the SPD.
@@ -833,9 +833,14 @@ struct swap_info_struct;
 *  Database by the XFRM system.
 *  @sec_ctx contains the security context information being provided by
 *  the user-level SA generation program (e.g., setkey or racoon).
- * Allocate a security structure to the x->security field.  The
- * security field is initialized to NULL when the xfrm_state is
- * allocated.
+ * @polsec contains the security context information associated with a xfrm
+ * policy rule from which to take the base context. polsec must be NULL
+ * when sec_ctx is specified.
+ * @sid contains the sid from which to take the mls portion of the context.
+ * Allocate a security structure to the x->security field; the security
+ * field is initialized to NULL when the xfrm_state is allocated. Set the
+ * context to correspond to either sec_ctx or polsec, with the mls portion
+ * taken from sid in the latter case.
 *  Return 0 if operation was successful (memory to allocate, legal 
context).
 * @xfrm_state_free_security:
 *  @x contains the xfrm_state.
@@ -846,13 +851,26 @@ struct swap_info_struct;
 * @xfrm_policy_lookup:
 *  @xp contains the xfrm_policy for which the access control is being
 *  checked.
- * @sk_sid contains the sock security label that is used to authorize
+ * @fl_sid contains the flow security label that is used to authorize
 *  access to the policy xp.
 *  @dir contains the direction of the flow (input or output).
- * Check permission when a sock selects a xfrm_policy for processing
+ * Check permission when a flow selects a xfrm_policy for processing
 *  XFRMs on a packet.  The hook is called when selecting either a
 *  per-socket policy or a generic xfrm policy.
 *  Return 0 if permission is granted.
+ * @xfrm_state_pol_flow_match:
+ * @x contains the state to match.
+ * @xp contains the policy to check for a match.
+ * @fl contains the flow to check for a match.
+ * Return 1 if there is a match.
+ * @xfrm_flow_state_match:
+ * @fl contains the flow key to match.
+ * @xfrm points to the xfrm_state to match.
+ * Return 1 if there is a match.
+ * @xfrm_decode_session:
+ * @skb points to skb to decode.
+ * @fl points to the flow key to set.
+ * Return 0 if successful decoding.
 *
 * Security hooks affecting all Key Management operations
 *
@@ -1314,10 +1332,16 @@ struct security_operations {
int (*xfrm_policy_clone_security) (struct xfrm_policy *old, struct 
xfrm_policy *new);
void (*xfrm_policy_free_security) (struct xfrm_policy *xp);
int (*xfrm_policy_delete_security) (struct xfrm_policy *xp);
-   int (*xfrm_state_alloc_security) (struct xfrm_state *x, struct 
xfrm_user_sec_ctx *sec_ctx);
+   int (*xfrm_state_alloc_security) (struct xfrm_state *x,
+   struct xfrm_user_sec_ctx *sec_ctx, struct xfrm_sec_ctx *polsec,
+   u32 sid);
void (*xfrm_state_free_security) (struct xfrm_state *x);
int (*xfrm_state_delete_security) (struct xfrm_state *x);
-   int (*xfrm_policy_lookup)(struct xfrm_policy *xp, u32 sk_sid, u8 dir);
+   int (*xfrm_policy_lookup)(struct xfrm_policy *xp, u32 fl_sid, u8 dir);
+   int (*xfrm_state_pol_flow_match)(struct xfrm_state

Intel ixgb driver bug in linux-2.6.17-rc6-mm2

2006-06-20 Thread Linas Vepstas

Hi,

I sat down to do some testing of the ixgb driver a few days ago, and
got failures within seconds.  From what I can tell, I'm getting either a
DMA to a bad address or some other PCI bus error, not sure which.
The problem appears to happen only for the driver that's in
2.6.17-rc6-mm2. As a sanity check, I'm testing the SuSE SLES10 beta,
which is 2.6.16 based, and it doesn't seem to have any problems.

My test is dirt-simple: telnet to the chargen port.  After an eyeblink,
I get the PCI bus error, and that's that. "Eyeblink" is after about 300 MBytes
transferred.  That was with a driver with NAPI enabled. I tried again
with NAPI disabled, and got to about 1.8 GB transferred in two eyeblinks.

To make sure that I'm not dealing with faulty hardware, I tried the same
thing w/ SLES10 2.6.16.18-1.8  and have gotten to RX bytes:20889480686
(19921.7 Mb) so far, with no problems. I don't have easy access to a PCI
bus analyzer, otherwise, I'd tell you more. Ideas? Suggestions? 

I could try taking the diff between these two driver versions, and
seeing what change caused the problem, but thought I should email first,
before doing that.

--linas


[PATCH] Require CAP_NET_ADMIN to create tuntap devices.

2006-06-20 Thread David Woodhouse
The tuntap driver allows an admin to create persistent devices and
assign ownership of them to individual users. Unfortunately, relaxing
the permissions on the /dev/net/tun device node so that they can
actually use those devices will _also_ allow those users to create
arbitrary new devices of their own. This patch corrects that, and
adjusts the recommended permissions for the device node accordingly.
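
The access rule described in the updated documentation text can be
summarized by the sketch below. It models the rule as worded there (creation
needs CAP_NET_ADMIN; attaching to an existing device needs either ownership
or CAP_NET_ADMIN) and is not the driver code; struct tun_dev_model and
tun_attach_allowed() are hypothetical names, and -1 stands in for -EPERM.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical view of a tun device: does it exist, and who owns it. */
struct tun_dev_model {
    bool exists;
    unsigned int owner;  /* uid the admin assigned the persistent device to */
};

/* Returns 0 if the caller may proceed, -1 if the request is denied. */
int tun_attach_allowed(const struct tun_dev_model *dev, unsigned int uid,
                       bool cap_net_admin)
{
    assert(dev);
    if (!dev->exists)
        return cap_net_admin ? 0 : -1;  /* creating a device needs the capability */
    if (dev->owner == uid)
        return 0;                       /* owners may attach to their own device */
    return cap_net_admin ? 0 : -1;      /* someone else's device needs the capability */
}
```

Under this rule an unprivileged user can open /dev/net/tun and attach to a
persistent device assigned to them, but cannot create new interfaces.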

Signed-Off-By: David Woodhouse <[EMAIL PROTECTED]>

diff --git a/Documentation/networking/tuntap.txt 
b/Documentation/networking/tuntap.txt
index 76750fb..839cbb7 100644
--- a/Documentation/networking/tuntap.txt
+++ b/Documentation/networking/tuntap.txt
@@ -39,10 +39,13 @@ Copyright (C) 1999-2000 Maxim Krasnyansk
  mknod /dev/net/tun c 10 200
   
   Set permissions:
- e.g. chmod 0700 /dev/net/tun
- if you want the device only accessible by root. Giving regular users the
- right to assign network devices is NOT a good idea. Users could assign
- bogus network interfaces to trick firewalls or administrators.
+ e.g. chmod 0666 /dev/net/tun
+ There's no harm in allowing the device to be accessible by non-root users,
+ since CAP_NET_ADMIN is required for creating network devices or for 
+ connecting to network devices which aren't owned by the user in question.
+ If you want to create persistent devices and give ownership of them to 
+ unprivileged users, then you need the /dev/net/tun device to be usable by
+ those users.
 
   Driver module autoloading
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1ed2d9..6c62d5c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -490,6 +490,9 @@ static int tun_set_iff(struct file *file
 
err = -EINVAL;
 
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
/* Set dev type */
if (ifr->ifr_flags & IFF_TUN) {
/* TUN device */


-- 
dwmw2



[PATCH v3 4/7] AMSO1100 Memory Management.

2006-06-20 Thread Steve Wise

V2 Review Changes:

- removed c2_array services and replaced them with the idr.

- removed c2_alloc services and made them pd-specific.

- don't use GFP_DMA.

- correctly map host memory for DMA (don't use __pa()).

V1 Review Changes:

- sizeof -> sizeof()

- cleaned up comments
---

 drivers/infiniband/hw/amso1100/c2_alloc.c |  144 +++
 drivers/infiniband/hw/amso1100/c2_mm.c|  375 +
 2 files changed, 519 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_alloc.c 
b/drivers/infiniband/hw/amso1100/c2_alloc.c
new file mode 100644
index 000..013b152
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_alloc.c
@@ -0,0 +1,144 @@
+/*
+ * Copyright (c) 2004 Topspin Communications.  All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include 
+#include 
+#include 
+
+#include "c2.h"
+
+static int c2_alloc_mqsp_chunk(struct c2_dev *c2dev, gfp_t gfp_mask, 
+  struct sp_chunk **head)
+{
+   int i;
+   struct sp_chunk *new_head;
+
+   new_head = (struct sp_chunk *) __get_free_page(gfp_mask);
+   if (new_head == NULL)
+   return -ENOMEM;
+
+   new_head->dma_addr = dma_map_single(c2dev->ibdev.dma_device, new_head, 
+   PAGE_SIZE, DMA_FROM_DEVICE);
+   pci_unmap_addr_set(new_head, mapping, new_head->dma_addr);
+
+   new_head->next = NULL;
+   new_head->head = 0;
+
+   /* build list where each index is the next free slot */
+   for (i = 0;
+i < (PAGE_SIZE - sizeof(struct sp_chunk) - 
+ sizeof(u16)) / sizeof(u16) - 1; 
+i++) {
+   new_head->shared_ptr[i] = i + 1;
+   }
+   /* terminate list */
+   new_head->shared_ptr[i] = 0xFFFF;
+
+   *head = new_head;
+   return 0;
+}
+
+int c2_init_mqsp_pool(struct c2_dev *c2dev, gfp_t gfp_mask, 
+ struct sp_chunk **root)
+{
+   return c2_alloc_mqsp_chunk(c2dev, gfp_mask, root);
+}
+
+void c2_free_mqsp_pool(struct c2_dev *c2dev, struct sp_chunk *root)
+{
+   struct sp_chunk *next;
+
+   while (root) {
+   next = root->next;
+   dma_unmap_single(c2dev->ibdev.dma_device, 
+pci_unmap_addr(root, mapping), PAGE_SIZE, 
+DMA_FROM_DEVICE);
+   __free_page((struct page *) root);
+   root = next;
+   }
+}
+
+u16 *c2_alloc_mqsp(struct c2_dev *c2dev, struct sp_chunk *head, 
+  dma_addr_t *dma_addr, gfp_t gfp_mask)
+{
+   u16 mqsp;
+
+   while (head) {
+   mqsp = head->head;
+   if (mqsp != 0xFFFF) {
+   head->head = head->shared_ptr[mqsp];
+   break;
+   } else if (head->next == NULL) {
+   if (c2_alloc_mqsp_chunk(c2dev, gfp_mask, &head->next) ==
+   0) {
+   head = head->next;
+   mqsp = head->head;
+   head->head = head->shared_ptr[mqsp];
+   break;
+   } else
+   return NULL;
+   } else
+   head = head->next;
+   }
+   if (head) {
+   *dma_addr = head->dma_addr + 
+   ((unsigned long) &(head->shared_ptr[mqsp]) - 
+(unsigned long) head);
+   pr_debug("%s addr %p dma_addr

[PATCH v3 2/2] iWARP Core Changes.

2006-06-20 Thread Steve Wise

This patch contains modifications to the existing rdma header files,
core files, drivers, and ulp files to support iWARP.

V2 Review updates:

V1 Review updates:

- copy_addr() -> rdma_copy_addr()

- dst_dev_addr param in rdma_copy_addr to const.

- various spacing nits with recasting

- include linux/inetdevice.h to get ip_dev_find() prototype.

- dev_put() after successful ip_dev_find()
---

 drivers/infiniband/core/Makefile |4 
 drivers/infiniband/core/addr.c   |   19 +
 drivers/infiniband/core/cache.c  |8 -
 drivers/infiniband/core/cm.c |3 
 drivers/infiniband/core/cma.c|  355 +++---
 drivers/infiniband/core/device.c |6 
 drivers/infiniband/core/mad.c|   11 +
 drivers/infiniband/core/sa_query.c   |5 
 drivers/infiniband/core/smi.c|   18 +
 drivers/infiniband/core/sysfs.c  |   18 +
 drivers/infiniband/core/ucm.c|5 
 drivers/infiniband/core/user_mad.c   |9 -
 drivers/infiniband/hw/ipath/ipath_verbs.c|2 
 drivers/infiniband/hw/mthca/mthca_provider.c |2 
 drivers/infiniband/ulp/ipoib/ipoib_main.c|8 +
 drivers/infiniband/ulp/srp/ib_srp.c  |2 
 include/rdma/ib_addr.h   |   15 +
 include/rdma/ib_verbs.h  |   39 ++-
 18 files changed, 437 insertions(+), 92 deletions(-)

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 68e73ec..163d991 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -1,7 +1,7 @@
 infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o
 
 obj-$(CONFIG_INFINIBAND) +=ib_core.o ib_mad.o ib_sa.o \
-   ib_cm.o $(infiniband-y)
+   ib_cm.o iw_cm.o $(infiniband-y)
 obj-$(CONFIG_INFINIBAND_USER_MAD) +=   ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=ib_uverbs.o ib_ucm.o
 
@@ -14,6 +14,8 @@ ib_sa-y :=sa_query.o
 
 ib_cm-y := cm.o
 
+iw_cm-y := iwcm.o
+
 rdma_cm-y :=   cma.o
 
 ib_addr-y :=   addr.o
diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index d294bbc..83f84ef 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -32,6 +32,7 @@ #include 
 #include 
 #include 
 #include 
+#include <linux/inetdevice.h>
 #include 
 #include 
 #include 
@@ -60,12 +61,15 @@ static LIST_HEAD(req_list);
 static DECLARE_WORK(work, process_req, NULL);
 static struct workqueue_struct *addr_wq;
 
-static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
-unsigned char *dst_dev_addr)
+int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
+const unsigned char *dst_dev_addr)
 {
switch (dev->type) {
case ARPHRD_INFINIBAND:
-   dev_addr->dev_type = IB_NODE_CA;
+   dev_addr->dev_type = RDMA_NODE_IB_CA;
+   break;
+   case ARPHRD_ETHER:
+   dev_addr->dev_type = RDMA_NODE_RNIC;
break;
default:
return -EADDRNOTAVAIL;
@@ -77,6 +81,7 @@ static int copy_addr(struct rdma_dev_add
memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
return 0;
 }
+EXPORT_SYMBOL(rdma_copy_addr);
 
 int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
 {
@@ -88,7 +93,7 @@ int rdma_translate_ip(struct sockaddr *a
if (!dev)
return -EADDRNOTAVAIL;
 
-   ret = copy_addr(dev_addr, dev, NULL);
+   ret = rdma_copy_addr(dev_addr, dev, NULL);
dev_put(dev);
return ret;
 }
@@ -160,7 +165,7 @@ static int addr_resolve_remote(struct so
 
/* If the device does ARP internally, return 'done' */
if (rt->idev->dev->flags & IFF_NOARP) {
-   copy_addr(addr, rt->idev->dev, NULL);
+   rdma_copy_addr(addr, rt->idev->dev, NULL);
goto put;
}
 
@@ -180,7 +185,7 @@ static int addr_resolve_remote(struct so
src_in->sin_addr.s_addr = rt->rt_src;
}
 
-   ret = copy_addr(addr, neigh->dev, neigh->ha);
+   ret = rdma_copy_addr(addr, neigh->dev, neigh->ha);
 release:
neigh_release(neigh);
 put:
@@ -244,7 +249,7 @@ static int addr_resolve_local(struct soc
if (ZERONET(src_ip)) {
src_in->sin_family = dst_in->sin_family;
src_in->sin_addr.s_addr = dst_ip;
-   ret = copy_addr(addr, dev, dev->dev_addr);
+   ret = rdma_copy_addr(addr, dev, dev->dev_addr);
} else if (LOOPBACK(src_ip)) {
ret = rdma_translate_ip((struct sockaddr *)dst_in, addr);
if (!ret)
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache
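
The addr.c hunks above extend the device-type dispatch so that an Ethernet netdev is reported as an RNIC instead of being rejected. A minimal user-space sketch of that dispatch follows. Every sketch_/SKETCH_ name is a hypothetical stand-in rather than a kernel symbol, and the NULL check before the copy is an assumption, since that part of the function falls between the hunks shown.

```c
/* User-space sketch of the link-type dispatch in rdma_copy_addr().
 * All sketch_/SKETCH_ names are illustrative, not kernel API. */
#include <assert.h>
#include <errno.h>
#include <string.h>

#define SKETCH_MAX_ADDR_LEN 32

enum sketch_link_type {
	SKETCH_LINK_ETHER,       /* stands in for ARPHRD_ETHER */
	SKETCH_LINK_INFINIBAND,  /* stands in for ARPHRD_INFINIBAND */
	SKETCH_LINK_OTHER        /* any link type the core does not handle */
};

enum sketch_node_type {
	SKETCH_NODE_IB_CA = 1,   /* stands in for RDMA_NODE_IB_CA */
	SKETCH_NODE_RNIC         /* stands in for RDMA_NODE_RNIC */
};

struct sketch_dev_addr {
	enum sketch_node_type dev_type;
	unsigned char dst_dev_addr[SKETCH_MAX_ADDR_LEN];
};

/* Returns 0 on success, -EADDRNOTAVAIL for an unsupported link type.
 * dst may be NULL; otherwise it must point at SKETCH_MAX_ADDR_LEN bytes. */
int sketch_copy_addr(struct sketch_dev_addr *out, enum sketch_link_type link,
		     const unsigned char *dst)
{
	assert(out != NULL);

	switch (link) {
	case SKETCH_LINK_INFINIBAND:
		out->dev_type = SKETCH_NODE_IB_CA;
		break;
	case SKETCH_LINK_ETHER:
		out->dev_type = SKETCH_NODE_RNIC;
		break;
	default:
		return -EADDRNOTAVAIL;
	}

	/* Assumed guard: several callers in the patch pass NULL here. */
	if (dst)
		memcpy(out->dst_dev_addr, dst, SKETCH_MAX_ADDR_LEN);
	return 0;
}
```

Callers that only need the local device type pass NULL for the destination address, as rdma_translate_ip() does in the patch.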

[PATCH v3 6/7] AMSO1100: Privileged Verbs Queues.

2006-06-20 Thread Steve Wise

Review Changes:

dprintk() -> pr_debug()
---

 drivers/infiniband/hw/amso1100/c2_vq.c |  260 
 drivers/infiniband/hw/amso1100/c2_vq.h |   63 
 2 files changed, 323 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_vq.c b/drivers/infiniband/hw/amso1100/c2_vq.c
new file mode 100644
index 000..445b1ed
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_vq.c
@@ -0,0 +1,260 @@
+/*
+ * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include 
+#include 
+
+#include "c2_vq.h"
+#include "c2_provider.h"
+
+/*
+ * Verbs Request Objects:
+ *
+ * VQ Request Objects are allocated by the kernel verbs handlers.
+ * They contain a wait object, a refcnt, an atomic bool indicating that the
+ * adapter has replied, and a copy of the verb reply work request.
+ * A pointer to the VQ Request Object is passed down in the context
+ * field of the work request message, and reflected back by the adapter
+ * in the verbs reply message.  The function handle_vq() in the interrupt
+ * path will use this pointer to:
+ * 1) append a copy of the verbs reply message
+ * 2) mark that the reply is ready
+ * 3) wake up the kernel verbs handler blocked awaiting the reply.
+ *
+ *
+ * The kernel verbs handlers do a "get" to put a 2nd reference on the 
+ * VQ Request object.  If the kernel verbs handler exits before the adapter
+ * can respond, this extra reference will keep the VQ Request object around
+ * until the adapter's reply can be processed.  The reason we need this is
+ * because a pointer to this object is stuffed into the context field of
+ * the verbs work request message, and reflected back in the reply message.
+ * It is used in the interrupt handler (handle_vq()) to wake up the appropriate
+ * kernel verb handler that is blocked awaiting the verb reply.  
+ * So handle_vq() will do a "put" on the object when it's done accessing it.
+ * NOTE:  If we guarantee that the kernel verb handler will never bail before 
+ *getting the reply, then we don't need these refcnts.
+ *
+ *
+ * VQ Request objects are freed by the kernel verbs handlers only 
+ * after the verb has been processed, or when the adapter fails and
+ * does not reply.  
+ *
+ *
+ * Verbs Reply Buffers:
+ *
+ * VQ Reply bufs are local host memory copies of an
+ * outstanding Verb Request reply
+ * message.  They are always allocated by the kernel verbs handlers, and _may_ be
+ * freed by either the kernel verbs handler -or- the interrupt handler.  The
+ * kernel verbs handler _must_ free the repbuf, then free the vq request object
+ * in that order.
+ */
+
+int vq_init(struct c2_dev *c2dev)
+{
+   sprintf(c2dev->vq_cache_name, "c2-vq:dev%c",
+   (char) ('0' + c2dev->devnum));
+   c2dev->host_msg_cache =
+   kmem_cache_create(c2dev->vq_cache_name, c2dev->rep_vq.msg_size, 0,
+ SLAB_HWCACHE_ALIGN, NULL, NULL);
+   if (c2dev->host_msg_cache == NULL) {
+   return -ENOMEM;
+   }
+   return 0;
+}
+
+void vq_term(struct c2_dev *c2dev)
+{
+   kmem_cache_destroy(c2dev->host_msg_cache);
+}
+
+/* vq_req_alloc - allocate a VQ Request Object and initialize it.
+ * The refcnt is set to 1.
+ */
+struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev)
+{
+   struct c2_vq_req *r;
+
+   r = kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL);
+   if (r) {
+   init_waitqueue_head(&r->wait_object);
+   r->reply_msg = (u64) NULL;
+   r->
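
The lifetime rule in the c2_vq.c header comment (allocate with one reference, take a second one for the pointer carried in the request's context field, drop it in the reply path) can be sketched in user space roughly as below. This is a single-threaded illustration under stated assumptions: the sketch_ names are hypothetical, a plain int stands in for the driver's atomic refcnt, and a live-object counter is added only so the behaviour can be observed.

```c
/* Single-threaded sketch of the get/put lifetime described for
 * VQ request objects. Names are illustrative, not the driver's. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int sketch_live_reqs;	/* request objects allocated and not yet freed */

struct sketch_vq_req {
	int refcnt;		/* the real driver would need an atomic here */
	bool reply_ready;	/* set once the adapter has replied */
};

/* Allocate a request holding one reference (the verbs handler's). */
struct sketch_vq_req *sketch_req_alloc(void)
{
	struct sketch_vq_req *r = malloc(sizeof(*r));

	if (r) {
		r->refcnt = 1;
		r->reply_ready = false;
		sketch_live_reqs++;
	}
	return r;
}

/* Take the extra reference that travels with the work request. */
void sketch_req_get(struct sketch_vq_req *r)
{
	r->refcnt++;
}

/* Drop one reference; the last put frees the object. */
void sketch_req_put(struct sketch_vq_req *r)
{
	assert(r->refcnt > 0);
	if (--r->refcnt == 0) {
		free(r);
		sketch_live_reqs--;
	}
}

/* Stand-in for the reply path: mark the reply and drop its reference. */
void sketch_handle_reply(struct sketch_vq_req *r)
{
	r->reply_ready = true;
	sketch_req_put(r);
}

int sketch_live_count(void)
{
	return sketch_live_reqs;
}
```

If the verbs handler gives up early and drops its own reference, the object stays alive until the reply path performs the final put, which is the case the comment's NOTE is about.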

[PATCH v3 5/7] AMSO1100 Message Queues.

2006-06-20 Thread Steve Wise

V2 Review Changes:

- correctly map host memory for DMA (don't use __pa()).

V1 Review Changes:

- remove useless asserts

- assert() -> BUG_ON()

- C2_DEBUG -> DEBUG
---

 drivers/infiniband/hw/amso1100/c2_mq.c |  175 
 drivers/infiniband/hw/amso1100/c2_mq.h |  107 
 2 files changed, 282 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_mq.c b/drivers/infiniband/hw/amso1100/c2_mq.c
new file mode 100644
index 000..96bbe9a
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_mq.c
@@ -0,0 +1,175 @@
+/*
+ * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "c2.h"
+#include "c2_mq.h"
+
+void *c2_mq_alloc(struct c2_mq *q)
+{
+   BUG_ON(q->magic != C2_MQ_MAGIC);
+   BUG_ON(q->type != C2_MQ_ADAPTER_TARGET);
+
+   if (c2_mq_full(q)) {
+   return NULL;
+   } else {
+#ifdef DEBUG
+   struct c2wr_hdr *m =
+   (struct c2wr_hdr *) (q->msg_pool.host + q->priv * q->msg_size);
+#ifdef CCMSGMAGIC
+   BUG_ON(m->magic != be32_to_cpu(~CCWR_MAGIC));
+   m->magic = cpu_to_be32(CCWR_MAGIC);
+#endif
+   return m;
+#else
+   return q->msg_pool.host + q->priv * q->msg_size;
+#endif
+   }
+}
+
+void c2_mq_produce(struct c2_mq *q)
+{
+   BUG_ON(q->magic != C2_MQ_MAGIC);
+   BUG_ON(q->type != C2_MQ_ADAPTER_TARGET);
+
+   if (!c2_mq_full(q)) {
+   q->priv = (q->priv + 1) % q->q_size;
+   q->hint_count++;
+   /* Update peer's offset. */
+   __raw_writew(cpu_to_be16(q->priv), &q->peer->shared);
+   }
+}
+
+void *c2_mq_consume(struct c2_mq *q)
+{
+   BUG_ON(q->magic != C2_MQ_MAGIC);
+   BUG_ON(q->type != C2_MQ_HOST_TARGET);
+
+   if (c2_mq_empty(q)) {
+   return NULL;
+   } else {
+#ifdef DEBUG
+   struct c2wr_hdr *m = (struct c2wr_hdr *)
+   (q->msg_pool.host + q->priv * q->msg_size);
+#ifdef CCMSGMAGIC
+   BUG_ON(m->magic != be32_to_cpu(CCWR_MAGIC));
+#endif
+   return m;
+#else
+   return q->msg_pool.host + q->priv * q->msg_size;
+#endif
+   }
+}
+
+void c2_mq_free(struct c2_mq *q)
+{
+   BUG_ON(q->magic != C2_MQ_MAGIC);
+   BUG_ON(q->type != C2_MQ_HOST_TARGET);
+
+   if (!c2_mq_empty(q)) {
+
+#ifdef CCMSGMAGIC
+   {
+   struct c2wr_hdr __iomem *m = (struct c2wr_hdr __iomem *)
+   (q->msg_pool.adapter + q->priv * q->msg_size);
+   __raw_writel(cpu_to_be32(~CCWR_MAGIC), &m->magic);
+   }
+#endif
+   q->priv = (q->priv + 1) % q->q_size;
+   /* Update peer's offset. */
+   __raw_writew(cpu_to_be16(q->priv), &q->peer->shared);
+   }
+}
+
+
+void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count)
+{
+   BUG_ON(q->magic != C2_MQ_MAGIC);
+   BUG_ON(q->type != C2_MQ_ADAPTER_TARGET);
+
+   while (wqe_count--) {
+   BUG_ON(c2_mq_empty(q));
+   *q->shared = cpu_to_be16((be16_to_cpu(*q->shared)+1) % q->q_size);
+   }
+}
+
+
+u32 c2_mq_count(struct c2_mq *q)
+{
+   s32 count;
+
+   if (q->type == C2_MQ_HOST_TARGET) {
+   count = be16_to_cpu(*q->shared) - q->priv;
+   } else {
+   count = q->priv - be16_to_cpu(*q->shared);
+   }
+
+   if (count < 0) {
+   count += q->q_size;
+   }
+
+   return (u32) count;
+}
+
+voi
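
The index arithmetic in c2_mq_produce(), c2_mq_free() and c2_mq_count() is ordinary ring-buffer bookkeeping: each side advances its own index modulo q_size, and the count is the producer/consumer difference corrected for wraparound. A user-space sketch follows. The sketch_ names are hypothetical, byte-order conversion and MMIO writes are left out, and the "full" test (one slot kept empty) is an assumption because c2_mq_full() is defined in c2_mq.h, which is not shown here.

```c
/* Ring-buffer index arithmetic modelled on c2_mq_count() and friends.
 * Illustrative only: no endian conversion, no MMIO, no locking. */
#include <assert.h>
#include <stdint.h>

struct sketch_ring {
	uint16_t q_size;	/* number of slots in the ring */
	uint16_t producer;	/* next slot the producer will fill */
	uint16_t consumer;	/* next slot the consumer will read */
};

int sketch_ring_empty(const struct sketch_ring *q)
{
	return q->producer == q->consumer;
}

/* Assumed convention: one slot stays unused so full != empty. */
int sketch_ring_full(const struct sketch_ring *q)
{
	return (uint16_t)((q->producer + 1) % q->q_size) == q->consumer;
}

/* Number of filled slots, corrected for index wraparound. */
uint32_t sketch_ring_count(const struct sketch_ring *q)
{
	int32_t count = (int32_t)q->producer - (int32_t)q->consumer;

	if (count < 0)
		count += q->q_size;
	return (uint32_t)count;
}

/* Advance the producer index; returns 1 on success, 0 if the ring is full. */
int sketch_ring_produce(struct sketch_ring *q)
{
	assert(q->q_size > 1);
	if (sketch_ring_full(q))
		return 0;
	q->producer = (uint16_t)((q->producer + 1) % q->q_size);
	return 1;
}

/* Advance the consumer index; returns 1 on success, 0 if the ring is empty. */
int sketch_ring_consume(struct sketch_ring *q)
{
	assert(q->q_size > 1);
	if (sketch_ring_empty(q))
		return 0;
	q->consumer = (uint16_t)((q->consumer + 1) % q->q_size);
	return 1;
}
```

With q_size = 4, three produces fill the ring; after two consumes and one more produce the producer index wraps to 0 while the consumer sits at 2, and the count comes out as 2 through the negative-difference correction.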

[PATCH v3 0/2][RFC] iWARP Core Support

2006-06-20 Thread Steve Wise

This patchset defines the modifications to the Linux infiniband subsystem
to support iWARP devices.  We're submitting it for review now with the
goal for inclusion in the 2.6.19 kernel.  This code has gone through
several reviews in the openib-general list.  Now we are submitting it
for external review by the linux community.

This StGIT patchset is cloned from Roland Dreier's infiniband.git
for-2.6.19 branch.  The patchset consists of 2 patches:

1 - New iWARP CM implementation.  
2 - Core changes to support iWARP.

I believe I've addressed all the round 1 and 2 review comments.
Details of the changes are tracked in each patch comment.

Signed-off-by: Tom Tucker <[EMAIL PROTECTED]>
Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 7/7] AMSO1100 Makefiles and Kconfig changes.

2006-06-20 Thread Steve Wise

Review Changes:

- C2DEBUG -> DEBUG
---

 drivers/infiniband/Kconfig |1 +
 drivers/infiniband/Makefile|1 +
 drivers/infiniband/hw/amso1100/Kbuild  |   10 ++
 drivers/infiniband/hw/amso1100/Kconfig |   15 +++
 drivers/infiniband/hw/amso1100/README  |   11 +++
 5 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ba2d650..04e6d4f 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -36,6 +36,7 @@ config INFINIBAND_ADDR_TRANS
 
 source "drivers/infiniband/hw/mthca/Kconfig"
 source "drivers/infiniband/hw/ipath/Kconfig"
+source "drivers/infiniband/hw/amso1100/Kconfig"
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index eea2732..e2b93f9 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_INFINIBAND)   += core/
 obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/
 obj-$(CONFIG_IPATH_CORE)   += hw/ipath/
+obj-$(CONFIG_INFINIBAND_AMSO1100)  += hw/amso1100/
 obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)   += ulp/srp/
diff --git a/drivers/infiniband/hw/amso1100/Kbuild b/drivers/infiniband/hw/amso1100/Kbuild
new file mode 100644
index 000..e1f10ab
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/Kbuild
@@ -0,0 +1,10 @@
+EXTRA_CFLAGS += -Idrivers/infiniband/include
+
+ifdef CONFIG_INFINIBAND_AMSO1100_DEBUG
+EXTRA_CFLAGS += -DDEBUG
+endif
+
+obj-$(CONFIG_INFINIBAND_AMSO1100) += iw_c2.o
+
+iw_c2-y := c2.o c2_provider.o c2_rnic.o c2_alloc.o c2_mq.o c2_ae.o c2_vq.o \
+   c2_intr.o c2_cq.o c2_qp.o c2_cm.o c2_mm.o c2_pd.o
diff --git a/drivers/infiniband/hw/amso1100/Kconfig b/drivers/infiniband/hw/amso1100/Kconfig
new file mode 100644
index 000..809cb14
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/Kconfig
@@ -0,0 +1,15 @@
+config INFINIBAND_AMSO1100
+   tristate "Ammasso 1100 HCA support"
+   depends on PCI && INET && INFINIBAND
+   ---help---
+ This is a low-level driver for the Ammasso 1100 host
+ channel adapter (HCA).
+
+config INFINIBAND_AMSO1100_DEBUG
+   bool "Verbose debugging output"
+   depends on INFINIBAND_AMSO1100
+   default n
+   ---help---
+ This option causes the amso1100 driver to produce a bunch of
+ debug messages.  Select this if you are developing the driver
+ or trying to diagnose a problem.
diff --git a/drivers/infiniband/hw/amso1100/README b/drivers/infiniband/hw/amso1100/README
new file mode 100644
index 000..1331353
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/README
@@ -0,0 +1,11 @@
+This is the OpenFabrics provider driver for the 
+AMSO1100 1Gb RNIC adapter. 
+
+This adapter is available in limited quantities 
+for development purposes from Open Grid Computing.
+
+This driver requires the IWCM and CMA mods necessary
+to support iWARP.
+
+Contact [EMAIL PROTECTED] for more information.
+


[PATCH v3 1/2] iWARP Connection Manager.

2006-06-20 Thread Steve Wise

This patch provides the new files implementing the iWARP Connection
Manager.

This module is a logical instance of the xx_cm where xx is the transport
type (ib or iw). The symbols exported are used by the transport
independent rdma_cm module, and are available also for transport
dependent ULPs.

V2 Review Changes:

- BUG_ON(1) -> BUG()

- Don't typecast when assigning between something* and void*

- pre-allocate iwcm_work objects to avoid allocating them in the interrupt
  context.

- copy private data on connect request and connect reply events.

- #if !defined() -> #ifndef

V1 Review Changes:

- sizeof -> sizeof()

- removed printks

- removed TT debug code

- cleaned up lock/unlock around switch statements.

- waitqueue -> completion for destroy path.
---

 drivers/infiniband/core/iwcm.c | 1008 
 include/rdma/iw_cm.h   |  255 ++
 include/rdma/iw_cm_private.h   |   63 +++
 3 files changed, 1326 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c
new file mode 100644
index 000..fe43c00
--- /dev/null
+++ b/drivers/infiniband/core/iwcm.c
@@ -0,0 +1,1008 @@
+/*
+ * Copyright (c) 2004, 2005 Intel Corporation.  All rights reserved.
+ * Copyright (c) 2004 Topspin Corporation.  All rights reserved.
+ * Copyright (c) 2004, 2005 Voltaire Corporation.  All rights reserved.
+ * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ * Copyright (c) 2005 Network Appliance, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Tom Tucker");
+MODULE_DESCRIPTION("iWARP CM");
+MODULE_LICENSE("Dual BSD/GPL");
+
+static struct workqueue_struct *iwcm_wq;
+struct iwcm_work {
+   struct work_struct work;
+   struct iwcm_id_private *cm_id;
+   struct list_head list;
+   struct iw_cm_event event;
+   struct list_head free_list;
+};
+
+/* 
+ * The following services provide a mechanism for pre-allocating iwcm_work 
+ * elements.  The design pre-allocates them  based on the cm_id type:
+ * LISTENING IDS:  Get enough elements preallocated to handle the
+ * listen backlog.
+ * ACTIVE IDS: 4: CONNECT_REPLY, ESTABLISHED, DISCONNECT, CLOSE
+ * PASSIVE IDS:3: ESTABLISHED, DISCONNECT, CLOSE 
+ *
+ * Allocating them in connect and listen avoids having to deal
+ * with allocation failures on the event upcall from the provider (which 
+ * is called in the interrupt context).  
+ *
+ * One exception is when creating the cm_id for incoming connection requests.  
+ * There are two cases:
+ * 1) in the event upcall, cm_event_handler(), for a listening cm_id.  If
+ *the backlog is exceeded, then no more connection request events will
+ *be processed.  cm_event_handler() returns -ENOMEM in this case.  It's up
+ *to the provider to reject the connection request.
+ * 2) in the connection request workqueue handler, cm_conn_req_handler().
+ *If work elements cannot be allocated for the new connect request cm_id,
+ *then IWCM will call the provider reject method.  This is ok since
+ *cm_conn_req_handler() runs in the workqueue thread context.
+ */
+
+static struct iwcm_work *get_work(struct iwcm_id_private *cm_id_priv)
+{
+   struct iwcm_work *work;
+
+   if (list_empty(&cm_id_priv->work_free_list))
+   return NULL;
+   work = list_entry(cm_id_priv->work_free_list
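
The pre-allocation scheme described in the iwcm.c comment (reserve the work elements when connect or listen is called, so the event upcall only ever pops from a free list and never allocates) can be sketched in user space as follows. The sketch_ names are hypothetical and a singly linked list stands in for the kernel's list_head. The locking the real code needs around the free list is omitted.

```c
/* Pre-allocated work-element pool, modelled on the design comment in
 * iwcm.c. Illustrative names; no locking; not the kernel list API. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_work {
	struct sketch_work *next;	/* free-list linkage */
	int event;			/* placeholder for the event payload */
};

struct sketch_id {
	struct sketch_work *free_list;	/* elements reserved for this id */
};

/* Release every element still on the free list. */
void sketch_dealloc_work(struct sketch_id *id)
{
	while (id->free_list) {
		struct sketch_work *w = id->free_list;

		id->free_list = w->next;
		free(w);
	}
}

/* Reserve 'count' elements up front, where failing is still cheap.
 * Returns 0 on success or -ENOMEM after undoing a partial reservation. */
int sketch_alloc_work(struct sketch_id *id, int count)
{
	assert(count >= 0);
	while (count--) {
		struct sketch_work *w = malloc(sizeof(*w));

		if (!w) {
			sketch_dealloc_work(id);
			return -ENOMEM;
		}
		w->event = 0;
		w->next = id->free_list;
		id->free_list = w;
	}
	return 0;
}

/* Event path: take a reserved element, or NULL if none are left. */
struct sketch_work *sketch_get_work(struct sketch_id *id)
{
	struct sketch_work *w = id->free_list;

	if (w) {
		id->free_list = w->next;
		w->next = NULL;
	}
	return w;
}

/* Return an element to the pool once its event has been handled. */
void sketch_put_work(struct sketch_id *id, struct sketch_work *w)
{
	w->next = id->free_list;
	id->free_list = w;
}
```

A passive-side id would reserve three elements (ESTABLISHED, DISCONNECT, CLOSE) under the scheme in the comment; a fourth sketch_get_work() call then returns NULL, which corresponds to the case where the real handler has to report -ENOMEM.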

[PATCH v3 0/7][RFC] Ammasso 1100 iWARP Driver

2006-06-20 Thread Steve Wise

This patchset implements the iWARP provider driver for the Ammasso
1100 RNIC.  It is dependent on the "iWARP Core Support" patch set.  We're
submitting it for review with the goal for inclusion in the 2.6.19 kernel.
This code has gone through several reviews in the openib-general list.
Now we are submitting it for external review by the linux community.

This StGIT patchset is cloned from Roland Dreier's infiniband.git
for-2.6.19 branch.  The patchset consists of 7 patches:

1 - Low-level device interface and native stack support
2 - Work request definitions
3 - Provider interface
4 - Memory management
5 - User mode message queue implementation  
6 - Verbs queue implementation
7 - Kconfig and Makefile

I believe I've addressed all the round 1 and 2 review comments.
Details of the changes are tracked in each patch comment.

Signed-off-by: Tom Tucker <[EMAIL PROTECTED]>
Signed-off-by: Steve Wise <[EMAIL PROTECTED]>


Re: [PATCH v3 1/7] AMSO1100 Low Level Driver.

2006-06-20 Thread Arjan van de Ven
On Tue, 2006-06-20 at 15:30 -0500, Steve Wise wrote:

> +/*
> + * Allocate TX ring elements and chain them together.
> + * One-to-one association of adapter descriptors with ring elements.
> + */
> +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr,
> + dma_addr_t base, void __iomem * mmio_txp_ring)
> +{
> + struct c2_tx_desc *tx_desc;
> + struct c2_txp_desc __iomem *txp_desc;
> + struct c2_element *elem;
> + int i;
> +
> + tx_ring->start = kmalloc(sizeof(*elem) * tx_ring->count, GFP_KERNEL);

I would think this needs a dma_alloc_coherent() rather than a kmalloc...


> +
> +/* Free all buffers in RX ring, assumes receiver stopped */
> +static void c2_rx_clean(struct c2_port *c2_port)
> +{
> + struct c2_dev *c2dev = c2_port->c2dev;
> + struct c2_ring *rx_ring = &c2_port->rx_ring;
> + struct c2_element *elem;
> + struct c2_rx_desc *rx_desc;
> +
> + elem = rx_ring->start;
> + do {
> + rx_desc = elem->ht_desc;
> + rx_desc->len = 0;
> +
> + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS);
> + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT);
> + __raw_writew(0, elem->hw_desc + C2_RXP_LEN);

you seem to be a fan of the __raw_write() functions... any reason why?
__raw_ is not a magic "go faster" prefix

Also on a related note, have you checked the driver for the needed PCI
posting flushes?

> +
> + /* Disable IRQs by clearing the interrupt mask */
> + writel(1, c2dev->regs + C2_IDIS);
> + writel(0, c2dev->regs + C2_NIMR0);

like here...
> +
> + elem = tx_ring->to_use;
> + elem->skb = skb;
> + elem->mapaddr = mapaddr;
> + elem->maplen = maplen;
> +
> + /* Tell HW to xmit */
> + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR);
> + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN);
> + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS);

or here

> +static int c2_change_mtu(struct net_device *netdev, int new_mtu)
> +{
> + int ret = 0;
> +
> + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU)
> + return -EINVAL;
> +
> + netdev->mtu = new_mtu;
> +
> + if (netif_running(netdev)) {
> + c2_down(netdev);
> +
> + c2_up(netdev);
> + }

this looks odd...





[PATCH v3 2/7] AMSO1100 WR / Event Definitions.

2006-06-20 Thread Steve Wise
2/7 gzipped & attached cuz it gets dropped by the lkml and netdev
lists...


Steve.


amso1100_wr.gz
Description: GNU Zip compressed data


Fwd: soft lockup detected with 2.6.16 kernel + e1000 driver under load

2006-06-20 Thread Massimiliano Poletto

Shaw Vrana asked me to resend the message below (originally posted to
lkml) to this list.

Note that I managed to make the problem go away by adding the boot
options "pci=usepirqmask acpi=noirq".  I got the hint by reading dmesg
output and Documentation/kernel-parameters.txt.  Still, maybe the
information below and my apparent solution will be of use to someone
more knowledgeable in this matter than I am.

Regards,
max

-- Forwarded message --

Summary: running tcpdump on an e1000 gig-e interface under heavy load
(line rate, minimum-size packets) causes "soft lockup detected" error
messages on the console.  The machine becomes unresponsive (no ping,
console freezes), but after the traffic stops it usually comes back.
The same experiment with a Broadcom gig-e card and a tg3 driver
succeeded: there were no lockups, and the machine continued to be
reachable and somewhat responsive even under load.

I would appreciate any help or advice.  Details are below.

Thanks,
max


- Hardware: IBM x346 dual-Xeon server with dual on-board Broadcom
NetXtreme (95721) 100/1000 interfaces (tg3 driver) and one dual-port
Intel e1000 card.

- Linux  kernel versions: several versions of 2.6.16, including the
latest 2.6.16.19.

- Intel e1000 driver versions: several, including 6.3.9-k4 (which is in
the kernel.org sources) and the latest 7.0.41 from sourceforge.  NAPI
disabled.

- Test traffic: random 64-byte TCP packets at 1.4Mpps from an Ixia
device.

- Kernel config: see http://maxp.net/linux/config-2.6.16.19-1

- Ring buffer output (dmesg): see http://maxp.net/linux/dmesg-11541-1

- Sample error messages (driver 7.0.41):

BUG: soft lockup detected on CPU#0!

Pid: 0, comm:  swapper
EIP: 0060:[] CPU: 0
EIP is at kfree+0x4f/0x61
EFLAGS: 0286Not tainted  (2.6.16.19 #1)
EAX: 0004 EBX: f7ff8380 ECX: f7d08188 EDX: f7d08188
ESI: f7fff880 EDI: f5483600 EBP: 0286 DS: 007b ES: 007b
CR0: 8005003b CR2: 081df780 CR3: 042b2ce0 CR4: 06f0
[] kfree_skbmem+0x8/0x73
[] netif_rx+0x149/0x18b
[] e1000_clean_rx_irq+0x1c3/0x555 [e1000]
[] e1000_intr+0x10d/0x3c9 [e1000]
[] mark_offset_tsc+0x1a7/0x2c9
[] do_timer+0x3b/0x439
[] handle_IRQ_event+0x2e/0x5a
[] __do_IRQ+0x91/0xe7
[] do_IRQ+0x4e/0x86
===
[] common_interrupt+0x1a/0x20
[] _raw_spin_unlock+0x3b/0x74
[] packet_rcv+0xb8/0x3a9
[] common_interrupt+0x1a/0x20
[] do_gettimeofday+0x20/0xd1
[] netif_receive_skb+0x206/0x291
[] process_backlog+0x82/0x107
[] net_rx_action+0x74/0x107
[] __do_softirq+0x70/0xda
[] do_softirq+0x4b/0x4f
===
[] do_IRQ+0x55/0x86
[] common_interrupt+0x1a/0x20
[] mwait_idle+0x2a/0x34
[] cpu_idle+0x61/0x76
[] start_kernel+0x2db/0x387
[] unknown_bootoption+0x0/0x257
BUG: soft lockup detected on CPU#0!

Pid: 0, comm:  swapper
EIP: 0060:[] CPU: 0
EIP is at e1000_alloc_rx_buffers+0x12e/0x461 [e1000]
EFLAGS: 0282Not tainted  (2.6.16.19 #1)
EAX:  EBX: 35c80812 ECX: 008e EDX: f7d16ac0
ESI: 008e EDI: f6138380 EBP: 05ee DS: 007b ES: 007b
CR0: 8005003b CR2: 081df780 CR3: 042b2ce0 CR4: 06f0
[] e1000_clean_rx_irq+0x434/0x555 [e1000]
[] e1000_intr+0x10d/0x3c9 [e1000]
[] mark_offset_tsc+0x1a7/0x2c9
[] do_timer+0x3b/0x439
[] handle_IRQ_event+0x2e/0x5a
[] __do_IRQ+0x91/0xe7
[] do_IRQ+0x4e/0x86
===
[] common_interrupt+0x1a/0x20
[] tg3_poll+0x5d4/0x93d [tg3]
[] _spin_unlock_irqrestore+0xa/0xc
[] common_interrupt+0x1a/0x20
[] net_rx_action+0x74/0x107
[] __do_softirq+0x70/0xda
[] do_softirq+0x4b/0x4f
===
[] do_IRQ+0x55/0x86
[] common_interrupt+0x1a/0x20
[] mwait_idle+0x2a/0x34
[] cpu_idle+0x61/0x76
[] start_kernel+0x2db/0x387
[] unknown_bootoption+0x0/0x257

BUG: soft lockup detected on CPU#0!

Pid: 0, comm:  swapper
EIP: 0060:[] CPU: 0
EIP is at netif_rx+0x131/0x18b
EFLAGS: 0246Not tainted  (2.6.16.19 #1)
EAX:  EBX: f210cc80 ECX: c0514680 EDX: c3fa4560
ESI: c3fa45e0 EDI: c051a000 EBP: 0246 DS: 007b ES: 007b
CR0: 8005003b CR2: 080f5d20 CR3: 37ec2fc0 CR4: 06f0
[] e1000_clean_rx_irq+0x214/0x63b [e1000]
[] e1000_intr+0x13f/0x47d [e1000]
[] smp_apic_timer_interrupt+0x55/0x5e
[] apic_timer_interrupt+0x1c/0x24
[] do_timer+0x3b/0x439
[] handle_IRQ_event+0x2e/0x5a
[] __do_IRQ+0x91/0xe7
[] do_IRQ+0x4e/0x86
===
[] common_interrupt+0x1a/0x20
[] mwait_idle+0x2a/0x34
[] cpu_idle+0x61/0x76
[] start_kernel+0x2db/0x387
[] unknown_bootoption+0x0/0x257
BUG: soft lockup detected on CPU#0!

Pid: 0, comm:  swapper
EIP: 0060:[] CPU: 0
EIP is at kmem_cache_free+0x29/0x3a
EFLAGS: 0246Not tainted  (2.6.16.19 #1)
EAX: 000d EBX: f7fdee00 ECX: f7eec9c0 EDX: f210cc80
ESI: 0246 EDI: f210cc80 EBP: 0246 DS: 007b ES: 007b
CR0: 8005003b CR2: 080f5d20 CR3: 37ec2fc0 CR4: 06f0
[] netif_rx+0x149/0x18b
[] e1000_clean_rx_irq+0x214/0x63b [e1000]
[] e1000_intr+0x13f/0x47d [e1000]
[] smp_apic_timer_interrupt+0x55/0x5e
[] apic_timer_interrupt+0x1c/0x24
[] do_timer+0x3b/0x439
[] handle_I

Re: Intel ixgb driver bug in linux-2.6.17-rc6-mm2

2006-06-20 Thread Jesse Brandeburg

On 6/20/06, Linas Vepstas <[EMAIL PROTECTED]> wrote:


> Hi,
>
> I sat down to do some testing of the ixgb driver a few days ago, and
> get failures within seconds.  From what I can tell, I'm getting either a
> DMA to a bad address or some other PCI bus error, not sure which.
> The problem appears to happen only for the driver that's in
> 2.6.17-rc6-mm2. As a sanity check, I'm testing the SuSE SLES10 beta,
> which is 2.6.16 based, and it doesn't seem to have any problems.
>
> My test is dirt-simple: telnet to the chargen port.  After an eyeblink,
> I get the pci bus error, that's that. "eyeblink" is after about 300MBytes
> transferred.  That was with a driver with NAPI enabled. I tried again
> with NAPI disabled, and got to about 1.8 GB transferred in two eyeblinks.
>
> To make sure that I'm not dealing with faulty hardware, I tried the same
> thing w/ SLES10 2.6.16.18-1.8  and have gotten to RX bytes:20889480686
> (19921.7 Mb) so far, with no problems. I don't have easy access to a PCI
> bus analyzer, otherwise, I'd tell you more. Ideas? Suggestions?
>
> I could try taking the diff between these two driver versions, and
> seeing what change caused the problem, but thought I should email first,
> before doing that.


For some reason I didn't get your mail at intel yet.  anyway, please
try disabling TSO using ethtool and see if that helps any.

you're running 1.0.109, correct?
what does cat /proc/interrupts say (are you running MSI?)

I'd also like to know if LLTX support (recently added) is causing you
the issue.  What hardware platform? pSeries?  does it EEH? what does
the dump say?


Re: soft lockup detected with 2.6.16 kernel + e1000 driver under load

2006-06-20 Thread Jesse Brandeburg

On 6/20/06, Massimiliano Poletto <[EMAIL PROTECTED]> wrote:

> Shaw Vrana asked me to resend the message below (originally posted to
> lkml) to this list.
>
> Note that I managed to make the problem go away by adding the boot
> options "pci=usepirqmask acpi=noirq".  I got the hint by reading dmesg
> output and Documentation/kernel-parameters.txt.  Still, maybe the
> information below and my apparent solution will be of use to someone
> more knowledgeable in this matter than I am.


That's odd that changing the kernel interrupt setup would make a difference.

How about compiling the e1000 driver with NAPI support?  I think your
problem will go away.  You don't have any choice but to run NAPI with
tg3, so please at least compare apples to apples by enabling NAPI and
re-running your test.




> BUG: soft lockup detected on CPU#0!
>
> Pid: 0, comm:  swapper
> EIP: 0060:[] CPU: 0
> EIP is at netif_rx+0x131/0x18b
>  EFLAGS: 0282Not tainted  (2.6.16.19 #1)
> EAX:  EBX: f2781a80 ECX: c0514680 EDX: c3fa4560
> ESI: c3fa45e0 EDI: c051a000 EBP: 0282 DS: 007b ES: 007b
> CR0: 8005003b CR2: 081ed000 CR3: 04356da0 CR4: 06f0
>  [] e1000_clean_rx_irq+0x214/0x63b [e1000]
>  [] e1000_intr+0x13f/0x47d [e1000]

spinning here would pretty much be expected and is exactly what NAPI
is designed to fix.
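
To illustrate the point, here is a toy user-space sketch of the
budgeted-polling idea behind NAPI.  It is not the e1000 or kernel API;
rx_ring, poll_rx and the budget values are made up for illustration.
The only point is that each poll call handles a bounded number of
packets and then returns, instead of looping in the interrupt handler
for as long as packets keep arriving.

```c
#include <stddef.h>

/* Hypothetical stand-in for a NIC receive ring: just a count of
 * packets waiting to be processed. */
struct rx_ring {
	size_t pending;
};

/* Process at most `budget` packets and return how many were handled.
 * A return value equal to `budget` means more work may remain, and the
 * caller should poll again later rather than keep spinning here. */
static size_t poll_rx(struct rx_ring *ring, size_t budget)
{
	size_t done = 0;

	while (done < budget && ring->pending > 0) {
		ring->pending--;	/* "receive" one packet */
		done++;
	}
	return done;
}
```

With 100 packets pending and a budget of 64, the first call handles 64
packets and the second handles the remaining 36.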

Jesse


Re: [DOC]: generic netlink

2006-06-20 Thread Thomas Graf
Hello

> > > TODO:
> > > a) Add a more complete compiling kernel module with events.
> > > Have Thomas put his Mashimaro example and point to it.
> > 
> > I guess we have a legal issue here ;)
> > 
> 
> change the name ;->

Ask Mr. Mashimaro has become my replacement for 8ball. Renaming
it would lead to a serious loss of coolness ;-)

> > > b) Describe some details on how user space -> kernel works
> > > probably using libnl??
> > 
> > I'll take care of that.
> 
> Whats the plan? To add to this doc or separate doc?

The status is that the code is there including userspace tools
to query the controller. Documentation is written as part of
the API reference (coming up with -pre6), no architectural notes
yet though. I think it's best to keep it separated and refer to
it both ways.



Re: [GIT PATCH] TIPC updates

2006-06-20 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Tue, 20 Jun 2006 16:48:54 +0200 (CEST)

> Please pull from:
> 
>   git://tipc.cslab.ericsson.net/pub/git/tipc.git

Hi Per.

I agree with James, you should post the patches so that people
can review them.

But not all in one posting! :-)

Look at how other contributors submit their work.  They make a set of
postings to the list, one for each patch, and each posting has
a subject that begins with something like "[PATCH 1/N] " where
the "1" increases for each patch and the "N" is the total number
of patches you are submitting for review.

People typically also give an initial "[PATCH 0/N] " posting
that states the location of the GIT tree containing the patches
and gives a general summary of what the upcoming set of patches
does.

There are automated scripts which can take a GIT tree and a given
range of changesets and prebuild the list postings for you.  So
the amount of work you need to do for this is minimal.

For example, "git format-patch" and "git send-email" can help you a
lot here.

Thanks!


ipv6 source address selection in addrconf.c (2.6.17)

2006-06-20 Thread Lukasz Stelmach
Greetings.

net/ipv6/addrconf.c:971 is

	/* Rule 2: Prefer appropriate scope */
	if (hiscore.rule < 2) {
		hiscore.scope = __ipv6_addr_src_scope(hiscore.addr_type);
		hiscore.rule++;
	}

I am afraid that this does not make sense, because I can find no place where a
value is assigned to hiscore.addr_type. There are more references to
hiscore.addr_type below, but the only assignment is when the whole structure is
cleared with memset(3).

I found this while trying to figure out why, when connecting to

2001:200:0:8002:203:47ff:fea5:3085 (www.kame.net)

with two global addresses assigned to the ethernet card

fd24:6f44:46bd:face::254
2002:531f:d667:face::254

rule 8 does not work and the first address is chosen.


Please CC answers.
-- 
Było mi bardzo miło.Czwarta pospolita klęska, [...]
>Łukasz<  Już nie katolicka lecz złodziejska.  (c)PP







Re: [FORCEDETH]: Fix xmit_lock/netif_tx_lock after merge

2006-06-20 Thread Linus Torvalds


On Tue, 20 Jun 2006, Herbert Xu wrote:
> 
> [FORCEDETH]: Fix xmit_lock/netif_tx_lock after merge

Btw, please don't use attachments in vain. Now I can't see it by default,
I can't reply and quote it etc etc.

Linus


Re: [RFC] [patch 0/6] [Network namespace] introduction

2006-06-20 Thread Daniel Lezcano

Al Viro wrote:
> On Fri, Jun 09, 2006 at 11:02:02PM +0200, [EMAIL PROTECTED] wrote:
> - renaming an interface in one "namespace" affects everyone.

Exact. If we ensure the interface can't be renamed if used in different
namespace, is it really a problem ?




Re: [RFC] [patch 0/6] [Network namespace] introduction

2006-06-20 Thread Al Viro
On Tue, Jun 20, 2006 at 11:21:43PM +0200, Daniel Lezcano wrote:
> Al Viro wrote:
> >On Fri, Jun 09, 2006 at 11:02:02PM +0200, [EMAIL PROTECTED] wrote:
> >- renaming an interface in one "namespace" affects everyone.
> 
> Exact. If we ensure the interface can't be renamed if used in different 
> namespace, is it really a problem ?

You _still_ have a single namespace; look in /sys/class/net and you'll see.


[PATCH 5/6 resend] b44: update version to 1.01

2006-06-20 Thread Gary Zambrano
Update the driver version to 1.01

Signed-off-by: Gary Zambrano <[EMAIL PROTECTED]>

diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index 98c0675..a7e4ba5 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -29,8 +29,8 @@
 
 #define DRV_MODULE_NAME	"b44"
 #define PFX DRV_MODULE_NAME": "
-#define DRV_MODULE_VERSION "1.00"
-#define DRV_MODULE_RELDATE "Apr 7, 2006"
+#define DRV_MODULE_VERSION "1.01"
+#define DRV_MODULE_RELDATE "Jun 16, 2006"
 
 #define B44_DEF_MSG_ENABLE   \
(NETIF_MSG_DRV  | \




[PATCH 4/6 resend] b44: add wol for old nic

2006-06-20 Thread Gary Zambrano
This patch adds wol support for the older 440x nics that use pattern matching.
This patch is a redo thanks to feedback from Michael Chan and Francois Romieu.
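
For reference, a standard magic packet payload is six 0xff bytes followed
by sixteen copies of the target MAC address; the patterns programmed in
this patch look for that payload at different header offsets.  Below is a
minimal user-space sketch of building such a payload.  The names
build_magic_payload and MAGIC_PAYLOAD_LEN are made up for illustration;
this is not code from the driver.

```c
#include <stdint.h>
#include <string.h>

#define MAC_LEN            6
#define MAGIC_SYNC_LEN     6
#define MAGIC_MAC_REPEATS  16
#define MAGIC_PAYLOAD_LEN  (MAGIC_SYNC_LEN + MAGIC_MAC_REPEATS * MAC_LEN)

/* Fill `out` (MAGIC_PAYLOAD_LEN bytes) with the 0xff sync stream
 * followed by sixteen copies of `mac`. */
static void build_magic_payload(const uint8_t mac[MAC_LEN],
				uint8_t out[MAGIC_PAYLOAD_LEN])
{
	int i;

	memset(out, 0xff, MAGIC_SYNC_LEN);
	for (i = 0; i < MAGIC_MAC_REPEATS; i++)
		memcpy(out + MAGIC_SYNC_LEN + i * MAC_LEN, mac, MAC_LEN);
}
```

The full payload is 102 bytes, so a fixed-size pattern window can only
cover part of it when the payload starts deep into the frame.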

Signed-off-by: Gary Zambrano  <[EMAIL PROTECTED]>

diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index 12fc67a..98c0675 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -75,6 +75,15 @@
 /* minimum number of free TX descriptors required to wake up TX process */
 #define B44_TX_WAKEUP_THRESH   (B44_TX_RING_SIZE / 4)
 
+/* b44 internal pattern match filter info */
+#define B44_PATTERN_BASE   0x400
+#define B44_PATTERN_SIZE   0x80
+#define B44_PMASK_BASE 0x600
+#define B44_PMASK_SIZE 0x10
+#define B44_MAX_PATTERNS   16
+#define B44_ETHIPV6UDP_HLEN	62
+#define B44_ETHIPV4UDP_HLEN	42
+
 static char version[] __devinitdata =
DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
 
@@ -1457,6 +1466,103 @@ static void b44_poll_controller(struct n
 }
 #endif
 
+static void bwfilter_table(struct b44 *bp, u8 *pp, u32 bytes, u32 table_offset)
+{
+	u32 i;
+	u32 *pattern = (u32 *) pp;
+
+	for (i = 0; i < bytes; i += sizeof(u32)) {
+		bw32(bp, B44_FILT_ADDR, table_offset + i);
+		bw32(bp, B44_FILT_DATA, pattern[i / sizeof(u32)]);
+	}
+}
+
+static int b44_magic_pattern(u8 *macaddr, u8 *ppattern, u8 *pmask, int offset)
+{
+	int magicsync = 6;
+	int k, j, len = offset;
+	int ethaddr_bytes = ETH_ALEN;
+
+	memset(ppattern + offset, 0xff, magicsync);
+	for (j = 0; j < magicsync; j++)
+		set_bit(len++, (unsigned long *) pmask);
+
+	for (j = 0; j < B44_MAX_PATTERNS; j++) {
+		if ((B44_PATTERN_SIZE - len) >= ETH_ALEN)
+			ethaddr_bytes = ETH_ALEN;
+		else
+			ethaddr_bytes = B44_PATTERN_SIZE - len;
+		if (ethaddr_bytes <=0)
+			break;
+		for (k = 0; k< ethaddr_bytes; k++) {
+			ppattern[offset + magicsync +
+				(j * ETH_ALEN) + k] = macaddr[k];
+			len++;
+			set_bit(len, (unsigned long *) pmask);
+		}
+	}
+	return len - 1;
+}
+
+/* Setup magic packet patterns in the b44 WOL
+ * pattern matching filter.
+ */
+static void b44_setup_pseudo_magicp(struct b44 *bp)
+{
+
+	u32 val;
+	int plen0, plen1, plen2;
+	u8 *pwol_pattern;
+	u8 pwol_mask[B44_PMASK_SIZE];
+
+	pwol_pattern = kmalloc(B44_PATTERN_SIZE, GFP_KERNEL);
+	if (!pwol_pattern) {
+		printk(KERN_ERR PFX "Memory not available for WOL\n");
+		return;
+	}
+
+	/* Ipv4 magic packet pattern - pattern 0.*/
+	memset(pwol_pattern, 0, B44_PATTERN_SIZE);
+	memset(pwol_mask, 0, B44_PMASK_SIZE);
+	plen0 = b44_magic_pattern(bp->dev->dev_addr, pwol_pattern, pwol_mask,
+				  B44_ETHIPV4UDP_HLEN);
+
+	bwfilter_table(bp, pwol_pattern, B44_PATTERN_SIZE, B44_PATTERN_BASE);
+	bwfilter_table(bp, pwol_mask, B44_PMASK_SIZE, B44_PMASK_BASE);
+
+	/* Raw ethernet II magic packet pattern - pattern 1 */
+	memset(pwol_pattern, 0, B44_PATTERN_SIZE);
+	memset(pwol_mask, 0, B44_PMASK_SIZE);
+	plen1 = b44_magic_pattern(bp->dev->dev_addr, pwol_pattern, pwol_mask,
+				  ETH_HLEN);
+
+	bwfilter_table(bp, pwol_pattern, B44_PATTERN_SIZE,
+		       B44_PATTERN_BASE + B44_PATTERN_SIZE);
+	bwfilter_table(bp, pwol_mask, B44_PMASK_SIZE,
+		       B44_PMASK_BASE + B44_PMASK_SIZE);
+
+	/* Ipv6 magic packet pattern - pattern 2 */
+	memset(pwol_pattern, 0, B44_PATTERN_SIZE);
+	memset(pwol_mask, 0, B44_PMASK_SIZE);
+	plen2 = b44_magic_pattern(bp->dev->dev_addr, pwol_pattern, pwol_mask,
+				  B44_ETHIPV6UDP_HLEN);
+
+	bwfilter_table(bp, pwol_pattern, B44_PATTERN_SIZE,
+		       B44_PATTERN_BASE + B44_PATTERN_SIZE + B44_PATTERN_SIZE);
+	bwfilter_table(bp, pwol_mask, B44_PMASK_SIZE,
+		       B44_PMASK_BASE + B44_PMASK_SIZE + B44_PMASK_SIZE);
+
+	kfree(pwol_pattern);
+
+	/* set these pattern's lengths: one less than each real length */
+	val = plen0 | (plen1 << 8) | (plen2 << 16) | WKUP_LEN_ENABLE_THREE;
+	bw32(bp, B44_WKUP_LEN, val);
+
+	/* enable wakeup pattern matching */
+	val = br32(bp, B44_DEVCTRL);
+	bw32(bp, B44_DEVCTRL, val | DEVCTRL_PFE);
+
+}
 
 static void b44_setup_wol(struct b44 *bp)
 {
@@ -1482,7 +1588,9 @@ static void b44_setup_wol(struct b44 *bp
 		val = br32(bp, B44_DEVCTRL);
 		bw32(bp, B44_DEVCTRL, val | DEVCTRL_MPM | DEVCTRL_PFE);
 
-	}
+	} else {
+		b44_setup_pseudo_magicp(bp);
+	}
 
 	val = br32(bp, B44_SBTMSLOW);
 	bw32(bp, B44_SBTMSLOW,

[PATCH 3/6 resend] b44: add parameter

2006-06-20 Thread Gary Zambrano
This patch adds a parameter to b44_init_hw() so that the nic is not
completely initialized when it is being set up for wol.

Signed-off-by: Gary Zambrano <[EMAIL PROTECTED]>

diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index 73ca729..12fc67a 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -101,7 +101,7 @@ MODULE_DEVICE_TABLE(pci, b44_pci_tbl);
 
 static void b44_halt(struct b44 *);
 static void b44_init_rings(struct b44 *);
-static void b44_init_hw(struct b44 *);
+static void b44_init_hw(struct b44 *, int);
 
 static int dma_desc_align_mask;
 static int dma_desc_sync_size;
@@ -873,7 +873,7 @@ static int b44_poll(struct net_device *n
spin_lock_irq(&bp->lock);
b44_halt(bp);
b44_init_rings(bp);
-   b44_init_hw(bp);
+   b44_init_hw(bp, 1);
netif_wake_queue(bp->dev);
spin_unlock_irq(&bp->lock);
done = 1;
@@ -942,7 +942,7 @@ static void b44_tx_timeout(struct net_de
 
b44_halt(bp);
b44_init_rings(bp);
-   b44_init_hw(bp);
+   b44_init_hw(bp, 1);
 
spin_unlock_irq(&bp->lock);
 
@@ -1059,7 +1059,7 @@ static int b44_change_mtu(struct net_dev
b44_halt(bp);
dev->mtu = new_mtu;
b44_init_rings(bp);
-   b44_init_hw(bp);
+   b44_init_hw(bp, 1);
spin_unlock_irq(&bp->lock);
 
b44_enable_ints(bp);
@@ -1356,13 +1356,15 @@ static int b44_set_mac_addr(struct net_d
  * packet processing.  Invoked with bp->lock held.
  */
 static void __b44_set_rx_mode(struct net_device *);
-static void b44_init_hw(struct b44 *bp)
+static void b44_init_hw(struct b44 *bp, int full_reset)
 {
 	u32 val;
 
 	b44_chip_reset(bp);
-	b44_phy_reset(bp);
-	b44_setup_phy(bp);
+	if (full_reset) {
+		b44_phy_reset(bp);
+		b44_setup_phy(bp);
+	}
 
 	/* Enable CRC32, set proper LED modes and power on PHY */
 	bw32(bp, B44_MAC_CTRL, MAC_CTRL_CRC32_ENAB | MAC_CTRL_PHY_LEDCTRL);
@@ -1376,16 +1378,21 @@ static void b44_init_hw(struct b44 *bp)
 	bw32(bp, B44_TXMAXLEN, bp->dev->mtu + ETH_HLEN + 8 + RX_HEADER_LEN);
 
 	bw32(bp, B44_TX_WMARK, 56); /* XXX magic */
-	bw32(bp, B44_DMATX_CTRL, DMATX_CTRL_ENABLE);
-	bw32(bp, B44_DMATX_ADDR, bp->tx_ring_dma + bp->dma_offset);
-	bw32(bp, B44_DMARX_CTRL, (DMARX_CTRL_ENABLE |
-			      (bp->rx_offset << DMARX_CTRL_ROSHIFT)));
-	bw32(bp, B44_DMARX_ADDR, bp->rx_ring_dma + bp->dma_offset);
+	if (full_reset) {
+		bw32(bp, B44_DMATX_CTRL, DMATX_CTRL_ENABLE);
+		bw32(bp, B44_DMATX_ADDR, bp->tx_ring_dma + bp->dma_offset);
+		bw32(bp, B44_DMARX_CTRL, (DMARX_CTRL_ENABLE |
+				      (bp->rx_offset << DMARX_CTRL_ROSHIFT)));
+		bw32(bp, B44_DMARX_ADDR, bp->rx_ring_dma + bp->dma_offset);
 
-	bw32(bp, B44_DMARX_PTR, bp->rx_pending);
-	bp->rx_prod = bp->rx_pending;
+		bw32(bp, B44_DMARX_PTR, bp->rx_pending);
+		bp->rx_prod = bp->rx_pending;
 
-	bw32(bp, B44_MIB_CTRL, MIB_CTRL_CLR_ON_READ);
+		bw32(bp, B44_MIB_CTRL, MIB_CTRL_CLR_ON_READ);
+	} else {
+		bw32(bp, B44_DMARX_CTRL, (DMARX_CTRL_ENABLE |
+				      (bp->rx_offset << DMARX_CTRL_ROSHIFT)));
+	}
 
 	val = br32(bp, B44_ENET_CTRL);
 	bw32(bp, B44_ENET_CTRL, (val | ENET_CTRL_ENABLE));
@@ -1401,7 +1408,7 @@ static int b44_open(struct net_device *d
goto out;
 
b44_init_rings(bp);
-   b44_init_hw(bp);
+   b44_init_hw(bp, 1);
 
b44_check_phy(bp);
 
@@ -1511,7 +1518,7 @@ static int b44_close(struct net_device *
netif_poll_enable(dev);
 
if (bp->flags & B44_FLAG_WOL_ENABLE) {
-   b44_init_hw(bp);
+   b44_init_hw(bp, 0);
b44_setup_wol(bp);
}
 
@@ -1786,7 +1793,7 @@ static int b44_set_ringparam(struct net_
 
b44_halt(bp);
b44_init_rings(bp);
-   b44_init_hw(bp);
+   b44_init_hw(bp, 1);
netif_wake_queue(bp->dev);
spin_unlock_irq(&bp->lock);
 
@@ -1829,7 +1836,7 @@ static int b44_set_pauseparam(struct net
if (bp->flags & B44_FLAG_PAUSE_AUTO) {
b44_halt(bp);
b44_init_rings(bp);
-   b44_init_hw(bp);
+   b44_init_hw(bp, 1);
} else {
__b44_set_flow_ctrl(bp, bp->flags);
}
@@ -2188,7 +2195,7 @@ static int b44_suspend(struct pci_dev *p
 
free_irq(dev->irq, dev);
if (bp->flags & B44_FLAG_WOL_ENABLE) {
-   b44_init_hw(bp);
+   b44_init_hw(bp, 0);
b44_setup_wol(bp);
}
pci_disable_device(pdev);
@@ -2213,7 +2220,7 @@ static int b44_resume(struct pci_dev *pd
spin_lock_irq(&bp->lock);
 
b44_init_rings(bp);
-   b44_init_hw(bp);
+   b44_init_hw(bp, 1);
netif_device_atta

[PATCH 6/6 resend] b44: update b44 Kconfig entry

2006-06-20 Thread Gary Zambrano
Deleted "EXPERIMENTAL" from b44 entry in Kconfig.

Signed-off-by: Gary Zambrano <[EMAIL PROTECTED]>

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index bdaaad8..4e57785 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -1359,8 +1359,8 @@ config APRICOT
  called apricot.
 
 config B44
-   tristate "Broadcom 4400 ethernet support (EXPERIMENTAL)"
-   depends on NET_PCI && PCI && EXPERIMENTAL
+   tristate "Broadcom 4400 ethernet support"
+   depends on NET_PCI && PCI
select MII
help
  If you have a network (Ethernet) controller of this type, say Y and





[PATCH 1/6 resend] b44: fix manual speed/duplex/autoneg settings

2006-06-20 Thread Gary Zambrano
Fixes for speed/duplex/autoneg settings and driver settings info.
This is a redo of a previous patch thanks to feedback from Jeff Garzik.

Signed-off-by: Gary Zambrano <[EMAIL PROTECTED]>

diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index d8233e0..41b1618 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -1620,8 +1620,6 @@ static int b44_get_settings(struct net_d
 {
struct b44 *bp = netdev_priv(dev);
 
-   if (!netif_running(dev))
-   return -EAGAIN;
cmd->supported = (SUPPORTED_Autoneg);
cmd->supported |= (SUPPORTED_100baseT_Half |
  SUPPORTED_100baseT_Full |
@@ -1649,6 +1647,12 @@ static int b44_get_settings(struct net_d
 		XCVR_INTERNAL : XCVR_EXTERNAL;
 	cmd->autoneg = (bp->flags & B44_FLAG_FORCE_LINK) ?
 		AUTONEG_DISABLE : AUTONEG_ENABLE;
+	if (cmd->autoneg == AUTONEG_ENABLE)
+		cmd->advertising |= ADVERTISED_Autoneg;
+	if (!netif_running(dev)){
+		cmd->speed = 0;
+		cmd->duplex = 0xff;
+	}
 	cmd->maxtxpkt = 0;
 	cmd->maxrxpkt = 0;
 	return 0;
@@ -1658,9 +1662,6 @@ static int b44_set_settings(struct net_d
 {
struct b44 *bp = netdev_priv(dev);
 
-   if (!netif_running(dev))
-   return -EAGAIN;
-
/* We do not support gigabit. */
if (cmd->autoneg == AUTONEG_ENABLE) {
if (cmd->advertising &
@@ -1677,28 +1678,39 @@ static int b44_set_settings(struct net_d
 	spin_lock_irq(&bp->lock);
 
 	if (cmd->autoneg == AUTONEG_ENABLE) {
-		bp->flags &= ~B44_FLAG_FORCE_LINK;
-		bp->flags &= ~(B44_FLAG_ADV_10HALF |
+		bp->flags &= ~(B44_FLAG_FORCE_LINK |
+			       B44_FLAG_100_BASE_T |
+			       B44_FLAG_FULL_DUPLEX |
+			       B44_FLAG_ADV_10HALF |
 			       B44_FLAG_ADV_10FULL |
 			       B44_FLAG_ADV_100HALF |
 			       B44_FLAG_ADV_100FULL);
-		if (cmd->advertising & ADVERTISE_10HALF)
-			bp->flags |= B44_FLAG_ADV_10HALF;
-		if (cmd->advertising & ADVERTISE_10FULL)
-			bp->flags |= B44_FLAG_ADV_10FULL;
-		if (cmd->advertising & ADVERTISE_100HALF)
-			bp->flags |= B44_FLAG_ADV_100HALF;
-		if (cmd->advertising & ADVERTISE_100FULL)
-			bp->flags |= B44_FLAG_ADV_100FULL;
+		if (cmd->advertising == 0) {
+			bp->flags |= (B44_FLAG_ADV_10HALF |
+				      B44_FLAG_ADV_10FULL |
+				      B44_FLAG_ADV_100HALF |
+				      B44_FLAG_ADV_100FULL);
+		} else {
+			if (cmd->advertising & ADVERTISED_10baseT_Half)
+				bp->flags |= B44_FLAG_ADV_10HALF;
+			if (cmd->advertising & ADVERTISED_10baseT_Full)
+				bp->flags |= B44_FLAG_ADV_10FULL;
+			if (cmd->advertising & ADVERTISED_100baseT_Half)
+				bp->flags |= B44_FLAG_ADV_100HALF;
+			if (cmd->advertising & ADVERTISED_100baseT_Full)
+				bp->flags |= B44_FLAG_ADV_100FULL;
+		}
 	} else {
 		bp->flags |= B44_FLAG_FORCE_LINK;
+		bp->flags &= ~(B44_FLAG_100_BASE_T | B44_FLAG_FULL_DUPLEX);
 		if (cmd->speed == SPEED_100)
 			bp->flags |= B44_FLAG_100_BASE_T;
 		if (cmd->duplex == DUPLEX_FULL)
 			bp->flags |= B44_FLAG_FULL_DUPLEX;
 	}
 
-	b44_setup_phy(bp);
+	if (netif_running(dev))
+		b44_setup_phy(bp);
 
 	spin_unlock_irq(&bp->lock);
 




Re: [PATCH 1/6] b44: fix manual speed/duplex/autoneg settings

2006-06-20 Thread Gary Zambrano
On Tue, 2006-06-20 at 04:42 -0400, Jeff Garzik wrote:

> ACK patches 1-6, but unfortunately failed to apply against latest 
> linux-2.6.git:
> 
> > [EMAIL PROTECTED] netdev-2.6]$ git-applymbox /g/tmp/mbox ~/info/signoff.txt
> > 6 patch(es) to process.
> > 
> > Applying 'b44: fix manual speed/duplex/autoneg settings'
> > 
> > fatal: corrupt patch at line 8

Sorry about that.
They apply fine with patch(1); the git apply failure appears to be
caused by my running git-stripspace on the patches before submitting
them.

I am resending patches that have not been through git-stripspace, so
the resent patches should not have the apply problem.

Thanks.



[PATCH 2/6 resend] b44: add wol

2006-06-20 Thread Gary Zambrano
Adds wol to the driver.
This is a redo of a previous patch thanks to feedback from Francois Romieu.
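
As a rough illustration of the address programming done in
b44_setup_wol(), the six MAC bytes are split into a 16-bit high part
(bytes 0-1) and a 32-bit low part (bytes 2-5).  The sketch below is
stand-alone user-space C; mac_hi and mac_lo are hypothetical helper
names, and the explicit casts to uint32_t are an addition here so that
the shifts are done in unsigned arithmetic.

```c
#include <stdint.h>

/* Low word: MAC bytes 2..5, byte 2 in the most significant position. */
static uint32_t mac_lo(const uint8_t mac[6])
{
	return (uint32_t)mac[2] << 24 | (uint32_t)mac[3] << 16 |
	       (uint32_t)mac[4] << 8  | (uint32_t)mac[5];
}

/* High word: MAC bytes 0..1 in the low 16 bits. */
static uint32_t mac_hi(const uint8_t mac[6])
{
	return (uint32_t)mac[0] << 8 | (uint32_t)mac[1];
}
```

For the address 00:10:18:ab:cd:ef this gives a high word of 0x0010 and
a low word of 0x18abcdef.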

Signed-off-by: Gary Zambrano <[EMAIL PROTECTED]>

diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index 41b1618..81f434e 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -1450,6 +1450,41 @@ static void b44_poll_controller(struct n
 }
 #endif
 
+
+static void b44_setup_wol(struct b44 *bp)
+{
+	u32 val;
+	u16 pmval;
+
+	bw32(bp, B44_RXCONFIG, RXCONFIG_ALLMULTI);
+
+	if (bp->flags & B44_FLAG_B0_ANDLATER) {
+
+		bw32(bp, B44_WKUP_LEN, WKUP_LEN_DISABLE);
+
+		val = bp->dev->dev_addr[2] << 24 |
+			bp->dev->dev_addr[3] << 16 |
+			bp->dev->dev_addr[4] << 8 |
+			bp->dev->dev_addr[5];
+		bw32(bp, B44_ADDR_LO, val);
+
+		val = bp->dev->dev_addr[0] << 8 |
+			bp->dev->dev_addr[1];
+		bw32(bp, B44_ADDR_HI, val);
+
+		val = br32(bp, B44_DEVCTRL);
+		bw32(bp, B44_DEVCTRL, val | DEVCTRL_MPM | DEVCTRL_PFE);
+
+	}
+
+	val = br32(bp, B44_SBTMSLOW);
+	bw32(bp, B44_SBTMSLOW, val | SBTMSLOW_PE);
+
+	pci_read_config_word(bp->pdev, SSB_PMCSR, &pmval);
+	pci_write_config_word(bp->pdev, SSB_PMCSR, pmval | SSB_PE);
+
+}
+
 static int b44_close(struct net_device *dev)
 {
struct b44 *bp = netdev_priv(dev);
@@ -1475,6 +1510,11 @@ static int b44_close(struct net_device *
 
netif_poll_enable(dev);
 
+   if (bp->flags & B44_FLAG_WOL_ENABLE) {
+   b44_init_hw(bp);
+   b44_setup_wol(bp);
+   }
+
b44_free_consistent(bp);
 
return 0;
@@ -1831,12 +1871,40 @@ static void b44_get_ethtool_stats(struct
spin_unlock_irq(&bp->lock);
 }
 
+static void b44_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
+{
+	struct b44 *bp = netdev_priv(dev);
+
+	wol->supported = WAKE_MAGIC;
+	if (bp->flags & B44_FLAG_WOL_ENABLE)
+		wol->wolopts = WAKE_MAGIC;
+	else
+		wol->wolopts = 0;
+	memset(&wol->sopass, 0, sizeof(wol->sopass));
+}
+
+static int b44_set_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
+{
+	struct b44 *bp = netdev_priv(dev);
+
+	spin_lock_irq(&bp->lock);
+	if (wol->wolopts & WAKE_MAGIC)
+		bp->flags |= B44_FLAG_WOL_ENABLE;
+	else
+		bp->flags &= ~B44_FLAG_WOL_ENABLE;
+	spin_unlock_irq(&bp->lock);
+
+	return 0;
+}
+
 static struct ethtool_ops b44_ethtool_ops = {
 	.get_drvinfo		= b44_get_drvinfo,
 	.get_settings		= b44_get_settings,
 	.set_settings		= b44_set_settings,
 	.nway_reset		= b44_nway_reset,
 	.get_link		= ethtool_op_get_link,
+	.get_wol		= b44_get_wol,
+	.set_wol		= b44_set_wol,
 	.get_ringparam		= b44_get_ringparam,
 	.set_ringparam		= b44_set_ringparam,
 	.get_pauseparam		= b44_get_pauseparam,
@@ -1915,6 +1983,10 @@ static int __devinit b44_get_invariants(
/* XXX - really required?
   bp->flags |= B44_FLAG_BUGGY_TXPTR;
  */
+
+   if (ssb_get_core_rev(bp) >= 7)
+   bp->flags |= B44_FLAG_B0_ANDLATER;
+
 out:
return err;
 }
@@ -2115,6 +2187,10 @@ static int b44_suspend(struct pci_dev *p
spin_unlock_irq(&bp->lock);
 
free_irq(dev->irq, dev);
+   if (bp->flags & B44_FLAG_WOL_ENABLE) {
+   b44_init_hw(bp);
+   b44_setup_wol(bp);
+   }
pci_disable_device(pdev);
return 0;
 }
diff --git a/drivers/net/b44.h b/drivers/net/b44.h
index b178662..1f4 100644
--- a/drivers/net/b44.h
+++ b/drivers/net/b44.h
@@ -264,6 +264,8 @@
 #define  SBIDHIGH_VC_SHIFT 16
 
 /* SSB PCI config space registers.  */
+#define SSB_PMCSR		0x44
+#define  SSB_PE			0x100
 #define	SSB_BAR0_WIN		0x80
 #define	SSB_BAR1_WIN		0x84
 #define	SSB_SPROM_CONTROL	0x88
@@ -420,6 +422,7 @@ struct b44 {
 
u32 dma_offset;
u32 flags;
+#define B44_FLAG_B0_ANDLATER	0x00000001
 #define B44_FLAG_BUGGY_TXPTR	0x00000002
 #define B44_FLAG_REORDER_BUG	0x00000004
 #define B44_FLAG_PAUSE_AUTO	0x00008000
@@ -435,6 +438,7 @@ struct b44 {
 #define B44_FLAG_INTERNAL_PHY	0x10000000
 #define B44_FLAG_RX_RING_HACK	0x20000000
 #define B44_FLAG_TX_RING_HACK	0x40000000
+#define B44_FLAG_WOL_ENABLE	0x80000000
 
u32 rx_offset;
 




Re: [RFC] [patch 0/6] [Network namespace] introduction

2006-06-20 Thread Daniel Lezcano

Al Viro wrote:
> On Tue, Jun 20, 2006 at 11:21:43PM +0200, Daniel Lezcano wrote:
>> Al Viro wrote:
>>> On Fri, Jun 09, 2006 at 11:02:02PM +0200, [EMAIL PROTECTED] wrote:
>>> - renaming an interface in one "namespace" affects everyone.
>>
>> Exact. If we ensure the interface can't be renamed if used in different
>> namespace, is it really a problem ?
>
> You _still_ have a single namespace; look in /sys/class/net and you'll see.

Yes, that's right. The network device namespaces are not yet
implemented. There are potentially some conflicts with /proc and sysfs,
but we will address them in the future.

BTW, do you have any ideas on how to handle these conflicts?




Re: [LARTC] Re: [PATCH 0/2] Runtime configuration of HTB's HYSTERESIS option

2006-06-20 Thread Russell Stuart
On Thu, 2006-06-15 at 11:49 +0200, Martin Devera wrote:
> At time of HTB implementation I needed to reach 100MBit speed on 
> relatively slow box. The hysteresis was a way. On other side I used 
> hand-made TSC based measure tool to compute exact (15%) performance 
> gain. Today I'd measure it using oprofile.
> 
> When rethinking it again I'd suggest to re-measure real performance 
> impact for both flat and deep class hierarchy and consider switching the 
> hysteresis off by default (or even to remove the code if the gain is 
> negligible). If it is the case then it is the cleanest solution IMHO.

I attended LCA 2006 this year.  There was a presentation by
a group in New Zealand using Debian running on an embedded
box to bring the Internet to rural communities.  Some of
these communities didn't have power or telephone, so the
setup ran over 802.11 over distances of up to 23 km using
solar cells for power.  I don't recall exactly, but I think
the embedded box was using a 486 equivalent.  I think they
had around 40 of these things up and going.

The point of the story is there are people out there who
use Linux on small processors, and often do imaginative 
things with them.  We would be doing them a disservice by 
ripping out the code.

> On other side I see no problem with attached patches. Have you tested 
> patched kernel with old "tc" tool ?

Yes.



Locking validator output on DCCP

2006-06-20 Thread Ian McDonald

Folks,

I am getting this when I am using DCCP with 2.6.17-rc6-mm2 with Ingo's
lock dependency patch:

Jun 21 09:38:58 localhost kernel: [  102.068588]
Jun 21 09:38:58 localhost kernel: [  102.068592] =============================================
Jun 21 09:38:58 localhost kernel: [  102.068602] [ INFO: possible recursive locking detected ]
Jun 21 09:38:58 localhost kernel: [  102.068608] ---------------------------------------------
Jun 21 09:38:58 localhost kernel: [  102.068615] idle/0 is trying to acquire lock:
Jun 21 09:38:58 localhost kernel: [  102.068620]  (&sk->sk_lock.slock#3){-+..}, at: [] sk_clone+0x5a/0x190
Jun 21 09:38:58 localhost kernel: [  102.068644]
Jun 21 09:38:58 localhost kernel: [  102.068646] but task is already holding lock:
Jun 21 09:38:58 localhost kernel: [  102.068651]  (&sk->sk_lock.slock#3){-+..}, at: [] sk_receive_skb+0xe6/0xfe
Jun 21 09:38:58 localhost kernel: [  102.068668]
Jun 21 09:38:58 localhost kernel: [  102.068670] other info that might help us debug this:
Jun 21 09:38:58 localhost kernel: [  102.068676] 2 locks held by idle/0:
Jun 21 09:38:58 localhost kernel: [  102.068679]  #0:  (&tp->rx_lock){-+..}, at: [] rtl8139_poll+0x42/0x41c [8139too]
Jun 21 09:38:58 localhost kernel: [  102.068722]  #1:  (&sk->sk_lock.slock#3){-+..}, at: [] sk_receive_skb+0xe6/0xfe
Jun 21 09:38:58 localhost kernel: [  102.068739]
Jun 21 09:38:58 localhost kernel: [  102.068741] stack backtrace:
Jun 21 09:38:58 localhost kernel: [  102.069053]  [] show_trace_log_lvl+0x53/0xff
Jun 21 09:38:58 localhost kernel: [  102.069091]  [] show_trace+0x16/0x19
Jun 21 09:38:58 localhost kernel: [  102.069121]  [] dump_stack+0x1a/0x1f
Jun 21 09:38:58 localhost kernel: [  102.069151]  [] __lock_acquire+0x8e6/0x902
Jun 21 09:38:58 localhost kernel: [  102.069363]  [] lock_acquire+0x4e/0x66
Jun 21 09:38:58 localhost kernel: [  102.069562]  [] _spin_lock+0x24/0x32
Jun 21 09:38:58 localhost kernel: [  102.069777]  [] sk_clone+0x5a/0x190
Jun 21 09:38:58 localhost kernel: [  102.071244]  [] inet_csk_clone+0xf/0x67
Jun 21 09:38:58 localhost kernel: [  102.072932]  [] dccp_create_openreq_child+0x17/0x2fe [dccp]
Jun 21 09:38:58 localhost kernel: [  102.072993]  [] dccp_v4_request_recv_sock+0x47/0x260 [dccp_ipv4]
Jun 21 09:38:58 localhost kernel: [  102.073020]  [] dccp_check_req+0x128/0x264 [dccp]
Jun 21 09:38:58 localhost kernel: [  102.073049]  [] dccp_v4_do_rcv+0x74/0x290 [dccp_ipv4]
Jun 21 09:38:58 localhost kernel: [  102.073067]  [] sk_receive_skb+0x6b/0xfe
Jun 21 09:38:58 localhost kernel: [  102.074607]  [] dccp_v4_rcv+0x4ea/0x66e [dccp_ipv4]
Jun 21 09:38:58 localhost kernel: [  102.074651]  [] ip_local_deliver+0x159/0x1f1
Jun 21 09:38:58 localhost kernel: [  102.076322]  [] ip_rcv+0x3e9/0x416
Jun 21 09:38:58 localhost kernel: [  102.077995]  [] netif_receive_skb+0x287/0x317
Jun 21 09:38:58 localhost kernel: [  102.079562]  [] rtl8139_poll+0x294/0x41c [8139too]
Jun 21 09:38:58 localhost kernel: [  102.079610]  [] net_rx_action+0x8b/0x17c
Jun 21 09:38:58 localhost kernel: [  102.081181]  [] __do_softirq+0x54/0xb3
Jun 21 09:38:58 localhost kernel: [  102.081357]  [] do_softirq+0x2f/0x47
Jun 21 09:38:58 localhost kernel: [  102.081482]  [] irq_exit+0x39/0x46
Jun 21 09:38:58 localhost kernel: [  102.081608]  [] do_IRQ+0x77/0x84
Jun 21 09:38:58 localhost kernel: [  102.081644]  [] common_interrupt+0x25/0x2c
Jun 21 09:38:58 localhost kernel: [  154.463644] CCID: Registered CCID 3 (ccid3)

The code of sk_clone (net/core/sock.c) is:

struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
{
	struct sock *newsk = sk_alloc(sk->sk_family, priority, sk->sk_prot, 0);

	if (newsk != NULL) {
		struct sk_filter *filter;

		memcpy(newsk, sk, sk->sk_prot->obj_size);

		/* SANITY */
		sk_node_init(&newsk->sk_node);
		sock_lock_init(newsk);

The relevant code is the sock_lock_init

The code of sk_receive_skb (net/core/sock.c) is:

int sk_receive_skb(struct sock *sk, struct sk_buff *skb)
{
	int rc = NET_RX_SUCCESS;

	if (sk_filter(sk, skb, 0))
		goto discard_and_relse;

	skb->dev = NULL;

	bh_lock_sock(sk);

The relevant code is the bh_lock_sock.

As I read this, it is not a recursive lock: sk_clone runs second and
is actually creating a new socket, so the two calls are locking
different sockets.

Can someone tell me whether I am correct in my thinking or not? If I
am then I will work out how to tell the lock validator not to worry
about it.
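
One way to picture the report: assuming the validator tracks lock
classes rather than individual lock instances, two sockets whose locks
are initialised at the same place look like the same lock to it.  The
toy below is not the validator's code and uses made-up names (toy_lock,
toy_task, toy_acquire); it only shows how a class-based check can flag a
second acquisition even though the two instances differ.

```c
#include <stddef.h>

#define MAX_HELD 8

struct toy_lock {
	int class_id;	/* shared by all locks initialised at one site */
};

struct toy_task {
	const struct toy_lock *held[MAX_HELD];
	size_t depth;
};

/* Record the acquisition and return 1 if a lock of the same class is
 * already held (what a class-based checker would report), else 0. */
static int toy_acquire(struct toy_task *t, const struct toy_lock *l)
{
	size_t i;
	int same_class = 0;

	for (i = 0; i < t->depth; i++)
		if (t->held[i]->class_id == l->class_id)
			same_class = 1;
	if (t->depth < MAX_HELD)
		t->held[t->depth++] = l;
	return same_class;
}
```

Here a parent and a freshly cloned child socket lock would carry the
same class_id, so taking the child's lock while the parent's is held
returns 1 although the two are distinct objects.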

Thanks,

Ian
--
Ian McDonald
Web: http://wand.net.nz/~iam4
Blog: http://imcdnzl.blogspot.com
WAND Network Research Group
Department of Computer Science
University of Waikato
New Zealand


DF, IP ID always 0 and the reassembly protections

2006-06-20 Thread Rick Jones
A while back (I cannot recall exactly when) the issue of always setting
the IP datagram ID to zero when the DF bit was set was brought up.  I
suggested it might not be a good idea because there are admittedly
broken devices out there that "helpfully" and silently clear DF and then
start to fragment.  The counterpoint was that coding around such broken
hardware was silly.


I was just writing a missive to one of my co-workers on IP
fragmentation.  It got me thinking about the stuff (I think it
went in?) to try to protect against "Frankengrams" during IP fragment
reassembly.


Doesn't that mechanism rely on watching the IP IDs between the pair of
IPs?  For both fragmented and non-fragmented datagrams? If so, does 
always setting the IP ID to zero when DF is set affect that mechanism?
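
For context, IP reassembly matches fragments on the tuple of source
address, destination address, protocol and IP ID (RFC 791).  The sketch
below is illustrative user-space C with made-up names (frag_key,
frag_key_equal), not the kernel's code.  It shows that if a middlebox
fragments two different DF datagrams that both carry ID 0, their
fragments end up with the same key.

```c
#include <stdint.h>

/* The fields a receiver uses to decide which fragments belong together. */
struct frag_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t id;
	uint8_t  protocol;
};

/* Return 1 if two fragments would be fed to the same reassembly queue. */
static int frag_key_equal(const struct frag_key *a, const struct frag_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->id == b->id && a->protocol == b->protocol;
}
```

With distinct IDs the two datagrams stay apart; with ID 0 on both they
cannot be told apart by the key alone.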


rick jones


Re: [2/5] [NET]: Add generic segmentation offload

2006-06-20 Thread Herbert Xu
On Tue, Jun 20, 2006 at 10:54:48AM -0700, Michael Chan wrote:
> 
> I think you need !illegal_highdma(skb->dev, skb)

Thanks for catching this.  You can tell that I don't have HIGHMEM :)
Here is the fixed version:

[NET]: Add generic segmentation offload

This patch adds the infrastructure for generic segmentation offload.
The idea is to tap into the potential savings of TSO without hardware
support by postponing the allocation of segmented skb's until just
before the entry point into the NIC driver.

The same structure can be used to support software IPv6 TSO, as well as
UFO and segmentation offload for other relevant protocols, e.g., DCCP.
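
For readers new to the idea, the late split can be pictured with the
user-space sketch below.  It is only an illustration under assumed names
(toy_seg, toy_gso_segment) and works on plain lengths, not on skbs: one
large payload is carried down the stack as a single unit and is cut into
MSS-sized pieces only at the last step.

```c
#include <stddef.h>

/* One output segment, described as a slice of the original payload. */
struct toy_seg {
    size_t offset;
    size_t len;
};

/*
 * Cut a payload of payload_len bytes into pieces of at most mss bytes.
 * Fills segs[] with up to max_segs entries and returns how many were
 * written.  Returns 0 when mss is 0.
 */
static size_t toy_gso_segment(size_t payload_len, size_t mss,
                              struct toy_seg *segs, size_t max_segs)
{
    size_t n = 0, off = 0;

    if (mss == 0)
        return 0;

    while (off < payload_len && n < max_segs) {
        size_t chunk = payload_len - off;

        if (chunk > mss)
            chunk = mss;
        segs[n].offset = off;
        segs[n].len = chunk;
        off += chunk;
        n++;
    }
    return n;
}
```

For example, a 4000-byte payload with an MSS of 1460 yields three
segments, the last one 1080 bytes long.  The per-protocol gso_segment
hook in the patch additionally has to build headers for each piece,
which this sketch leaves out.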

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -406,6 +406,9 @@ struct net_device
    struct list_head    qdisc_list;
    unsigned long       tx_queue_len;   /* Max frames per queue allowed */
 
+   /* Partially transmitted GSO packet. */
+   struct sk_buff  *gso_skb;
+
    /* ingress path synchronizer */
    spinlock_t          ingress_lock;
    struct Qdisc        *qdisc_ingress;
@@ -540,6 +543,7 @@ struct packet_type {
 struct net_device *,
 struct packet_type *,
 struct net_device *);
+   struct sk_buff  *(*gso_segment)(struct sk_buff *skb, int sg);
    void                *af_packet_priv;
    struct list_head    list;
 };
@@ -690,7 +694,8 @@ extern int  dev_change_name(struct net_d
 extern int dev_set_mtu(struct net_device *, int);
 extern int dev_set_mac_address(struct net_device *,
                                struct sockaddr *);
-extern void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev);
+extern int dev_hard_start_xmit(struct sk_buff *skb,
+   struct net_device *dev);
 
 extern void dev_init(void);
 
@@ -964,6 +969,7 @@ extern int  netdev_max_backlog;
 extern int weight_p;
 extern int netdev_set_master(struct net_device *dev, struct net_device *master);
 extern int skb_checksum_help(struct sk_buff *skb, int inward);
+extern struct sk_buff *skb_gso_segment(struct sk_buff *skb, int sg);
 #ifdef CONFIG_BUG
 extern void netdev_rx_csum_fault(struct net_device *dev);
 #else
diff --git a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -116,6 +116,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * The list of packet types we will receive (as opposed to discard)
@@ -1048,7 +1049,7 @@ static inline void net_timestamp(struct 
  * taps currently in use.
  */
 
-void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
+static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 {
struct packet_type *ptype;
 
@@ -1186,6 +1187,40 @@ out: 
return ret;
 }
 
+/**
+ * skb_gso_segment - Perform segmentation on skb.
+ * @skb: buffer to segment
+ * @sg: whether scatter-gather is supported on the target.
+ *
+ * This function segments the given skb and returns a list of segments.
+ */
+struct sk_buff *skb_gso_segment(struct sk_buff *skb, int sg)
+{
+   struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);
+   struct packet_type *ptype;
+   int type = skb->protocol;
+
+   BUG_ON(skb_shinfo(skb)->frag_list);
+   BUG_ON(skb->ip_summed != CHECKSUM_HW);
+
+   skb->mac.raw = skb->data;
+   skb->mac_len = skb->nh.raw - skb->data;
+   __skb_pull(skb, skb->mac_len);
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & 15], list) {
+       if (ptype->type == type && !ptype->dev && ptype->gso_segment) {
+           segs = ptype->gso_segment(skb, sg);
+           break;
+       }
+   }
+   rcu_read_unlock();
+
+   return segs;
+}
+
+EXPORT_SYMBOL(skb_gso_segment);
+
 /* Take action when hardware reception checksum errors are detected. */
 #ifdef CONFIG_BUG
 void netdev_rx_csum_fault(struct net_device *dev)
@@ -1222,6 +1257,86 @@ static inline int illegal_highdma(struct
 #define illegal_highdma(dev, skb)  (0)
 #endif
 
+struct dev_gso_cb {
+   void (*destructor)(struct sk_buff *skb);
+};
+
+#define DEV_GSO_CB(skb) ((struct dev_gso_cb *)(skb)->cb)
+
+static void dev_gso_skb_destructor(struct sk_buff *skb)
+{
+   struct dev_gso_cb *cb;
+
+   do {
+   struct sk_buff *nskb = skb->next;
+
+   skb->next = nskb->next;
+   

[NET]: Make illegal_highdma more anal

2006-06-20 Thread Herbert Xu
Hi:

This patch should prevent mistakes like the one I made earlier.

[NET]: Make illegal_highdma more anal

Rather than having illegal_highdma as a macro when HIGHMEM is off, we
can turn it into an inline function that returns zero.  This will catch
callers that give it bad arguments.
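
A small stand-alone sketch of the difference, using made-up types
(demo_dev, demo_skb) and made-up names for the two variants rather than
the real kernel definitions: the macro form throws its arguments away
before the compiler ever sees them, while the inline form has them
type-checked even though it also returns zero.

```c
/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct demo_dev { unsigned long features; };
struct demo_skb { int nr_frags; };

/* Macro variant: the arguments vanish during preprocessing. */
#define illegal_highdma_macro(dev, skb) (0)

/* Inline variant: same result, but the arguments are type-checked. */
static inline int illegal_highdma_inline(struct demo_dev *dev,
                                         struct demo_skb *skb)
{
    (void)dev;
    (void)skb;
    return 0;
}

static int check_both(void)
{
    struct demo_dev dev = { 0 };
    struct demo_skb skb = { 0 };

    /* Compiles although neither identifier is declared anywhere. */
    int a = illegal_highdma_macro(no_such_variable, also_missing);

    /*
     * Swapping the arguments, illegal_highdma_inline(&skb, &dev),
     * would draw an incompatible-pointer diagnostic from the compiler.
     */
    int b = illegal_highdma_inline(&dev, &skb);

    return a + b;
}
```

With the inline form a wrong argument is diagnosed on every
configuration, not only on builds where the real check is compiled in.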

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/core/dev.c b/net/core/dev.c
index 9c68ab8..134d9d1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1234,7 +1234,6 @@ void netdev_rx_csum_fault(struct net_dev
 EXPORT_SYMBOL(netdev_rx_csum_fault);
 #endif
 
-#ifdef CONFIG_HIGHMEM
 /* Actually, we should eliminate this check as soon as we know, that:
  * 1. IOMMU is present and allows to map all the memory.
  * 2. No high memory really exists on this machine.
@@ -1242,6 +1241,7 @@ #ifdef CONFIG_HIGHMEM
 
 static inline int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
 {
+#ifdef CONFIG_HIGHMEM
    int i;
 
    if (dev->features & NETIF_F_HIGHDMA)
@@ -1251,11 +1251,9 @@ static inline int illegal_highdma(struct
        if (PageHighMem(skb_shinfo(skb)->frags[i].page))
            return 1;
 
+#endif
    return 0;
 }
-#else
-#define illegal_highdma(dev, skb)  (0)
-#endif
 
 struct dev_gso_cb {
    void (*destructor)(struct sk_buff *skb);


Re: [NET]: Prevent multiple qdisc runs

2006-06-20 Thread Herbert Xu
On Tue, Jun 20, 2006 at 10:42:06AM -0400, jamal wrote:
> 
> I apologize for hand-waving with % numbers above and using gut feeling
> instead of experimental facts - I dont have time to chase it. I have
> CCed Robert who may have time to see if this impacts forwarding
> performance for one. I will have more peace of mind to find out there is
> no impact.

Well my gut feeling is that multiple qdisc_run's on the same dev can't
be good for performance.  The reason is that SMP is only good when the
CPUs work on different tasks.  If you get two or more CPUs to work on
qdisc_run at the same time they can still only supply one skb to the
device at any time.  What's worse is that they will now have to fight
over the two spin locks involved which means that their cache lines
will bounce back and forth.
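
To illustrate the "only one runner per device" idea, here is a user-space
sketch with C11 atomics.  The names (toy_txq, toy_qdisc_run) are invented
for the example and this is not the kernel implementation; in particular
the queue itself is a plain counter and is not protected the way the real
one is.  A CPU that finds the flag already set simply leaves instead of
contending for the queue.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-device transmit state. */
struct toy_txq {
    atomic_bool running;    /* set while some caller is draining */
    int queued;             /* packets waiting */
    int sent;               /* packets handed to the "device" */
};

static void toy_txq_init(struct toy_txq *q, int queued)
{
    atomic_init(&q->running, false);
    q->queued = queued;
    q->sent = 0;
}

/*
 * Drain the queue unless someone else is already doing so.
 * Returns true if this caller did the work, false if it backed off.
 */
static bool toy_qdisc_run(struct toy_txq *q)
{
    /* atomic_exchange returns the previous value of the flag. */
    if (atomic_exchange(&q->running, true))
        return false;

    while (q->queued > 0) {
        q->queued--;
        q->sent++;
    }

    atomic_store(&q->running, false);
    return true;
}
```

The second caller returns at once, so it neither spins on the locks nor
pulls the queue's cache lines over to its own CPU.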

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

