date:20070606

Re: [WIP][PATCHES] Network xmit batching

2007-06-06 Thread Krishna Kumar2

My somewhat confusing netperf script (to run on client) is attached below.
Server just requires
to run netserver. Client is not completely accurate since I am not using
netperf4 (moving to
that after some initial hiccups).

thanks,

- KK

(See attached file: netperf.scp)


J Hadi Salim <[EMAIL PROTECTED]> wrote on 06/06/2007 07:19:21 PM:

> Folks,
>
> While Krishna and I have been attempting this on the side, progress has
> been rather slow - so this is to solicit for more participation so we
> can get this over with faster. Success (myself being conservative when
> it comes to performance) requires testing on a wide variety of hardware.
>
> The results look promising - certainly from a pktgen perspective where
> performance has been known in some cases to go up over 50%.
> Tests by Sridhar on a low number of TCP flows also indicate improved
> performance as well as lowered CPU use.
>
> I have setup the current state of my patches against Linus tree at:
> git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git
>
> This is also clean against 2.6.22-rc4. So if you want just a diff that
> will work against 2.6.22-rc4 - i can send it to you.
> I also have a tree against Daves net-2.6 at
> git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-net26.git
> but iam abandoning that effort until we get this stable due to the
> occasional bug that cropped up(like e1000).
>
> I am attaching a pktgen script. There is one experimental parameter
> called "batch_low" - for starters just leave it at 0 in order to reduce
> experimental variance. If you have solid results you can muck around
> with it.
> KK has a netperf script he has been using - if you know netperf your
> help will really be appreciated in testing it on your hardware.
> KK, can you please post your script?
> Testing with forwarding and bridging will also be appreaciated.
> Above that, suggestions to changes as long as they are based on
> verifiable results or glaringly obvious changes are welcome. My
> preference at the moment is to flesh out the patch as is and then
> improve on it later if it shows it has some value on a wide variety of
> apps. As the subject is indicating this is a WIP and as all eulas
> suggest "subject to change without notice".
> If you help out, when you post your results, can you please say what
> hardware and setup was?
>
> The only real driver that has been changed is e1000 for now. KK is
> working on something infiniband related and i plan (if noone beats me)
> to get tg3 working. It would be nice if someone converted some 10G
> ethernet driver.
>
> cheers,
> jamal
> [attachment "pktgen.batch-1-1" deleted by Krishna Kumar2/India/IBM]

netperf.scp
Description: Binary data

Re: extra layer below inet layer.

2007-06-06 Thread Stephen Hemminger

On Thu, 7 Jun 2007 10:10:44 +0530
"Tej Parkash" <[EMAIL PROTECTED]> wrote:

> hi
> 
> i just want to have something like tcp layer sitting below inet layer.
> so that i can directly offload everything to NIC
> and i want it to be placed dynamically. so depending on the nic i can
> offload the feature or can make it normal flow.
> i.e. both layer should exist and normal functionality should not break.
>
NIC protocol offloading is a contentious subject since it bypasses
all the security and restricts usefulness of the device for VLAN's,
bonding, bridging, etc.


-- 
Stephen Hemminger <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

extra layer below inet layer.

2007-06-06 Thread Tej Parkash


hi

i just want to have something like tcp layer sitting below inet layer.
so that i can directly offload everything to NIC
and i want it to be placed dynamically. so depending on the nic i can
offload the feature or can make it normal flow.
i.e. both layer should exist and normal functionality should not break.

this code is in af_inet.c

static struct inet_protosw inetsw_array[] =
{
  {
  .type =   SOCK_STREAM,
  .protocol =   IPPROTO_TCP,
  .prot =   &tcp_prot,
  .ops =&inet_stream_ops,
  .capability = -1,
  .no_check =   0,
  .flags =  INET_PROTOSW_PERMANENT |
INET_PROTOSW_ICSK,
  },

  {
  .type =   SOCK_DGRAM,
  .protocol =   IPPROTO_UDP,
  .prot =   &udp_prot,
  .ops =&inet_dgram_ops,
  .capability = -1,
  .no_check =   UDP_CSUM_DEFAULT,
  .flags =  INET_PROTOSW_PERMANENT,
 },


 {
 .type =   SOCK_RAW,
 .protocol =   IPPROTO_IP,/* wild card */
 .prot =   &raw_prot,
 .ops =&inet_sockraw_ops,
 .capability = CAP_NET_RAW,
 .no_check =   UDP_CSUM_DEFAULT,
 .flags =  INET_PROTOSW_REUSE,
 }
};

i was just going through the code i found that if i can regsiter above
kind of registration dynamically i will be able to do that.
i have tried to do that but since normal tcp registered this at boot
up time i am not able to unregister this dynamically

so is it possible to unregister tcp and register my protocol dynamically.

thanks
TEJ
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Failover-friendly TCP retransmission

2007-06-06 Thread noboru . obata . ar

Hi Andi,

I thank you for your comments.

Andi Kleen <[EMAIL PROTECTED]> writes:
> > Your suggestion, to utilize NET_XMIT_* code returned from an
> > underlying layer, is done in tcp_transmit_skb.
> > 
> > But my problem is that tcp_transmit_skb is not called during a
> > certain period of time.  So I'm suggesting to cap RTO value so
> > that tcp_transmit_skb gets called more frequently.
> 
> The transmit code controls the transmission timeout. Or at least
> it could change it if it really wanted.
> 
> What I wanted to say is: if the loss still happens under control
> of the sending end device and TCP knows this then it could change
> the retransmit timer to fire earlier or even just wait for an 
> event from the device that tells it to retransmit early.

I examined your suggestion to introduce some interface so that
TCP can know, or be notified, to retransmit early.

Then there are two options; pull from TCP or notify to TCP.

The first option, pulling from TCP, must be done at the
expiration of retransmission timer because there is no other
context to do this.  But if RTO is already large, this can
easily miss events or status of underlying layer, such as
symptom of failure, failover, etc.  So I give up pulling from
TCP.

The second option, notifying to TCP, seems a bit promising.
Upon such a notification, TCP may look into a timer structure to
find retransmission events, update their timers so that they
expire earlier, and possibly reset their RTO values.  Perhaps
this should be done for all TCP packets because TCP doesn't know
which packet will be sent to the device of interest at that time.

But I don't quite see if this better solves my problem.  Such
upcalls would be more complicated than capping RTO, and thus may
be error-prone and harder to maintain.  Problems might be
solvable, but I'd prefer a simpler solution.

> The problem with capping RTO is that when there is a loss
> in the network for some other reasons (and there is no reason
> bonding can't be used when talking to the internet) you
> might be too aggressive or not aggressive enough anymore
> to get the data through.

I think capping RTO is robust and better than upcalls.  The
effects of capping RTO, as follows, are small enough.

* It just makes retransmission more frequent.  Since there is
  already a fast retransmission in TCP, retransmitting earlier
  itself does not break TCP.  (I'm going to examine every
  occurrence of TCP_RTO_MAX, though.)

* In the worst case, it does not increase the total number of
  retransmission packets, which is bounded by tcp_retries2.

  Thus the final retransmission timeout comes earlier with same
  tcp_retries2.  If this is a case, one will have to raise
  tcp_retries2.

* The average case, in a certain period of time (say 60[s]), it
  may slightly increase the number of retransmission packets.

  Starting from RTO = 0.2[s], the numbers of retransmission
  packets in first 60[s] are, 8 with TCP_RTO_MAX = 120[s], and
  15 with TCP_RTO_MAX = 5[s].

  I think that increasing several packets per minute per socket
  are acceptable.

Therefore the side effects of capping RTO, even talking to the
Internet, seems to be small enough.

-- 
OBATA Noboru ([EMAIL PROTECTED])
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: warnings in git-wireless

2007-06-06 Thread Dave Jones

On Wed, Jun 06, 2007 at 06:04:21PM -0700, Andrew Morton wrote:

 > There _should_ be some #ifdeffable thing which is being passed to cpp when
 > we run sparse (but I'm not sure what it is).

#ifdef __CHECKER__

(See include/linux/compiler.h, this is how we implement __user & friends)

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread jamal

On Wed, 2007-06-06 at 16:53 -0700, David Miller wrote:

> There are of course other uses for multiple TX queues, and in
> particular my finer-grained locking example.
>
> I'm still amazed the TX locking issue wasn't clear to you,
> too nervous about tonight's game? :)

It's too depressing - so i came back here for a break ;->
I cant even stand Don Cherry today.

As a side note: You will have to do a lot of surgery to the current code
to make tx run on multi CPUs. It needs some experimenting to get right.
And i am begining to like Herberts changes ;->
I am not against multi-rings; iam just suggesting an alternative
approach which is less disruptive.

In regards to the tx lock - my thinking is resolving that via tx
batching. You amortize the lock over multiple packets. There may be
value in fine grained locking - i need to think about it. A small
extension to the batching patches will provide the change i am
proposing.

cheers,
jamal

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: warnings in git-wireless

2007-06-06 Thread Andrew Morton

On Wed, 06 Jun 2007 15:33:46 -0700 James Ketrenos <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Wed, 06 Jun 2007 13:51:41 -0700 James Ketrenos <[EMAIL PROTECTED]> wrote:
> > 
>   * make C=2 CF=-Wall will complain if you use ARRAY_SIZE on global data
>   */
>  #define GLOBAL_ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> >>> This is identical to ARRAY_SIZE.
> >>>
> >>> And if there's some problem with ARRAY_SIZE then fix ARRAY_SIZE!  Don't 
> >>> go 
> >>> off and create some private thing and leave everyone else twisting in the
> >>> wind.
> >> The code was resolving the sparse warnings.  
> > 
> ...
> > Your GLOBAL_ARRAY_SIZE() is, afaict, identical to ARRAY_SIZE().
> 
> >From include/linux/kernel.h
> 
> #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + 
> __must_be_array(arr))

o crap, sorry, I was looking at one of the other definitions of ARRAY_SIZE :(


> >From drivers/net/wireless/mac80211/iwlwifi/iwl-helpers.h
> 
> #define GLOBAL_ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> 
> The '+ __must_be_array(arr)' part of ARRAY_SIZE was causing sparse to 
> complain 
> with:
> 
> drivers/net/wireless/mac80211/iwlwifi/base.c:4646:22: error: cannot size 
> expression
> ...

OK, that's a problem in sparse, I guess.

There _should_ be some #ifdeffable thing which is being passed to cpp when
we run sparse (but I'm not sure what it is).  If there is such a thing then
we could disable the __must_be_array() thing during sparse runs.

But longer-term, sparse should be taught about __must_be_array().

> When I had run my builds, I had restricted the sparse checks to just iwlwifi 
> (vs. checking the rest of the kernel).
> 
> I just ran it C=2 CF=-Wall against the rest of the kernel and do see other 
> code 
> with the same problem, eg:
> 
> sound/core/memalloc.c:521:14: error: cannot size expression
> ...
> 
> I had erroneously thought it was just a problem with iwlwifi...
> 

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread Jeff Garzik

On Wed, Jun 06, 2007 at 04:52:15PM -0700, David Miller wrote:
> For the locking is makes a ton of sense.
> 
> If you have sendmsg() calls going on N cpus, would you rather
> they:
> 
> 1) All queue up to the single netdev->tx_lock
> 
> or
> 
> 2) All take local per-hw-queue locks
> 
> to transmit the data they are sending?
> 
> I thought this was obvious... guess not :-)

Agreed ++

For my part, I definitely want to see parallel Tx as well as parallel Rx.
It's the only thing that makes sense for modern multi-core CPUs.

Two warnings flags are raised in my brain though:

1) you need (a) well-designed hardware _and_ (b) a smart driver writer
to avoid bottlenecking on internal driver locks.  As you can see we have
both (a) and (b) for tg3 ;-)  But it's up in the air whether a
multi-TX-queue scheme can be sanely locked internally on other hardware.

At the moment we have to hope Intel gets it right in their driver...

2) I fear that the getting-it-into-the-Tx-queue part will take some
thought in order to make this happen, too.  Just like you have the
SMP/SMT/Multi-core scheduler scheduling various resources, surely we
will want some smarts so that sockets are not bouncing wildly across
CPUs, absent other factors outside our control.

Otherwise you will negate a lot of the value of the nifty multi-TX-lock
driver API, by bouncing data across CPUs on each transmit anyway.

IOW, you will have to sanely fill each of the TX queues.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2.6.22-rc4] ehea: Fixed possible kernel panic on VLAN packet recv

2007-06-06 Thread Michael Ellerman

On Wed, 2007-06-06 at 20:53 +0200, Thomas Klein wrote:
> This patch fixes a possible kernel panic due to not checking the vlan group
> when processing received VLAN packets and a malfunction in VLAN/hypervisor
> registration.
> 
> 
> Signed-off-by: Thomas Klein <[EMAIL PROTECTED]>
> ---
> 
> 
> diff -Nurp -X dontdiff linux-2.6.22-rc4/drivers/net/ehea/ehea.h 
> patched_kernel/drivers/net/ehea/ehea.h
> --- linux-2.6.22-rc4/drivers/net/ehea/ehea.h  2007-06-05 02:57:25.0 
> +0200
> +++ patched_kernel/drivers/net/ehea/ehea.h2007-06-06 12:53:58.0 
> +0200
> @@ -39,7 +39,7 @@
>  #include 
>  
>  #define DRV_NAME "ehea"
> -#define DRV_VERSION  "EHEA_0061"
> +#define DRV_VERSION  "EHEA_0064"
>  
>  #define EHEA_MSG_DEFAULT (NETIF_MSG_LINK | NETIF_MSG_TIMER \
>   | NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
> diff -Nurp -X dontdiff linux-2.6.22-rc4/drivers/net/ehea/ehea_main.c 
> patched_kernel/drivers/net/ehea/ehea_main.c
> --- linux-2.6.22-rc4/drivers/net/ehea/ehea_main.c 2007-06-05 
> 02:57:25.0 +0200
> +++ patched_kernel/drivers/net/ehea/ehea_main.c   2007-06-06 
> 12:53:58.0 +0200
> @@ -1947,7 +1945,7 @@ static void ehea_vlan_rx_add_vid(struct 
>   }
>  
>   index = (vid / 64);
> - cb1->vlan_filter[index] |= ((u64)(1 << (vid & 0x3F)));
> + cb1->vlan_filter[index] |= ((u64)(0x8000 >> (vid & 0x3F)));
>  
>   hret = ehea_h_modify_ehea_port(adapter->handle, port->logical_port_id,
>  H_PORT_CB1, H_PORT_CB1_ALL, cb1);
> @@ -1982,7 +1980,7 @@ static void ehea_vlan_rx_kill_vid(struct
>   }
>  
>   index = (vid / 64);
> - cb1->vlan_filter[index] &= ~((u64)(1 << (vid & 0x3F)));
> + cb1->vlan_filter[index] &= ~((u64)(0x8000 >> (vid & 0x3F)));

These two seem ripe for splitting into some sort of helper routine.
Which would leave only one place to get it right.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person


signature.asc
Description: This is a digitally signed message part

Re: warnings in git-wireless

2007-06-06 Thread James Ketrenos

Andrew Morton wrote:
> On Wed, 06 Jun 2007 13:51:41 -0700 James Ketrenos <[EMAIL PROTECTED]> wrote:
> 
  * make C=2 CF=-Wall will complain if you use ARRAY_SIZE on global data
  */
 #define GLOBAL_ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
>>> This is identical to ARRAY_SIZE.
>>>
>>> And if there's some problem with ARRAY_SIZE then fix ARRAY_SIZE!  Don't go 
>>> off and create some private thing and leave everyone else twisting in the
>>> wind.
>> The code was resolving the sparse warnings.  
> 
...
> Your GLOBAL_ARRAY_SIZE() is, afaict, identical to ARRAY_SIZE().

>From include/linux/kernel.h

#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))

>From drivers/net/wireless/mac80211/iwlwifi/iwl-helpers.h

#define GLOBAL_ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

The '+ __must_be_array(arr)' part of ARRAY_SIZE was causing sparse to complain 
with:

drivers/net/wireless/mac80211/iwlwifi/base.c:4646:22: error: cannot size 
expression
...

When I had run my builds, I had restricted the sparse checks to just iwlwifi 
(vs. checking the rest of the kernel).

I just ran it C=2 CF=-Wall against the rest of the kernel and do see other code 
with the same problem, eg:

sound/core/memalloc.c:521:14: error: cannot size expression
...

I had erroneously thought it was just a problem with iwlwifi...

Thanks,
James
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: warnings in git-wireless

2007-06-06 Thread Randy Dunlap

On Wed, 6 Jun 2007 16:35:34 -0700 Andrew Morton wrote:

> On Wed, 06 Jun 2007 13:51:41 -0700 James Ketrenos <[EMAIL PROTECTED]> wrote:
> 
> > 
> > >>  * make C=2 CF=-Wall will complain if you use ARRAY_SIZE on global data
> > >>  */
> > >> #define GLOBAL_ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> > > 
> > > This is identical to ARRAY_SIZE.
> > >
> > > And if there's some problem with ARRAY_SIZE then fix ARRAY_SIZE!  Don't 
> > > go 
> > > off and create some private thing and leave everyone else twisting in the
> > > wind.
> > 
> > The code was resolving the sparse warnings.  GLOBAL_ARRAY_SIZE removes the 
> > part of the ARRAY_SIZE definition that causes sparse to complain ('+ 
> > __must_be_array(arr)').  I don't know enough about how sparse works to fix 
> > sparse, or to know if its a problem with __must_be_array.  The code itself 
> > was fine -- using an array with ARRAY_SIZE.
> 
> (These 340-column emails are rather hard to reply to)
> 
> Your GLOBAL_ARRAY_SIZE() is, afaict, identical to ARRAY_SIZE().
> 
> If ARRAY_SIZE() is spitting some sparse warning then please report it and
> we'll take a look into it.

Looks like a sparse problem.  See
  http://marc.info/?l=linux-sparse&m=118105261224378&w=2


> > Agreed; iwlwifi should be using lib/hexcump.c
> 
> The hexdump.c interfaces are about to change, so I wouldn't do anything
> with this until after 2.6.23-rc1.

James, hexdump.c changes are already in 2.6.22-rc4-mm1 just in case
you want to see them.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread David Miller

From: jamal <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 19:54:47 -0400

> On Wed, 2007-06-06 at 16:48 -0700, Rick Jones wrote:
> >   > RX queues - yes, I can see;  TX queues, it doesnt make sense to put
> > > different rings on different CPUs.
> > 
> > To what extent might that preclude some cachelines bouncing hither and 
> > yon between the CPUs?
> 
> I think the bouncing will exist a lot more with the multi CPUs. But one
> would assume if you go that path, you would also parallelize the stack
> on egress to reduce such an effect. I guess the point i am not seeing is
> the value. The tx, once hitting the NIC is an IO issue not a CPU issue.

Disagred, that single TX lock kills cpu cycles.

If all of the TX queues are independantly programmable of one
another, the single TX lock kills performance.

> off for the night.

Enjoy the game.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread David Miller

From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 16:48:59 -0700

>   > RX queues - yes, I can see;  TX queues, it doesnt make sense to put
> > different rings on different CPUs.
> 
> To what extent might that preclude some cachelines bouncing hither and 
> yon between the CPUs?

I think per-TX-queue locking takes locality as another advantage.
You only touch the TX descriptors for queue N, rather than a single
globally shared one.

Same goes for RX.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread David Miller

From: jamal <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 19:35:46 -0400

> There is no potential for parallelizing on transmit that i can think of.
> Dave, please explain it slowly so i can understand it.
> 
> There is huge potential for parallelizing on receive. But i am certainly
> missing the value in the transmit.

I gave an example in another response, you have N processes
queueing up data for TCP or UDP or whatever in parallel on
different cpus, all going out the same 10gbit device.

All of them enter into ->hard_start_xmit(), and thus all of them try
to take the same netdev->tx_lock

If they have multiple TX queues, independantly programmable, that
single lock is stupid.

We could use per-queue TX locks for such hardware, but we can't
support that currently.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread jamal

On Wed, 2007-06-06 at 16:48 -0700, Rick Jones wrote:
>   > RX queues - yes, I can see;  TX queues, it doesnt make sense to put
> > different rings on different CPUs.
> 
> To what extent might that preclude some cachelines bouncing hither and 
> yon between the CPUs?

I think the bouncing will exist a lot more with the multi CPUs. But one
would assume if you go that path, you would also parallelize the stack
on egress to reduce such an effect. I guess the point i am not seeing is
the value. The tx, once hitting the NIC is an IO issue not a CPU issue.
OTOH, the receive path once a packet is received, that is a CPU problem
(and therefore multi CPUs help).

To be fair to Peter, that is not what his patches are trying to address
(and infact, they cant solve that problem).

off for the night.

cheers,
jamal

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread David Miller

From: jamal <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 19:32:46 -0400

> So use a different scheduler. Dont use strict prio. Strict prio will
> guarantee starvation of low prio packets as long as there are high prio
> packets. Thats the intent.

Ok, point taken.

There are of course other uses for multiple TX queues, and in
particular my finer-grained locking example.

I'm still amazed the TX locking issue wasn't clear to you,
too nervous about tonight's game? :)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread David Miller

From: jamal <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 19:32:46 -0400

> On Wed, 2007-06-06 at 15:35 -0700, David Miller wrote:
> > With the above for transmit, and having N "struct napi_struct"
> > instances for MSI-X directed RX queues, we'll have no problem keeping
> > a 10gbit (or even faster) port completely full with lots of cpu to
> > spare on multi-core boxes.
> > 
> 
> RX queues - yes, I can see;  TX queues, it doesnt make sense to put
> different rings on different CPUs.

For the locking is makes a ton of sense.

If you have sendmsg() calls going on N cpus, would you rather
they:

1) All queue up to the single netdev->tx_lock

or

2) All take local per-hw-queue locks

to transmit the data they are sending?

I thought this was obvious... guess not :-)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread Rick Jones


 > RX queues - yes, I can see;  TX queues, it doesnt make sense to put

different rings on different CPUs.


To what extent might that preclude some cachelines bouncing hither and 
yon between the CPUs?


rick jones

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

bnx2 driver reports WoL not supported when it actually works

2007-06-06 Thread Tim Mann

Hi, Michael.  We have some IBM blades with a BCM5708-based dual-port
NIC that the bnx2 driver reports as not supporting wake-on-LAN.
(Ethtool says "Supports Wake-on: d".)  However, IBM says that WoL is
supported by these NICs, and in fact when I tried putting the blade
into soft-off and sending a magic packet to it, it did wake up!  I
tried both ports, and they both worked.

Some more details:  The driver identifies the NIC as follows:

Broadcom NetXtreme II BCM5708 1000Base-SX (B2) PCI-X 64-bit 133MHz found at mem 
da00, IRQ 153, node addr  00145 ed6f8ea

Here's the full ethtool output:

Supported ports: [ FIBRE ]
Supported link modes:   1000baseT/Full 
Supports auto-negotiation: Yes
Advertised link modes:  1000baseT/Full 
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 2
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Cannot get message level: Function not implemented
Link detected: yes

I added a printk to get the full contents of the chip_id register; it's
0x57081021.  So the driver is deciding that the NIC doesn't support WoL
because the SERDES bit is set in the CHIP_BOND_ID.  (It's not the later
CHIP_ID test since this NIC is a 5708_B2.)

/* Disable WOL support if we are running on a SERDES chip. */
if (CHIP_BOND_ID(bp) & CHIP_BOND_ID_SERDES_BIT) {
bp->phy_flags |= PHY_SERDES_FLAG;
bp->flags |= NO_WOL_FLAG;
if (CHIP_NUM(bp) == CHIP_NUM_5708) {
bp->phy_addr = 2;
reg = REG_RD_IND(bp, bp->shmem_base +
 BNX2_SHARED_HW_CFG_CONFIG);
if (reg & BNX2_SHARED_HW_CFG_PHY_2_5G)
bp->phy_flags |= PHY_2_5G_CAPABLE_FLAG;
}
}

if ((CHIP_ID(bp) == CHIP_ID_5708_A0) ||
(CHIP_ID(bp) == CHIP_ID_5708_B0) ||
(CHIP_ID(bp) == CHIP_ID_5708_B1))
bp->flags |= NO_WOL_FLAG;

It would be easy to "fix" this by commenting out the line that sets the
NO_WOL_FLAG if the SERDES bit is set, but I assume the line is there
for a reason -- I imagine there are *some* NICs that don't support WoL
if the SERDES bit is set (?).

Can you find out what the right fix is?  (Or do we need to ask IBM?)
It's important to me because my software is conservative and does not
put a machine to sleep if ethtool says that the machine doesn't have a
NIC that supports WoL.

Thanks,
-- 
Tim Mann  work: [EMAIL PROTECTED]  home: [EMAIL PROTECTED]
  http://www.vmware.com  http://tim-mann.org
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread jamal

On Wed, 2007-06-06 at 15:40 -0700, David Miller wrote:

> There are two core issues in my mind:
> 
> 1) multi-queue on both RX and TX is going to be very pervasive very
>soon, everyone is putting this into silicon.
> 
>The parallelization gain potential is enormous, and we have to
>design for this.
> 

There is no potential for parallelizing on transmit that i can think of.
Dave, please explain it slowly so i can understand it.

There is huge potential for parallelizing on receive. But i am certainly
missing the value in the transmit.

cheers,
jamal

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread Jeff Garzik

On Wed, Jun 06, 2007 at 04:14:12PM -0700, Waskiewicz Jr, Peter P wrote:
> > While I am growing in support of your changes, there are two things:
> > 
> > 1) I want to study them more and hear more about what Patrick has to
> >say about them when he returns from his trip on Sunday
> > 
> > 2) I don't want to open up a net-2.6.23 tree yet so that people
> >concentrate on bug fixes and regressions, pick an open bug or
> >regression report and help out if you want net-2.6.23 cut faster
> >:-)
> 
> Where can I find such reports?

Michal Piotrowski's list is a good place to start, while not
network-specific, is turning into _the_ place to list regressions.

Jeff



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: warnings in git-wireless

2007-06-06 Thread Andrew Morton

On Wed, 06 Jun 2007 13:51:41 -0700 James Ketrenos <[EMAIL PROTECTED]> wrote:

> 
> >>  * make C=2 CF=-Wall will complain if you use ARRAY_SIZE on global data
> >>  */
> >> #define GLOBAL_ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> > 
> > This is identical to ARRAY_SIZE.
> >
> > And if there's some problem with ARRAY_SIZE then fix ARRAY_SIZE!  Don't go 
> > off and create some private thing and leave everyone else twisting in the
> > wind.
> 
> The code was resolving the sparse warnings.  GLOBAL_ARRAY_SIZE removes the 
> part of the ARRAY_SIZE definition that causes sparse to complain ('+ 
> __must_be_array(arr)').  I don't know enough about how sparse works to fix 
> sparse, or to know if its a problem with __must_be_array.  The code itself 
> was fine -- using an array with ARRAY_SIZE.

(These 340-column emails are rather hard to reply to)

Your GLOBAL_ARRAY_SIZE() is, afaict, identical to ARRAY_SIZE().

If ARRAY_SIZE() is spitting some sparse warning then please report it and
we'll take a look into it.

> 
> >> /* Debug and printf string expansion helpers for printing bitfields */
> >> #define BIT_FMT8 "%c%c%c%c-%c%c%c%c"
> >> #define BIT_FMT16 BIT_FMT8 ":" BIT_FMT8
> >> #define BIT_FMT32 BIT_FMT16 " " BIT_FMT16
> ...
> >> #define BIT_ARG32(x) \
> >> BITC(x,31),BITC(x,30),BITC(x,29),BITC(x,28),\
> >> BITC(x,27),BITC(x,26),BITC(x,25),BITC(x,24),\
> >> BITC(x,23),BITC(x,22),BITC(x,21),BITC(x,20),\
> >> BITC(x,19),BITC(x,18),BITC(x,17),BITC(x,16),\
> >> BIT_ARG16(x)
> > 
> > None of the above is appropriate to a driver-private header.
> 
> Where would be the appropriate place?

Dunno.  include/linux/bitfield-helpers.h?

>  We use it in with iwlwifi; I don't know if others need it or want to use it. 
>  Do you know of other projects using something similar?

I haven't looked.

> >> #define KELVIN_TO_CELSIUS(x) ((x)-273)
> > 
> > Nor is that.
> 
> Where would the appropriate place be?

include/linux/temperature.h?  acpi could use it, and there are other things
we could put into temperature.h

> >> static inline const struct ieee80211_hw_mode *iwl_get_hw_mode(struct 
> >> iwl_priv
> >>  *priv, int mode)
> >> {
> >>int i;
> >>
> >>for (i = 0; i < 3; i++)
> >>if (priv->modes[i].mode == mode)
> >>return &priv->modes[i];
> >>
> >>return NULL;
> >> }
> > 
> > Far too large to inline, has five callsites.
> 
> Currently CodingStyle states to use inline where you might have otherwise 
> used a macro, and then later if the function is not overly complex (citing "3 
> lines" as a guideline).  Is this too long because it has a for loop in it?  
> Or a loop and a branch?

Anything more than 10-20 instructions turns out to be too large.

> Removing static inline from the functions declared in header files means they 
> need to be moved to .c files, declared as extern, and pollute the namespace.  
> In prior drivers we had been beaten up about doing that...

You were mis-beaten-up.  Choose an appropriate namespace (iwl_* sounds OK),
stick to it and you'll be fine.

> >> #define WLAN_FC_GET_TYPE(fc)(((fc) & IEEE80211_FCTL_FTYPE))
> >> #define WLAN_FC_GET_STYPE(fc)   (((fc) & IEEE80211_FCTL_STYPE))
> >> #define WLAN_GET_SEQ_FRAG(seq)  ((seq) & 0x000f)
> >> #define WLAN_GET_SEQ_SEQ(seq)   ((seq) >> 4)
> > 
> > These don't need to be macros
> 
> What would you like these to be?

/*
 * comment goes here
 */
static inline suitable_return_type wlan_fc_get_type(whatever_type_fc_has fc)
{
return fc & IEEE80211_FCTL_FTYPE;
}

> These currently exist as macros in ieee80211.h and there are other examples 
> in the kernel of similar macros.  If a goal is to remove *all macros* then 
> that should be stated in CodingStyle, preferably in a way that helps 
> developers understand how they are supposed to write their code.

These things come up again and again in lkml code-review.  Could be that
CodingStyle doesn't cover everything.  Common sense and taste applies too.

> >> #define QOS_CONTROL_LEN 2
> >>
> >> static inline u16 *ieee80211_get_qos_ctrl(struct ieee80211_hdr *hdr)
> >> {
> >>int hdr_len = ieee80211_get_hdrlen(hdr->frame_control);
> >>if (hdr->frame_control & IEEE80211_STYPE_QOS_DATA)
> >>return (u16 *) ((u8 *) hdr + (hdr_len) - QOS_CONTROL_LEN);
> >>return NULL;
> >> }
> > 
> > Two callsites, too large to inline.
> 
> If something is used more than once then it is unsuitable for an inline?  
> Again maybe updating CodingStyle would be helpful?
> 
> Change:
>   "Generally, inline functions are preferable to macros resembling 
> functions."
> 
> To:
>   "Macros resembling functions and inline functions should *NOT* be used."
> 
> IIRC, ieee80211_get_qos_ctrl used to be a macros, and was then changed to an 
> inline per CodingStyle.  Now we don't want inlines, and instead want pure 
> functions (and consequently polluted namespace--or do we want to add to 
> CodingStyle that all function

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread jamal

On Wed, 2007-06-06 at 15:35 -0700, David Miller wrote:
> From: jamal <[EMAIL PROTECTED]>
> Date: Wed, 06 Jun 2007 18:13:40 -0400

> There are other reasons to do interesting things in this area,
> purely for parallelization reasons.
> 
> For example, consider a chip that has N totally independant TX packet
> queues going out to the same ethernet port.  You can lock and transmit
> on them independantly, and the chip internally arbitrates using DRR or
> whatever to blast the queues out to the physical port in some fair'ish
> manner.
> 
> In that case you'd want to be able to do something like:
> 
>   struct mydev_tx_queue *q = &mydev->tx_q[smp_processor_id() % N];
> 
> or similar in the ->hard_start_xmit() driver.  But something generic
> to support this kind of parallelization would be great (and necessary)
> because the TX lock is unary per netdev and destroys all of the
> parallelization possible with something like the above.
> 

I cant think of any egress scheduler that will benefit from that
approach. The scheduler is the decider of which packet goes out next
on the wire.

> With the above for transmit, and having N "struct napi_struct"
> instances for MSI-X directed RX queues, we'll have no problem keeping
> a 10gbit (or even faster) port completely full with lots of cpu to
> spare on multi-core boxes.
> 

RX queues - yes, I can see;  TX queues, it doesnt make sense to put
different rings on different CPUs.

> However, I have to disagree with your analysis of the multi-qdisc
> situation, and I tend to agree with Patrick.
> If you only have one qdisc to indicate status on, when is the queue
> full?  That is the core issue. 

I just described why it is not an issue. If you make the assumption it
is an issue, then it becomes one. 


>  Indicating full status when any of
> the hardware queues are full is broken, because we should never
> block out queuing of higher priority packets just because the
> low priority queue can't take any more frames, _and_ vice versa.

Dave, you didnt read anything i said ;-> The situation you describe is
impossible. low prio will never block high prio.

> I really want to believe your proofs but they are something out of
> a fairy tale :-)

They are a lot real than it seems. Please read again what i typed in ;->
And i will produce patches since this seems to be complex to explain.

> > The only way PHL will ever shutdown the path to the hardware is when
> > there are sufficient PHL packets.
> > Corrollary,
> > The only way PSL will ever shutdown the path to the hardware is when
> > there are _NO_ PSH packets.
> 
> The problem with this line of thinking is that it ignores the fact
> that it is bad to not queue to the device when there is space
> available, _even_ for lower priority packets.

So use a different scheduler. Dont use strict prio. Strict prio will
guarantee starvation of low prio packets as long as there are high prio
packets. Thats the intent.

> The more you keep all available TX queues full, the less likely
> delays in CPU processing will lead to a device with nothing to
> do.

It is design intent - thats how the specific scheduler works. 

cheers,
jamal

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH 2/3] qla3xxx: cleanup checksum offload code

2007-06-06 Thread Ron Mercer

Jeff,
You've already appled patch 1/3 in Stephen's series.  I tested this one
(2/3) and it looks good.  I can resubmit this if you want.
Regards, Ron Mercer

> -Original Message-
> From: Stephen Hemminger [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, May 30, 2007 2:23 PM
> To: Jeff Garzik; Linux Driver
> Cc: netdev@vger.kernel.org
> Subject: [PATCH 2/3] qla3xxx: cleanup checksum offload code
> 
> The code for checksum is more complex than needed when 
> dealing with VLAN's;
> the higher layers already pass down the location of the IP header.
> 
> Compile tested only, no hardware available.
> 
> Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
> 
> ---
>  drivers/net/qla3xxx.c |   37 +++--
>  1 file changed, 11 insertions(+), 26 deletions(-)
> 
> --- a/drivers/net/qla3xxx.c   2007-05-30 14:09:25.0 -0700
> +++ b/drivers/net/qla3xxx.c   2007-05-30 14:09:31.0 -0700
> @@ -2433,37 +2433,22 @@ static int ql_get_seg_count(struct ql3_a
>   return -1;
>  }
>  
> -static void ql_hw_csum_setup(struct sk_buff *skb,
> +static void ql_hw_csum_setup(const struct sk_buff *skb,
>struct ob_mac_iocb_req *mac_iocb_ptr)
>  {
> - struct ethhdr *eth;
> - struct iphdr *ip = NULL;
> - u8 offset = ETH_HLEN;
> -
> - eth = (struct ethhdr *)(skb->data);
> -
> - if (eth->h_proto == __constant_htons(ETH_P_IP)) {
> - ip = (struct iphdr *)&skb->data[ETH_HLEN];
> - } else if (eth->h_proto == htons(ETH_P_8021Q) &&
> -((struct vlan_ethhdr *)skb->data)->
> -h_vlan_encapsulated_proto == 
> __constant_htons(ETH_P_IP)) {
> - ip = (struct iphdr *)&skb->data[VLAN_ETH_HLEN];
> - offset = VLAN_ETH_HLEN;
> - }
> -
> - if (ip) {
> - if (ip->protocol == IPPROTO_TCP) {
> - mac_iocb_ptr->flags1 |= 
> OB_3032MAC_IOCB_REQ_TC | 
> + const struct iphdr *ip = ip_hdr(skb);
> +
> + mac_iocb_ptr->ip_hdr_off = skb_network_offset(skb);
> + mac_iocb_ptr->ip_hdr_len = ip->ihl;
> +
> + if (ip->protocol == IPPROTO_TCP) {
> + mac_iocb_ptr->flags1 |= OB_3032MAC_IOCB_REQ_TC |
>   OB_3032MAC_IOCB_REQ_IC;
> - mac_iocb_ptr->ip_hdr_off = offset;
> - mac_iocb_ptr->ip_hdr_len = ip->ihl;
> - } else if (ip->protocol == IPPROTO_UDP) {
> - mac_iocb_ptr->flags1 |= 
> OB_3032MAC_IOCB_REQ_UC | 
> + } else {
> + mac_iocb_ptr->flags1 |= OB_3032MAC_IOCB_REQ_UC |
>   OB_3032MAC_IOCB_REQ_IC;
> - mac_iocb_ptr->ip_hdr_off = offset;
> - mac_iocb_ptr->ip_hdr_len = ip->ihl;
> - }
>   }
> +
>  }
>  
>  /*
> 
> -- 
> Stephen Hemminger <[EMAIL PROTECTED]>
> 
> 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread Waskiewicz Jr, Peter P

> While I am growing in support of your changes, there are two things:
> 
> 1) I want to study them more and hear more about what Patrick has to
>say about them when he returns from his trip on Sunday
> 
> 2) I don't want to open up a net-2.6.23 tree yet so that people
>concentrate on bug fixes and regressions, pick an open bug or
>regression report and help out if you want net-2.6.23 cut faster
>:-)

Where can I find such reports?
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread David Miller

From: "Waskiewicz Jr, Peter P" <[EMAIL PROTECTED]>
Date: Wed, 6 Jun 2007 15:57:35 -0700

>   Can we move forward on this please?  If you are comfortable
> that my patches give the kernel the ability to manage hardware
> queues sufficiently, I'd like to request that 2.6.23 be opened (wink
> wink) so I can submit the patches for inclusion to that kernel.

While I am growing in support of your changes, there are
two things:

1) I want to study them more and hear more about what Patrick has to
   say about them when he returns from his trip on Sunday

2) I don't want to open up a net-2.6.23 tree yet so that people
   concentrate on bug fixes and regressions, pick an open bug or
   regression report and help out if you want net-2.6.23 cut faster
   :-)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread Waskiewicz Jr, Peter P

> However, I have to disagree with your analysis of the 
> multi-qdisc situation, and I tend to agree with Patrick.
> 
> If you only have one qdisc to indicate status on, when is the 
> queue full?  That is the core issue.  Indicating full status 
> when any of the hardware queues are full is broken, because 
> we should never block out queuing of higher priority packets 
> just because the low priority queue can't take any more 
> frames, _and_ vice versa.
> 
> I really want to believe your proofs but they are something 
> out of a fairy tale :-)

Dave,
Can we move forward on this please?  If you are comfortable that
my patches give the kernel the ability to manage hardware queues
sufficiently, I'd like to request that 2.6.23 be opened (wink wink) so I
can submit the patches for inclusion to that kernel.

Thanks,

-PJ Waskiewicz
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: warnings in git-wireless

2007-06-06 Thread James Ketrenos

Andrew Morton wrote:

> i386 allmodconfig isn't that hard, guys.
> 
...
> drivers/net/wireless/mac80211/iwlwifi/iwl-4965.c:2050: warning: left shift 
> count >= width of type

My mistake.  I ran that and even fixed the warning at one point...  anyway, 
I've committed a patch to our tree to fix the code and also added comments to 
the iwl_{get,set}_bits code.

>> /*
>>  * make C=2 CF=-Wall will complain if you use ARRAY_SIZE on global data
>>  */
>> #define GLOBAL_ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> 
> This is identical to ARRAY_SIZE.
>
> And if there's some problem with ARRAY_SIZE then fix ARRAY_SIZE!  Don't go 
> off and create some private thing and leave everyone else twisting in the
> wind.

The code was resolving the sparse warnings.  GLOBAL_ARRAY_SIZE removes the part 
of the ARRAY_SIZE definition that causes sparse to complain ('+ 
__must_be_array(arr)').  I don't know enough about how sparse works to fix 
sparse, or to know if its a problem with __must_be_array.  The code itself was 
fine -- using an array with ARRAY_SIZE.

>> /* Debug and printf string expansion helpers for printing bitfields */
>> #define BIT_FMT8 "%c%c%c%c-%c%c%c%c"
>> #define BIT_FMT16 BIT_FMT8 ":" BIT_FMT8
>> #define BIT_FMT32 BIT_FMT16 " " BIT_FMT16
...
>> #define BIT_ARG32(x) \
>> BITC(x,31),BITC(x,30),BITC(x,29),BITC(x,28),\
>> BITC(x,27),BITC(x,26),BITC(x,25),BITC(x,24),\
>> BITC(x,23),BITC(x,22),BITC(x,21),BITC(x,20),\
>> BITC(x,19),BITC(x,18),BITC(x,17),BITC(x,16),\
>> BIT_ARG16(x)
> 
> None of the above is appropriate to a driver-private header.

Where would be the appropriate place?  We use it in with iwlwifi; I don't know 
if others need it or want to use it.  Do you know of other projects using 
something similar?

>> #define KELVIN_TO_CELSIUS(x) ((x)-273)
> 
> Nor is that.

Where would the appropriate place be?

>> static inline const struct ieee80211_hw_mode *iwl_get_hw_mode(struct iwl_priv
>>*priv, int mode)
>> {
>>  int i;
>>
>>  for (i = 0; i < 3; i++)
>>  if (priv->modes[i].mode == mode)
>>  return &priv->modes[i];
>>
>>  return NULL;
>> }
> 
> Far too large to inline, has five callsites.

Currently CodingStyle states to use inline where you might have otherwise used 
a macro, and then later if the function is not overly complex (citing "3 lines" 
as a guideline).  Is this too long because it has a for loop in it?  Or a loop 
and a branch?

Removing static inline from the functions declared in header files means they 
need to be moved to .c files, declared as extern, and pollute the namespace.  
In prior drivers we had been beaten up about doing that...

>> #define WLAN_FC_GET_TYPE(fc)(((fc) & IEEE80211_FCTL_FTYPE))
>> #define WLAN_FC_GET_STYPE(fc)   (((fc) & IEEE80211_FCTL_STYPE))
>> #define WLAN_GET_SEQ_FRAG(seq)  ((seq) & 0x000f)
>> #define WLAN_GET_SEQ_SEQ(seq)   ((seq) >> 4)
> 
> These don't need to be macros

What would you like these to be?

These currently exist as macros in ieee80211.h and there are other examples in 
the kernel of similar macros.  If a goal is to remove *all macros* then that 
should be stated in CodingStyle, preferably in a way that helps developers 
understand how they are supposed to write their code.

>> #define QOS_CONTROL_LEN 2
>>
>> static inline u16 *ieee80211_get_qos_ctrl(struct ieee80211_hdr *hdr)
>> {
>>  int hdr_len = ieee80211_get_hdrlen(hdr->frame_control);
>>  if (hdr->frame_control & IEEE80211_STYPE_QOS_DATA)
>>  return (u16 *) ((u8 *) hdr + (hdr_len) - QOS_CONTROL_LEN);
>>  return NULL;
>> }
> 
> Two callsites, too large to inline.

If something is used more than once then it is unsuitable for an inline?  Again 
maybe updating CodingStyle would be helpful?

Change:
"Generally, inline functions are preferable to macros resembling 
functions."

To:
"Macros resembling functions and inline functions should *NOT* be used."

IIRC, ieee80211_get_qos_ctrl used to be a macros, and was then changed to an 
inline per CodingStyle.  Now we don't want inlines, and instead want pure 
functions (and consequently polluted namespace--or do we want to add to 
CodingStyle that all functions should be implemented in a single file and 
marked static?)

"A reasonable rule of thumb is to not put inline at functions that have 
more
 than 3 lines of code in them. "

An assignment and a branch return isn't overly complex; it does call another 
function... is that the problem?

If callsites is a key indicator of when to and not to use inline, please add 
that to CodingStyle.  We look at other drivers and headers, we see what they 
do, and we replicate.  We read CodingStyle and use it as a guideline.

...
>> #define ieee80211_is_reassoc_response(fc) \
>>((WLAN_FC_GET_TYPE(fc) == IEEE80211_FTYPE_MGMT) && \
>>(WLAN_FC_GET_STYPE(fc) == IEEE80211_STYPE_REASSOC_RESP))
> 
> None of the above should be macros.

So inline wit

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread David Miller

From: "Waskiewicz Jr, Peter P" <[EMAIL PROTECTED]>
Date: Wed, 6 Jun 2007 15:30:39 -0700

> > [Which of course leads to the complexity (and not optimizing 
> > for the common - which is single ring NICs)].
> 
> The common for 100 Mbit and older 1Gbit is single ring NICs.  Newer
> PCI-X and PCIe NICs from 1Gbit to 10Gbit support multiple rings in the
> hardware, and it's all headed in that direction, so it's becoming the
> common case.

I totally agree.  No modern commodity 1gb and faster card is going
to be without many queues on both TX and RX.

> > Iam actually not against the subqueue control - i know Peter 
> > needs it for certain hardware; i am just against the mucking 
> > around of the common case (single ring NIC) just to get that working. 
> 
> Single-ring NICs see no difference here.  Please explain why using my
> patches with pfifo_fast, sch_prio, or any other existing qdisc will
> change the behavior for single-ring NICs?

I agree with the implication here, there is no penalty for existing
devices.

There are two core issues in my mind:

1) multi-queue on both RX and TX is going to be very pervasive very
   soon, everyone is putting this into silicon.

   The parallelization gain potential is enormous, and we have to
   design for this.

2) Queues are meant to be filled as much as possible, you can't do
   that by having only one qdisc attached to the device indicating
   unary full status, you simply can't.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread David Miller

From: jamal <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 18:13:40 -0400

> From the above you can see they are simple. I am working on a couple of
> things (batching and recovering pktgen ipsec patches)- I will work on
> those patches soon after.
> 
> Iam actually not against the subqueue control - i know Peter needs it
> for certain hardware; i am just against the mucking around of the common
> case (single ring NIC) just to get that working. 

There are other reasons to do interesting things in this area,
purely for parallelization reasons.

For example, consider a chip that has N totally independant TX packet
queues going out to the same ethernet port.  You can lock and transmit
on them independantly, and the chip internally arbitrates using DRR or
whatever to blast the queues out to the physical port in some fair'ish
manner.

In that case you'd want to be able to do something like:

struct mydev_tx_queue *q = &mydev->tx_q[smp_processor_id() % N];

or similar in the ->hard_start_xmit() driver.  But something generic
to support this kind of parallelization would be great (and necessary)
because the TX lock is unary per netdev and destroys all of the
parallelization possible with something like the above.

With the above for transmit, and having N "struct napi_struct"
instances for MSI-X directed RX queues, we'll have no problem keeping
a 10gbit (or even faster) port completely full with lots of cpu to
spare on multi-core boxes.

However, I have to disagree with your analysis of the multi-qdisc
situation, and I tend to agree with Patrick.

If you only have one qdisc to indicate status on, when is the queue
full?  That is the core issue.  Indicating full status when any of
the hardware queues are full is broken, because we should never
block out queuing of higher priority packets just because the
low priority queue can't take any more frames, _and_ vice versa.

I really want to believe your proofs but they are something out of
a fairy tale :-)

> The only way PHL will ever shutdown the path to the hardware is when
> there are sufficient PHL packets.
> Corrollary,
> The only way PSL will ever shutdown the path to the hardware is when
> there are _NO_ PSH packets.

The problem with this line of thinking is that it ignores the fact
that it is bad to not queue to the device when there is space
available, _even_ for lower priority packets.

The more you keep all available TX queues full, the less likely
delays in CPU processing will lead to a device with nothing to
do.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread Waskiewicz Jr, Peter P

> [Which of course leads to the complexity (and not optimizing 
> for the common - which is single ring NICs)].

The common for 100 Mbit and older 1Gbit is single ring NICs.  Newer
PCI-X and PCIe NICs from 1Gbit to 10Gbit support multiple rings in the
hardware, and it's all headed in that direction, so it's becoming the
common case.

> Infact for the wired case i see little value (there is some) 
> in using multiple rings. In the case of wireless (which is 
> strict prio based) it provides more value.

There is value, hence why NIC manufacturers are building wired parts
that support multiple rings today.  And wireless may not want strict
prio in software, and may just want round-robin from the stack.  Either
way, Yi Zhu has represented the wireless side in this discussion
agreeing with these per-queue control patches.  Is wireless not a common
case to be considered?

> > I would love to see your alternative patches.
> 
> >From the above you can see they are simple.

The description above won't provide what I'm trying to solve and what
wireless has stated they want.

> Iam actually not against the subqueue control - i know Peter 
> needs it for certain hardware; i am just against the mucking 
> around of the common case (single ring NIC) just to get that working. 

Single-ring NICs see no difference here.  Please explain why using my
patches with pfifo_fast, sch_prio, or any other existing qdisc will
change the behavior for single-ring NICs?  If the driver doesn't call
alloc_etherdev_mq() explicity, or use the new sch_rr qdisc, then the Tx
path is identical to the kernel today.  What am I mucking around with?
And these patches are not for specific hardware; rather they're for all
the NICs today that have multiple rings, and want to control them in the
OS instead of the driver, which is most of wireless and a handful of
NICs from Intel and Neterion afaik.

-PJ Waskiewicz
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread jamal

On Wed, 2007-06-06 at 17:11 +0200, Patrick McHardy wrote:

> I haven't followed the entire discussion, but I still don't see a
> alternative to touching the qdisc layer - multiple hardware queues
> need multiple queue states if you want to avoid a busy hardware
> queue stopping the qdisc entirely 

If you start with the above premise then ...

> and thereby preventing the qdisc
> to continue feeding packets to other active HW queues. And to make
> use of the multiple queue states you need multiple queues.

 you will logically lead to the above conclusion. 

[Which of course leads to the complexity (and not optimizing for the
common - which is single ring NICs)].

The problem is the premise is _innacurate_.
Since you havent followed the discussion, i will try to be brief (which
is hard).
If you want verbosity it is in my previous emails:

Consider a simple example of strict prio qdisc which is mirror
configuration of a specific hardware. 
Then for sake of discussion, assume two prio queues in the qdisc - PSL
and PSH and two hardware queues/rings in a NIC which does strict prio
with queues PHL and PHH.
The mapping is as follows:
PSL --- maps to --- PHL
PSH --- maps to --- PHH

Assume the PxH has a higher prio than PxL.
Strict prio will always favor H over L.

Two scenarios:
a) a lot of packets for PSL arriving on the stack.
They only get sent from PSL -> PHL if and only if there are no
packets from PSH->PHH.
b)a lot of packets for PSH arriving from the stack.
They will always be favored over PSL in sending to the hardware.

>From the above:
The only way PHL will ever shutdown the path to the hardware is when
there are sufficient PHL packets.
Corrollary,
The only way PSL will ever shutdown the path to the hardware is when
there are _NO_ PSH packets.

So there is no need to do per queue control because the schedule will
ensure things work out fine as long as what you have the correct qdisc;
and it is a qdisc that will work just fine with a single ring with zero
mods.
What you need is a driver API to ask it to select the ring given an
index. This is similar to the qdisc filter used to select a queue.

You can extend the use case i described above to N queues. You can
extend it to other schedulers (WRR or any non-work conserving queues)
etc. It is consistent. Of course if you configure CBQ for a hardware
that does strict prio - that is a misconfig etc.

Infact for the wired case i see little value (there is some) in using
multiple rings. In the case of wireless (which is strict prio based) it
provides more value.

> I would love to see your alternative patches.

>From the above you can see they are simple. I am working on a couple of
things (batching and recovering pktgen ipsec patches)- I will work on
those patches soon after.

Iam actually not against the subqueue control - i know Peter needs it
for certain hardware; i am just against the mucking around of the common
case (single ring NIC) just to get that working. 

cheers,
jamal

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/4] b44: timer power saving

2007-06-06 Thread Stephen Hemminger


> I don't think we need +1, if you need to fire immediately
> (on the next tick). The timer core will always fire
> timers that are in the past immediately.

Thanks, but is it worth bothering to change it again?
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Devel] Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread David Miller

From: Daniel Lezcano <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 22:38:11 +0200

> Perhaps, a name like "epipe" or "npipe", which reflects what does the 
> device, is more appropriate ?

'npipe' (Network PIPE) or 'epipe' (Ethernet PIPE) are fine with me.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Devel] Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread Daniel Lezcano

David Miller wrote:

From: Pavel Emelianov <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 19:11:38 +0400

Veth stands for Virtual ETHernet. It is a simple tunnel driver
that works at the link layer and looks like a pair of ethernet
devices interconnected with each other.

I would suggest choosing a different name.

'veth' is also the name of the virtualized ethernet device
found on IBM machines, driven by driver/net/ibmveth.[ch]

Eric Biederman proposed the name "etun" which stands for "Ethernet TUNnel".

The goal of the pair device is to pass network packets between namespaces.
That reminds me the well known "pipe" which pass data between processes.

All network devices have a name describing what they do : "eth" for 
ethernet,

"dummy" for a device doing nothing, loopback, ...

Perhaps, a name like "epipe" or "npipe", which reflects what does the 
device, is more appropriate ?

-- Daniel

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PCNET32] Lock solid with netconsole

2007-06-06 Thread Francois Romieu

Emmanuel Fusté <[EMAIL PROTECTED]> :
[...]
> Did you plan to submit the "spin_lock_irqsave" patch to mainline ?

No. I will not have enough time to figure why/if it fixes things.

-- 
Ueimor
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread David Miller

From: Pavel Emelianov <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 19:11:38 +0400

> Veth stands for Virtual ETHernet. It is a simple tunnel driver
> that works at the link layer and looks like a pair of ethernet
> devices interconnected with each other.

I would suggest choosing a different name.

'veth' is also the name of the virtualized ethernet device
found on IBM machines, driven by driver/net/ibmveth.[ch]

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] NetXen: Fix link status messages

2007-06-06 Thread wendy xiong

On Sun, 2007-06-03 at 11:51 -0400, Jeff Garzik wrote:
> Mithlesh Thukral wrote:
> > NetXen: Correct link status messages.
> > This patch will fix the problem of wrong link status messages
> > that were reported. 
> > 
> > Signed-off by: Wen Xiong <[EMAIL PROTECTED]>
> > Signed-off by: Mithlesh Thukral <[EMAIL PROTECTED]>
> > ---
> > 
> >  drivers/net/netxen/netxen_nic.h  |1 +
> >  drivers/net/netxen/netxen_nic_init.c |   21 +
> >  drivers/net/netxen/netxen_nic_isr.c  |   23 +++
> >  3 files changed, 37 insertions(+), 8 deletions(-)
> 
> dropped due to previous patches being dropped
Hi Jeff, 

Thanks for your review the patch!
We are going to re-submit the patch with correct problem description.

Thanks,
wendy

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Alexey Kuznetsov

Hello!

> Good point, I didn't think of that. Is there a version of this patch
> that already uses different namespaces so I can look at it?

Pavel does not like the idea. It looks "not exactly pretty", like you said. :-)

The alternative is to create pair in main namespace and then move
one end to another namespace renaming it and changing index.
Why I do not like it? Because this makes RTM_NEWLINK just useless step,
all its work is undone and real work is remade when the device moves,
with all the unrettiness moved to another place.

>From another hand, some move operation is required in any case.
Right now in openvz the problem is solved in tricky, but quite
inerseting way: all the devices in main namespace are assigned
with odd index, child devices get odd index. So that, when a device
moves from main namespace to child, openvz does not need to change
ifindex, conflict is impossible. Well, it is working approach.
But it is not pretty either.


> Are network namespace completely seperated or is there some hierarchy
> with all lower namespaces visible above or something like that?

Right now they are completely separate.

It is possible to make child devices visible in parent namespace
like it is done for process pids: i.e. there is an abstract identity
which is seen under different names and indices in different namespaces.
Sounds cool, but this add a lot of complexity, which has no meaning
outside of context of device creation, I do not think it is worth to do.




> The identity of the main device has no meaning within a different
> namespace, but are there other reasons for hiding it?

Perhaps, security. It is not a good idea to leak information
about parent namespace to child namespace.

Also, people will want to see emulated ethernet inside namespace
looking exactly like ethernet. No freaking additional attributes.

Alexey
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/4] b44: timer power saving

2007-06-06 Thread Michael Buesch

On Monday 04 June 2007 22:25:37 Stephen Hemminger wrote:
> Make the PHY and statistic timer run on one second boundary
> for powersaving.
> 
> On resume, the driver should check for link up immediately, to
> get online faster (rather than waiting for the next second).
> 
> Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
> 
> 
> ---
>  drivers/net/b44.c |9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> --- a/drivers/net/b44.c   2007-06-04 12:31:27.0 -0700
> +++ b/drivers/net/b44.c   2007-06-04 12:31:34.0 -0700
> @@ -599,8 +599,7 @@ static void b44_timer(unsigned long __op
>  
>   spin_unlock_irq(&bp->lock);
>  
> - bp->timer.expires = jiffies + HZ;
> - add_timer(&bp->timer);
> + mod_timer(&bp->timer, round_jiffies(jiffies + HZ));
>  }
>  
>  static void b44_tx(struct b44 *bp)
> @@ -2348,11 +2347,11 @@ static int b44_resume(struct pci_dev *pd
>   netif_device_attach(bp->dev);
>   spin_unlock_irq(&bp->lock);
>  
> - bp->timer.expires = jiffies + HZ;
> - add_timer(&bp->timer);
> -
>   b44_enable_ints(bp);
>   netif_wake_queue(dev);
> +
> + mod_timer(&bp->timer, jiffies + 1);

I don't think we need +1, if you need to fire immediately
(on the next tick). The timer core will always fire
timers that are in the past immediately.

-- 
Greetings Michael.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] NetXen: Fix ping issue

2007-06-06 Thread Jeff Garzik


You'll need to resend the patch...

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2.6.22-rc4] ehea: Fixed possible kernel panic on VLAN packet recv

2007-06-06 Thread Thomas Klein

This patch fixes a possible kernel panic due to not checking the vlan group
when processing received VLAN packets and a malfunction in VLAN/hypervisor
registration.


Signed-off-by: Thomas Klein <[EMAIL PROTECTED]>
---


diff -Nurp -X dontdiff linux-2.6.22-rc4/drivers/net/ehea/ehea.h 
patched_kernel/drivers/net/ehea/ehea.h
--- linux-2.6.22-rc4/drivers/net/ehea/ehea.h2007-06-05 02:57:25.0 
+0200
+++ patched_kernel/drivers/net/ehea/ehea.h  2007-06-06 12:53:58.0 
+0200
@@ -39,7 +39,7 @@
 #include 
 
 #define DRV_NAME   "ehea"
-#define DRV_VERSION"EHEA_0061"
+#define DRV_VERSION"EHEA_0064"
 
 #define EHEA_MSG_DEFAULT (NETIF_MSG_LINK | NETIF_MSG_TIMER \
| NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
diff -Nurp -X dontdiff linux-2.6.22-rc4/drivers/net/ehea/ehea_main.c 
patched_kernel/drivers/net/ehea/ehea_main.c
--- linux-2.6.22-rc4/drivers/net/ehea/ehea_main.c   2007-06-05 
02:57:25.0 +0200
+++ patched_kernel/drivers/net/ehea/ehea_main.c 2007-06-06 12:53:58.0 
+0200
@@ -451,7 +451,8 @@ static struct ehea_cqe *ehea_proc_rwqes(
processed_rq3++;
}
 
-   if (cqe->status & EHEA_CQE_VLAN_TAG_XTRACT)
+   if ((cqe->status & EHEA_CQE_VLAN_TAG_XTRACT)
+   && port->vgrp)
vlan_hwaccel_receive_skb(skb, port->vgrp,
 cqe->vlan_tag);
else
@@ -1910,10 +1911,7 @@ static void ehea_vlan_rx_register(struct
goto out;
}
 
-   if (grp)
-   memset(cb1->vlan_filter, 0, sizeof(cb1->vlan_filter));
-   else
-   memset(cb1->vlan_filter, 0xFF, sizeof(cb1->vlan_filter));
+   memset(cb1->vlan_filter, 0, sizeof(cb1->vlan_filter));
 
hret = ehea_h_modify_ehea_port(adapter->handle, port->logical_port_id,
   H_PORT_CB1, H_PORT_CB1_ALL, cb1);
@@ -1947,7 +1945,7 @@ static void ehea_vlan_rx_add_vid(struct 
}
 
index = (vid / 64);
-   cb1->vlan_filter[index] |= ((u64)(1 << (vid & 0x3F)));
+   cb1->vlan_filter[index] |= ((u64)(0x8000 >> (vid & 0x3F)));
 
hret = ehea_h_modify_ehea_port(adapter->handle, port->logical_port_id,
   H_PORT_CB1, H_PORT_CB1_ALL, cb1);
@@ -1982,7 +1980,7 @@ static void ehea_vlan_rx_kill_vid(struct
}
 
index = (vid / 64);
-   cb1->vlan_filter[index] &= ~((u64)(1 << (vid & 0x3F)));
+   cb1->vlan_filter[index] &= ~((u64)(0x8000 >> (vid & 0x3F)));
 
hret = ehea_h_modify_ehea_port(adapter->handle, port->logical_port_id,
   H_PORT_CB1, H_PORT_CB1_ALL, cb1);

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] NetXen: Add NETXEN prefix to macros

2007-06-06 Thread wendy xiong

On Wed, 2007-06-06 at 14:05 -0400, Jeff Garzik wrote:
> On Wed, Jun 06, 2007 at 12:52:44PM -0500, wendy xiong wrote:
> > On Sun, 2007-06-03 at 11:50 -0400, Jeff Garzik wrote:
> > > Mithlesh Thukral wrote:
> > > > NetXen: Add NETXEN prefix to a macro
> > > > This patch will add the "NETXEN" prefix to "USER_START" macro.
> > > > 
> > > > Signed-off by: Wen Xiong <[EMAIL PROTECTED]>
> > > > Signed-off by: Mithlesh Thukral <[EMAIL PROTECTED]>
> > > > ---
> > > > 
> > > >  drivers/net/netxen/netxen_nic.h |4 ++--
> > > >  drivers/net/netxen/netxen_nic_ethtool.c |2 +-
> > > >  drivers/net/netxen/netxen_nic_hw.c  |4 ++--
> > > >  drivers/net/netxen/netxen_nic_init.c|4 ++--
> > > >  4 files changed, 7 insertions(+), 7 deletions(-)
> > > 
> > > Your patch description is useless.  Clearly we know -what- it does, 
> > > simply by reading the patch.
> > > 
> > > But it does not answer the simple question:  why?  why is this needed in 
> > > a bug fix Release Candidate series?
> > 
> > Hi Jeff,
> > 
> > When we backported netxen driver to 2.6.9 kernel on ppc, we saw the
> > compile errors for "USER_START". Because there is another definition for
> > USER_START which is used by arch/ppc64/mm/hash_utils.c and native.c.
> > 
> > So we like to add NETXEN before USER_START in netxen device driver, then
> > netxen driver can be compiled correctly in 2.6.9 kernel.
> > 
> > Let me know if you any question for this patch.
> 
> That answers my questions, but doesn't address the issue:  You need
> to regenerate the patch with the description improved as described.
> 
> The patch description is copied verbatim into the kernel changelog, and
> preserved verbatim for all eternity.  Four years from now, we need to be
> able to read the patch description and understand -why- the NETXEN_xxx
> prefix was added.  Simply stating that the NETXEN_xxx prefix was added
> is useless, as it is self-evident from reading the patch itself.
> 
>   Jeff
> 
> 
> 
Jeff, 
Thanks.
We will re-generate the patch with correct issue title and description.

Thanks,
wendy

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] NetXen: Fix ping issue

2007-06-06 Thread wendy xiong

On Sun, 2007-06-03 at 11:51 -0400, Jeff Garzik wrote:
> Mithlesh Thukral wrote:
> > NetXen: Fix initialization and subsequent ping issue
> > This patch will fix the initialization and ping issues seen on
> > certain PPC architecture blades.
> > 
> > Signed-off by: Wen Xiong <[EMAIL PROTECTED]>
> > Signed-off by: Mithlesh Thukral <[EMAIL PROTECTED]>
> > ---
> > 
> >  drivers/net/netxen/netxen_nic_main.c |7 +++
> >  drivers/net/netxen/netxen_nic_niu.c  |8 ++--
> >  2 files changed, 9 insertions(+), 6 deletions(-)
> 
> Again, your patch description is useless.
> 
> You should describe the problem being fixed, and how/why the changes 
> seen in the patch actually fix the issue.
> 
Hi Jeff,

Ping problem description:
After we moved up netxen adapter's firmware to 3.4.19, we saw this ping
problem on x/pBlade.  After configured interface up, ping -c 1
10.10.10.10 failed. Netxen adapter couldn't accept ARP broadcast packet
somehow. If I manually added MAC address in the ARP table, then ping
start working.

netxen adapter should finish initilization after system boot. But on
some platform, looks netxen adapter didn't initilization correctly after
system boot up, so have to re-load the firmware again in probe routine.
Also re-initilization netxen_config_0 and netxen_config_1 registers.

Let me know if you have any question for this patch.

Thanks
Wendy Xiong

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] NetXen: Add NETXEN prefix to macros

2007-06-06 Thread Jeff Garzik

On Wed, Jun 06, 2007 at 12:52:44PM -0500, wendy xiong wrote:
> On Sun, 2007-06-03 at 11:50 -0400, Jeff Garzik wrote:
> > Mithlesh Thukral wrote:
> > > NetXen: Add NETXEN prefix to a macro
> > > This patch will add the "NETXEN" prefix to "USER_START" macro.
> > > 
> > > Signed-off by: Wen Xiong <[EMAIL PROTECTED]>
> > > Signed-off by: Mithlesh Thukral <[EMAIL PROTECTED]>
> > > ---
> > > 
> > >  drivers/net/netxen/netxen_nic.h |4 ++--
> > >  drivers/net/netxen/netxen_nic_ethtool.c |2 +-
> > >  drivers/net/netxen/netxen_nic_hw.c  |4 ++--
> > >  drivers/net/netxen/netxen_nic_init.c|4 ++--
> > >  4 files changed, 7 insertions(+), 7 deletions(-)
> > 
> > Your patch description is useless.  Clearly we know -what- it does, 
> > simply by reading the patch.
> > 
> > But it does not answer the simple question:  why?  why is this needed in 
> > a bug fix Release Candidate series?
> 
> Hi Jeff,
> 
> When we backported netxen driver to 2.6.9 kernel on ppc, we saw the
> compile errors for "USER_START". Because there is another definition for
> USER_START which is used by arch/ppc64/mm/hash_utils.c and native.c.
> 
> So we like to add NETXEN before USER_START in netxen device driver, then
> netxen driver can be compiled correctly in 2.6.9 kernel.
> 
> Let me know if you any question for this patch.

That answers my questions, but doesn't address the issue:  You need
to regenerate the patch with the description improved as described.

The patch description is copied verbatim into the kernel changelog, and
preserved verbatim for all eternity.  Four years from now, we need to be
able to read the patch description and understand -why- the NETXEN_xxx
prefix was added.  Simply stating that the NETXEN_xxx prefix was added
is useless, as it is self-evident from reading the patch itself.

Jeff



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 04/09]: Link creation API

2007-06-06 Thread Eric W. Biederman

Patrick McHardy <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>>>+if (linkinfo[IFLA_INFO_NAME]) {
>>>+nla_strlcpy(name, linkinfo[IFLA_INFO_NAME], sizeof(name));
>>>+ops = rtnl_link_ops_get(name);
>> 
>> 
>> Ugh.  Shouldn't we have the request_module logic here?
>> Otherwise it looks like we can skip the validate method and 
>> have other weird interactions.
>
>
> Good catch. The easiest solution seems be to simply replay the
> request after successful module load, which also avoids the
> device lookup race.

Looks reasonable to me.   There might be a variable or two that
we need to make certain is initialized but at a quick glance it
looks ok.

Eric


> Something like this (untested).
>
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 8d2f817..f2868b0 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -930,6 +930,7 @@ static int rtnl_newlink(struct sk_buff *skb, struct 
> nlmsghdr
> *nlh, void *arg)
>   struct nlattr *linkinfo[IFLA_INFO_MAX+1];
>   int err;
>  
> +replay:
>   err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
>   if (err < 0)
>   return err;
> @@ -1012,19 +1013,19 @@ static int rtnl_newlink(struct sk_buff *skb, struct
> nlmsghdr *nlh, void *arg)
>   if (tb[IFLA_ADDRESS] || tb[IFLA_BROADCAST] || tb[IFLA_MAP])
>   return -EOPNOTSUPP;
>  
> + if (!ops) {
>  #ifdef CONFIG_KMOD
> - if (!ops && kind[0]) {
> - /* race condition: device may be created while rtnl is
> -  * unlocked, final register_netdevice will catch it.
> -  */
> - __rtnl_unlock();
> - request_module("rtnl-link-%s", kind);
> - rtnl_lock();
> - ops = rtnl_link_ops_get(kind);
> - }
> + if (kind[0]) {
> + __rtnl_unlock();
> + request_module("rtnl-link-%s", kind);
> + rtnl_lock();
> + ops = rtnl_link_ops_get(kind);
> + if (ops)
> + goto replay;
> + }
>  #endif
> - if (!ops)
>   return -EOPNOTSUPP;
> + }
>  
>   if (!ifname[0])
>   snprintf(ifname, IFNAMSIZ, "%s%%d", ops->kind);
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] NetXen: Add NETXEN prefix to macros

2007-06-06 Thread wendy xiong

On Sun, 2007-06-03 at 11:50 -0400, Jeff Garzik wrote:
> Mithlesh Thukral wrote:
> > NetXen: Add NETXEN prefix to a macro
> > This patch will add the "NETXEN" prefix to "USER_START" macro.
> > 
> > Signed-off by: Wen Xiong <[EMAIL PROTECTED]>
> > Signed-off by: Mithlesh Thukral <[EMAIL PROTECTED]>
> > ---
> > 
> >  drivers/net/netxen/netxen_nic.h |4 ++--
> >  drivers/net/netxen/netxen_nic_ethtool.c |2 +-
> >  drivers/net/netxen/netxen_nic_hw.c  |4 ++--
> >  drivers/net/netxen/netxen_nic_init.c|4 ++--
> >  4 files changed, 7 insertions(+), 7 deletions(-)
> 
> Your patch description is useless.  Clearly we know -what- it does, 
> simply by reading the patch.
> 
> But it does not answer the simple question:  why?  why is this needed in 
> a bug fix Release Candidate series?

Hi Jeff,

When we backported netxen driver to 2.6.9 kernel on ppc, we saw the
compile errors for "USER_START". Because there is another definition for
USER_START which is used by arch/ppc64/mm/hash_utils.c and native.c.

So we like to add NETXEN before USER_START in netxen device driver, then
netxen driver can be compiled correctly in 2.6.9 kernel.

Let me know if you any question for this patch.

Thanks,
wendy Xiong



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Eric W. Biederman

Patrick McHardy <[EMAIL PROTECTED]> writes:

> Alexey Kuznetsov wrote:
>>>  I just suggested to
>>>Pavel to create only a single device per newlink operation and binding
>>>them later,
>> 
>> 
>> I see some logical inconsistency here.
>> 
>> Look, the second end is supposed to be in another namespace.
>> It will have identity, which cannot be expressed in any way in namespace,
>> which is allowed to create the pair: name, ifindex - nothing
>> is shared between namespaces.
>
>
> Good point, I didn't think of that. Is there a version of this patch
> that already uses different namespaces so I can look at it?

We have posted patches a couple of times, and veth or etun were always
a part of it.  But except for some book keeping details we really don't
care.

> Are network namespace completely seperated or is there some hierarchy
> with all lower namespaces visible above or something like that?

Completely separated.  The goal is to look like two separate machines
to user space, with respect to the network stack.

There is a bit of a hierarchy usage wise.  Because frequently only
one namespace will have real hardware devices in it.  So everything
needs to route through there.  But that detail is a usage detail
and is easiest not to reflect in the actual implementation.

> I imagined it more as a bind operation, pretty similar to enslave, so
> it would only contain an ifindex, no parameters. But as you say that
> doesn't work, so I guess we'd have to nest an entire ifinfomsg + the
> attributes for the partner device under it .. not exactly pretty.

In the model I'm working in, is that there is a separate operation:
"move device to other namespace", which should work for any network device.

So there should be an interval immediately after device creation
when both devices are in the same namespace, and then one of the
pair is moved to another namespace.

>> As response to this action two replies are generated: one RTM_NEWLINK
>> for one end of device with the whole desciption of partnet
>> is broadcasted inside this namespace, another RTM_NETLINK with index/name
>> of partner device is broadcasted inside the second namespace
>> (and, probably, some attributes, which must be hidden inside namespace,
>> f.e. identity of main device is suppressed). 
>
>
> The identity of the main device has no meaning within a different
> namespace, but are there other reasons for hiding it?

Not really.  We can already recognized the type of the device.

Eric

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 04/09]: Link creation API

2007-06-06 Thread Patrick McHardy

Eric W. Biederman wrote:
>>+ if (linkinfo[IFLA_INFO_NAME]) {
>>+ nla_strlcpy(name, linkinfo[IFLA_INFO_NAME], sizeof(name));
>>+ ops = rtnl_link_ops_get(name);
> 
> 
> Ugh.  Shouldn't we have the request_module logic here?
> Otherwise it looks like we can skip the validate method and 
> have other weird interactions.


Good catch. The easiest solution seems be to simply replay the
request after successful module load, which also avoids the
device lookup race.

Something like this (untested).

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 8d2f817..f2868b0 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -930,6 +930,7 @@ static int rtnl_newlink(struct sk_buff *skb, struct 
nlmsghdr *nlh, void *arg)
struct nlattr *linkinfo[IFLA_INFO_MAX+1];
int err;
 
+replay:
err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
if (err < 0)
return err;
@@ -1012,19 +1013,19 @@ static int rtnl_newlink(struct sk_buff *skb, struct 
nlmsghdr *nlh, void *arg)
if (tb[IFLA_ADDRESS] || tb[IFLA_BROADCAST] || tb[IFLA_MAP])
return -EOPNOTSUPP;
 
+   if (!ops) {
 #ifdef CONFIG_KMOD
-   if (!ops && kind[0]) {
-   /* race condition: device may be created while rtnl is
-* unlocked, final register_netdevice will catch it.
-*/
-   __rtnl_unlock();
-   request_module("rtnl-link-%s", kind);
-   rtnl_lock();
-   ops = rtnl_link_ops_get(kind);
-   }
+   if (kind[0]) {
+   __rtnl_unlock();
+   request_module("rtnl-link-%s", kind);
+   rtnl_lock();
+   ops = rtnl_link_ops_get(kind);
+   if (ops)
+   goto replay;
+   }
 #endif
-   if (!ops)
return -EOPNOTSUPP;
+   }
 
if (!ifname[0])
snprintf(ifname, IFNAMSIZ, "%s%%d", ops->kind);

Re: [RFC RTNETLINK 04/09]: Link creation API

2007-06-06 Thread Eric W. Biederman


> +static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
> +{
> + struct rtnl_link_ops *ops;
> + struct net_device *dev;
> + struct ifinfomsg *ifm;
> + char name[MODULE_NAME_LEN];
> + char ifname[IFNAMSIZ];
> + struct nlattr *tb[IFLA_MAX+1];
> + struct nlattr *linkinfo[IFLA_INFO_MAX+1];
> + int err;
> +
> + err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
> + if (err < 0)
> + return err;
> +
> + if (tb[IFLA_IFNAME])
> + nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
> + else
> + ifname[0] = '\0';
> +
> + ifm = nlmsg_data(nlh);
> + if (ifm->ifi_index > 0)
> + dev = __dev_get_by_index(ifm->ifi_index);
> + else if (ifname[0])
> + dev = __dev_get_by_name(ifname);
> + else
> + dev = NULL;
> +
> + if (tb[IFLA_LINKINFO]) {
> + err = nla_parse_nested(linkinfo, IFLA_INFO_MAX,
> +tb[IFLA_LINKINFO], ifla_info_policy);
> + if (err < 0)
> + return err;
> + } else
> + memset(linkinfo, 0, sizeof(linkinfo));
> +
> + if (linkinfo[IFLA_INFO_NAME]) {
> + nla_strlcpy(name, linkinfo[IFLA_INFO_NAME], sizeof(name));
> + ops = rtnl_link_ops_get(name);

Ugh.  Shouldn't we have the request_module logic here?
Otherwise it looks like we can skip the validate method and 
have other weird interactions.


> + } else {
> + name[0] = '\0';
> + ops = NULL;
> + }
> +
> + if (1) {
> + struct nlattr *attr[ops ? ops->maxtype + 1 : 0], **data = NULL;
> +
> + if (ops) {
> + if (ops->maxtype && linkinfo[IFLA_INFO_DATA]) {
> + err = nla_parse_nested(attr, ops->maxtype,
> + linkinfo[IFLA_INFO_DATA],
> +ops->policy);
> + if (err < 0)
> + return err;
> + data = attr;
> + }
> + if (ops->validate) {
> + err = ops->validate(tb, data);
> + if (err < 0)
> + return err;
> + }
> + }
> +
> + if (dev) {
> + int modified = 0;
> +
> + if (nlh->nlmsg_flags & NLM_F_EXCL)
> + return -EEXIST;
> + if (nlh->nlmsg_flags & NLM_F_REPLACE)
> + return -EOPNOTSUPP;
> +
> + if (linkinfo[IFLA_INFO_DATA]) {
> + if (!ops || ops != dev->rtnl_link_ops ||
> + !ops->changelink)
> + return -EOPNOTSUPP;
> +
> + err = ops->changelink(dev, tb, data);
> + if (err < 0)
> + return err;
> + modified = 1;
> + }
> +
> + return do_setlink(dev, ifm, tb, ifname, modified);
> + }
> +
> + if (!(nlh->nlmsg_flags & NLM_F_CREATE))
> + return -ENODEV;
> +
> + if (ifm->ifi_index)
> + return -EINVAL;
> + if (tb[IFLA_ADDRESS] || tb[IFLA_BROADCAST] || tb[IFLA_MAP])
> + return -EOPNOTSUPP;
> +
> +#ifdef CONFIG_KMOD
> + if (!ops && name[0]) {
> + /* race condition: device may be created while rtnl is
> +  * unlocked, final register_netdevice will catch it.
> +  */
> + __rtnl_unlock();
> + request_module("rtnl-link-%s", name);
> + rtnl_lock();
> + ops = rtnl_link_ops_get(name);
> + }
> +#endif
> + if (!ops)
> + return -EOPNOTSUPP;

> +
> + if (!ifname[0])
> + snprintf(ifname, IFNAMSIZ, "%s%%d", ops->name);
> + dev = alloc_netdev(ops->priv_size, ifname, ops->setup);
> + if (!dev)
> + return -ENOMEM;
> +
> + if (strchr(dev->name, '%')) {
> + err = dev_alloc_name(dev, dev->name);
> + if (err < 0)
> + goto err_free;
> + }
> + dev->rtnl_link_ops = ops;
> +
> + if (tb[IFLA_MTU])
> + dev->mtu = nla_get_u32(tb[IFLA_MTU]);
> + if (tb[IFLA_TXQLEN])
> + dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
> + if (tb[IFLA_WEIGHT])
> + dev->weight = nla_get_u32(tb[IFLA_WEIGHT]);
> + if (tb[IFLA_OPERSTATE])
> + set_ope

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Patrick McHardy

Alexey Kuznetsov wrote:
>>   I just suggested to
>>Pavel to create only a single device per newlink operation and binding
>>them later,
> 
> 
> I see some logical inconsistency here.
> 
> Look, the second end is supposed to be in another namespace.
> It will have identity, which cannot be expressed in any way in namespace,
> which is allowed to create the pair: name, ifindex - nothing
> is shared between namespaces.

Good point, I didn't think of that. Is there a version of this patch
that already uses different namespaces so I can look at it?

Are network namespace completely seperated or is there some hierarchy
with all lower namespaces visible above or something like that?

> Moreover, do not forget we have two netlink spaces as well.
> Events happening in one namespace are reported only inside that namespace.
> 
> Actually, the only self-consistent solution, which I see right now
> (sorry, did not think that much) is to create the whole pair as
> one operation; required parameters (name of partner, identity of namespace)
> can be passed as attributes. I guess IFLA_PARTNER approach suggests
> the same thing, right?

I imagined it more as a bind operation, pretty similar to enslave, so
it would only contain an ifindex, no parameters. But as you say that
doesn't work, so I guess we'd have to nest an entire ifinfomsg + the
attributes for the partner device under it .. not exactly pretty.

> As response to this action two replies are generated: one RTM_NEWLINK
> for one end of device with the whole desciption of partnet
> is broadcasted inside this namespace, another RTM_NETLINK with index/name
> of partner device is broadcasted inside the second namespace
> (and, probably, some attributes, which must be hidden inside namespace,
> f.e. identity of main device is suppressed). 

The identity of the main device has no meaning within a different
namespace, but are there other reasons for hiding it?
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread Stephen Hemminger

On Wed, 06 Jun 2007 19:11:38 +0400
Pavel Emelianov <[EMAIL PROTECTED]> wrote:

> Veth stands for Virtual ETHernet. It is a simple tunnel driver
> that works at the link layer and looks like a pair of ethernet
> devices interconnected with each other.
> 
> Mainly it allows to communicate between network namespaces but
> it can be used as is as well.
> 
> Eric recently sent a similar driver called etun. This
> implementation uses another interface - the RTM_NRELINK
> message introduced by Patric. The patch fits today netdev
> tree with Patrick's patches.
> 
> The newlink callback is organized that way to make it easy
> to create the peer device in the separate namespace when we
> have them in kernel.
> 
> The patch for an ip utility is also provided.
> 
> Eric, since ethtool interface was from your patch, I add
> your Signed-off-by line.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
> Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]>
> 
> ---
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 7d57f4a..7e144be 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -119,6 +119,12 @@ config TUN
>  
> If you don't know what to use this for, you don't need it.
>  
> +config VETH
> + tristate "Virtual ethernet device"
> + ---help---
> +   The device is an ethernet tunnel. Devices are created in pairs. When
> +   one end receives the packet it appears on its pair and vice versa.
> +
>  config NET_SB1000
>   tristate "General Instruments Surfboard 1000"
>   depends on PNP
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index a77affa..4764119 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
>  obj-$(CONFIG_MACMACE) += macmace.o
>  obj-$(CONFIG_MAC89x0) += mac89x0.o
>  obj-$(CONFIG_TUN) += tun.o
> +obj-$(CONFIG_VETH) += veth.o
>  obj-$(CONFIG_NET_NETX) += netx-eth.o
>  obj-$(CONFIG_DL2K) += dl2k.o
>  obj-$(CONFIG_R8169) += r8169.o
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> new file mode 100644
> index 000..6746c91
> --- /dev/null
> +++ b/drivers/net/veth.c
> @@ -0,0 +1,391 @@
> +/*
> + *  drivers/net/veth.c
> + *
> + *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +
> +#define DRV_NAME "veth"
> +#define DRV_VERSION  "1.0"
> +
> +struct veth_priv {
> + struct net_device *peer;
> + struct net_device *dev;
> + struct list_head list;
> + struct net_device_stats stats;
> + unsigned ip_summed;
> +};
> +
> +static LIST_HEAD(veth_list);
> +
> +/*
> + * ethtool interface
> + */
> +
> +static struct {
> + const char string[ETH_GSTRING_LEN];
> +} ethtool_stats_keys[] = {
> + { "peer_ifindex" },
> +};

Seems like a good usage of sysfs attributes, rather than ethtool.

Then you can get rid of all the useless ethtool for what is
basically a virtual device.

> +/*
> + * xmit
> + */
> +
> +static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct net_device *rcv = NULL;
> + struct veth_priv *priv, *rcv_priv;
> + int length;
> +
> + skb_orphan(skb);
> +
> + priv = netdev_priv(dev);
> + rcv = priv->peer;
> + rcv_priv = netdev_priv(rcv);
> +
> + if (!(rcv->flags & IFF_UP))
> + goto outf;
> +
> + skb->dev = rcv;
> + skb->pkt_type = PACKET_HOST;
> + skb->protocol = eth_type_trans(skb, rcv);
> + if (dev->features & NETIF_F_NO_CSUM)
> + skb->ip_summed = rcv_priv->ip_summed;
> +
> + dst_release(skb->dst);
> + skb->dst = NULL;
> +
> + secpath_reset(skb);
> + nf_reset(skb);
> +
> + length = skb->len;
> +
> + priv->stats.tx_bytes += length;
> + priv->stats.tx_packets++;
> +
> + rcv_priv->stats.rx_bytes += length;
> + rcv_priv->stats.rx_packets++;

Per-cpu stats? This will cacheline thrash.

> + netif_rx(skb);
> + return 0;
> +
> +outf:
> + kfree_skb(skb);
> + priv->stats.tx_dropped++;
> + return 0;
> +}



-- 
Stephen Hemminger <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Patrick McHardy

Eric W. Biederman wrote:
> Patrick McHardy <[EMAIL PROTECTED]> writes:
> 
> 
>>>I still think adding a IFLA_PARTNER or a custom attribute is cleaner
>>>in this case.  Slight semantic mismatches are the worst design bugs
>>>to correct.
>>
>>
>>Indeed, IFLA_PARTNER sounds like a better idea. I just suggested to
>>Pavel to create only a single device per newlink operation and binding
>>them later, what do you think about that?
> 
> 
> I don't think it solves much because we still need a way to report the
> partner device.

I was thinking of something like this:

ip link add veth0 type veth
ip link add veth1 partner veth0 type veth

ip would resolve veth0 to an ifindex and set IFLA_PARTNER. But Alexey
just raised a few good points, so this might not work.

> On the actual using side I think it makes the core of the driver much
> more difficult to get right.  
> 
> Basically if we can't count on having a partner device we have to
> add NULL pointer checks and locking to the packet dispatch which
> is otherwise unnecessary.

All you'd need to do is keep the queue stopped until the device
is bound. No changes to rx or tx path neccessary.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Eric W. Biederman

Patrick McHardy <[EMAIL PROTECTED]> writes:

>> I still think adding a IFLA_PARTNER or a custom attribute is cleaner
>> in this case.  Slight semantic mismatches are the worst design bugs
>> to correct.
>
>
> Indeed, IFLA_PARTNER sounds like a better idea. I just suggested to
> Pavel to create only a single device per newlink operation and binding
> them later, what do you think about that?

I don't think it solves much because we still need a way to report the
partner device.

On the actual using side I think it makes the core of the driver much
more difficult to get right.  

Basically if we can't count on having a partner device we have to
add NULL pointer checks and locking to the packet dispatch which
is otherwise unnecessary.

Eric

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: warnings in git-wireless

2007-06-06 Thread James Ketrenos

Christoph Hellwig wrote:
> On Tue, Jun 05, 2007 at 01:12:03PM -0700, James Ketrenos wrote:
>> Yes, we certainly don't want a driver to be "near mainline" that does 
>> things that the rest of the kernel and other drivers are doing.  We should 
>> force them to stay out-of-tree until any and everything is resolved.  
>> Heaven forbid that the code should be merged, contributed, and improved 
>> upon as a community.
>>
>> You'd think these guys were trying to enable folks with a driver so they 
>> can use the hardware they bought or something.  Its almost laughable.  What 
>> idiots.
> 
> Maybe you should make sure your companies publishes specs

I'm flattered that you think I am high enough at Intel to do that.  If it was 
in my power, accurate and complete specifications would be available for all of 
Intel's hardware--for everyone.  I doubt you'll find a single Linux developer 
at Intel who is not fighting to try and reach that goal, including myself.

The reality is, however, (and this may shock you) I am not a vice president or 
head of a major division at Intel.  Its true.  Intel doesn't do what I want 
just because I snap my fingers and say it should be so.

Instead, we try and do what we can -- enable people with a driver (and a 
community to contribute into) so that the hardware they purchased can operate 
to its fullest potential on Linux.  We realize within the community that some 
areas are difficult for people outside of Intel to initially contribute to 
given the lack of specs.  However, there are vast quantities of features and 
functionality that can be enabled and improved given nothing more than the 
driver sources.

Anyway, its obvious from your emails that our efforts are not sufficient and 
you seem to resent that we're even trying.  I apologize.

If you'd like to submit patches to help us make the perfect driver, we're very 
willing to review and merge those in.

James
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Alexey Kuznetsov

Hello!

>I just suggested to
> Pavel to create only a single device per newlink operation and binding
> them later,

I see some logical inconsistency here.

Look, the second end is supposed to be in another namespace.
It will have identity, which cannot be expressed in any way in namespace,
which is allowed to create the pair: name, ifindex - nothing
is shared between namespaces.

Moreover, do not forget we have two netlink spaces as well.
Events happening in one namespace are reported only inside that namespace.

Actually, the only self-consistent solution, which I see right now
(sorry, did not think that much) is to create the whole pair as
one operation; required parameters (name of partner, identity of namespace)
can be passed as attributes. I guess IFLA_PARTNER approach suggests
the same thing, right?

As response to this action two replies are generated: one RTM_NEWLINK
for one end of device with the whole desciption of partnet
is broadcasted inside this namespace, another RTM_NETLINK with index/name
of partner device is broadcasted inside the second namespace
(and, probably, some attributes, which must be hidden inside namespace,
f.e. identity of main device is suppressed). 

Alexey


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread Patrick McHardy

Pavel Emelianov wrote:

> +MODULE_DESCRIPTION("Virtual Ethernet Tunnel");
> +MODULE_LICENSE("GPL v2");

This seems to be missing MODULE_ALIAS_RTNL_LINK("veth");
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Patrick McHardy

Eric W. Biederman wrote:
> Patrick McHardy <[EMAIL PROTECTED]> writes:
> 
>>You don't really need to patch it, installing a new iplink_XXX.so
>>file is enough. Generalizing driver specific options more than
>>what we currently have doesn't look very promising. I think your
>>driver was simple enough to get along with the generic device
>>attributes though (IFLA_LINK or IFLA_MASTER).
> 
> 
> I need to know the other device in the pair of devices I am creating.
> If ifindex was selectable IFLA_LINK or IFLA_MASTER might be
> interesting however they are currently are not, and I'm not quite
> certain about letting a user select the ifindex.  Although there may
> come a point when dealing with migration when it makes sense.


It shouldn't be very hard to implement, so far I just didn't see
any use for it.

> Hmm.  I guess if I had a reasonable default I could find out the
> pair device by looking at the returned NEW_LINK message.
> 
> Looking more close.  IFLA_MASTER does not work because I don't
> have a master/slave relationship.
> 
> IFLA_LINK looks like it will work.  I don't precisely match the
> semantics though which makes me nervous.  In particular my other
> device is not something I am sending through but what I am sending
> to.  The way the IPv6 code uses iflink to get the link local address
> starting with the hardware address of the iflink would be completely
> the wrong thing to do in my case.  Now my device won't have the magic
> IPv6 tunnel arp type so that code won't trigger.  Still it is
> a challenge.
> 
> I still think adding a IFLA_PARTNER or a custom attribute is cleaner
> in this case.  Slight semantic mismatches are the worst design bugs
> to correct.


Indeed, IFLA_PARTNER sounds like a better idea. I just suggested to
Pavel to create only a single device per newlink operation and binding
them later, what do you think about that?

>>>I do think we should specify the IFLA_KIND (was: IFLA_NAME) values in
>>>a header file.  So it is easy to get a list of all of the different
>>>kinds and so we don't conflict.
>>
>>
>>I don't think conflicts are going to be a problem (it would be
>>nice if modpost would warn about duplicate aliases though).
>>How is listing IFLA_KIND types in a header file going to help
>>get a list? Presuming the user knows what kind of device he
>>wants to set up and is not just looking for things to play
>>around with I also don't see any real value in this.
> 
> 
> This isn't about the user this is about maintaining the ABI.
> 
> We have to control set of strings for IFLA_KIND.  Having them all
> in a single header file means that we can easily look when we add
> support for a new kind to see if some other driver has already
> used that kind.
> 
> The same reason we stick the rest of the enumerations into a header
> file.
> 
> Strings don't conflict as easily as small integers do, but it is still
> possible to have a conflict, so having something like an ifla_kind.h to
> hold them would be useful.


Mhh .. we have multiple string based APIs that do just fine. I'd
prefer having someone adding a new driver do a quick grep for
MODULE_ALIAS_RTNL_LINK to adding unused definitions.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread Patrick McHardy

Pavel Emelianov wrote:
> Veth stands for Virtual ETHernet. It is a simple tunnel driver
> that works at the link layer and looks like a pair of ethernet
> devices interconnected with each other.
> 
> Mainly it allows to communicate between network namespaces but
> it can be used as is as well.
> 
> Eric recently sent a similar driver called etun. This
> implementation uses another interface - the RTM_NRELINK
> message introduced by Patric. The patch fits today netdev
> tree with Patrick's patches.
> 
> The newlink callback is organized that way to make it easy
> to create the peer device in the separate namespace when we
> have them in kernel.
> 

> +struct veth_priv {
> + struct net_device *peer;
> + struct net_device *dev;
> + struct list_head list;
> + struct net_device_stats stats;


You can use dev->stats instead.

> +static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct net_device *rcv = NULL;
> + struct veth_priv *priv, *rcv_priv;
> + int length;
> +
> + skb_orphan(skb);
> +
> + priv = netdev_priv(dev);
> + rcv = priv->peer;
> + rcv_priv = netdev_priv(rcv);
> +
> + if (!(rcv->flags & IFF_UP))
> + goto outf;
> +
> + skb->dev = rcv;

eth_type_trans already sets skb->dev.

> + skb->pkt_type = PACKET_HOST;
> + skb->protocol = eth_type_trans(skb, rcv);
> + if (dev->features & NETIF_F_NO_CSUM)
> + skb->ip_summed = rcv_priv->ip_summed;
> +
> + dst_release(skb->dst);
> + skb->dst = NULL;
> +
> + secpath_reset(skb);
> + nf_reset(skb);


Is skb->mark supposed to survive communication between different
namespaces?

> +static const struct nla_policy veth_policy[VETH_INFO_MAX] = {
> + [VETH_INFO_MAC] = { .type = NLA_BINARY, .len = ETH_ALEN },
> + [VETH_INFO_PEER]= { .type = NLA_STRING },
> + [VETH_INFO_PEER_MAC]= { .type = NLA_BINARY, .len = ETH_ALEN },
> +};


The rtnl_link codes looks fine. I don't like the VETH_INFO_MAC attribute
very much though, we already have a generic device attribute for MAC
addresses. Of course that only allows you to supply one MAC address, so
I'm wondering what you think of allocating only a single device per
newlink operation and binding them in a seperate enslave operation?

> +enum {
> + VETH_INFO_UNSPEC,
> + VETH_INFO_MAC,
> + VETH_INFO_PEER,
> + VETH_INFO_PEER_MAC,
> +
> + VETH_INFO_MAX
> +};

Please follow the

#define VETH_INFO_MAX   (__VETH_INFO_MAX - 1)

convention here.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Eric W. Biederman

Patrick McHardy <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>> Reading through the patches they look usable to me.
>> 
>> Having to patch iproute to create the more interesting network
>> devices sucks, but that problem seems fundamental.  We might
>> be able to avoid it if we allowed fields to be reused between
>> different types of devices but that makes the error checking
>> trickier, and we aren't likely to have that many types of
>> devices so there likely isn't much value in generalizing.
>
>
> You don't really need to patch it, installing a new iplink_XXX.so
> file is enough. Generalizing driver specific options more than
> what we currently have doesn't look very promising. I think your
> driver was simple enough to get along with the generic device
> attributes though (IFLA_LINK or IFLA_MASTER).

I need to know the other device in the pair of devices I am creating.
If ifindex was selectable IFLA_LINK or IFLA_MASTER might be
interesting however they are currently are not, and I'm not quite
certain about letting a user select the ifindex.  Although there may
come a point when dealing with migration when it makes sense.

Hmm.  I guess if I had a reasonable default I could find out the
pair device by looking at the returned NEW_LINK message.

Looking more close.  IFLA_MASTER does not work because I don't
have a master/slave relationship.

IFLA_LINK looks like it will work.  I don't precisely match the
semantics though which makes me nervous.  In particular my other
device is not something I am sending through but what I am sending
to.  The way the IPv6 code uses iflink to get the link local address
starting with the hardware address of the iflink would be completely
the wrong thing to do in my case.  Now my device won't have the magic
IPv6 tunnel arp type so that code won't trigger.  Still it is
a challenge.

I still think adding a IFLA_PARTNER or a custom attribute is cleaner
in this case.  Slight semantic mismatches are the worst design bugs
to correct.

To some extent this is a practical problematic point for me, because
in the context of multiple network namespaces I could theoretically
have both network devices have the same name and the same ifindex in
different network namespaces.  Although it really doesn't matter
unless they are in the same network namespace in which case they
can't have the same ifindex or ifname.

>> I do think we should specify the IFLA_KIND (was: IFLA_NAME) values in
>> a header file.  So it is easy to get a list of all of the different
>> kinds and so we don't conflict.
>
>
> I don't think conflicts are going to be a problem (it would be
> nice if modpost would warn about duplicate aliases though).
> How is listing IFLA_KIND types in a header file going to help
> get a list? Presuming the user knows what kind of device he
> wants to set up and is not just looking for things to play
> around with I also don't see any real value in this.

This isn't about the user this is about maintaining the ABI.

We have to control set of strings for IFLA_KIND.  Having them all
in a single header file means that we can easily look when we add
support for a new kind to see if some other driver has already
used that kind.

The same reason we stick the rest of the enumerations into a header
file.

Strings don't conflict as easily as small integers do, but it is still
possible to have a conflict, so having something like an ifla_kind.h to
hold them would be useful.

Eric
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Module for ip utility to support veth device

2007-06-06 Thread Patrick McHardy

Pavel Emelianov wrote:
> diff --git a/ip/iplink.c b/ip/iplink.c
> index 5170419..6975990 100644
> --- a/ip/iplink.c
> +++ b/ip/iplink.c
> @@ -287,7 +287,7 @@ static int iplink_modify(int cmd, unsign
>strlen(type));
>  
>   lu = get_link_type(type);
> - if (lu) {
> + if (lu && argc) {
>   struct rtattr * data = NLMSG_TAIL(&req.n);
>   addattr_l(&req.n, sizeof(req), IFLA_INFO_DATA, NULL, 0);

I've folded this part into my iproute patch, thanks.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Module for ip utility to support veth device

2007-06-06 Thread Pavel Emelianov

The usage is
# ip link add [name] type veth [peer ] [mac ] [peer_mac ]

The Makefile is maybe not as beautiful as it could be. It
is to be discussed.

One thing I noticed during testing is the following. When launching
this with link_veth.so module and not specifying any module specific
parameters, the kernel refuses to accept the packet when parsing the
IFLA_LINKINFO. So the hunk for ip/iplink.c doesn't add an empty extra 
header if no extra data expected.

Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]>

---

diff --git a/ip/Makefile b/ip/Makefile
index 9a5bfe3..b46bce3 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -8,8 +8,9 @@ RTMONOBJ=rtmon.o
 ALLOBJ=$(IPOBJ) $(RTMONOBJ)
 SCRIPTS=ifcfg rtpr routel routef
 TARGETS=ip rtmon
+LIBS=link_veth.so
 
-all: $(TARGETS) $(SCRIPTS)
+all: $(TARGETS) $(SCRIPTS) $(LIBS)
 
 ip: $(IPOBJ) $(LIBNETLINK) $(LIBUTIL)
 
@@ -24,3 +25,6 @@ clean:
 
 LDLIBS += -ldl
 LDFLAGS+= -Wl,-export-dynamic
+
+%.so: %.c
+   $(CC) $(CFLAGS) -shared $< -o $@
diff --git a/ip/iplink.c b/ip/iplink.c
index 5170419..6975990 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -287,7 +287,7 @@ static int iplink_modify(int cmd, unsign
 strlen(type));
 
lu = get_link_type(type);
-   if (lu) {
+   if (lu && argc) {
struct rtattr * data = NLMSG_TAIL(&req.n);
addattr_l(&req.n, sizeof(req), IFLA_INFO_DATA, NULL, 0);
 
diff --git a/ip/link_veth.c b/ip/link_veth.c
new file mode 100644
index 000..adfdef6
--- /dev/null
+++ b/ip/link_veth.c
@@ -0,0 +1,77 @@
+#include 
+
+#include "utils.h"
+#include "ip_common.h"
+#include "veth.h"
+
+#define ETH_ALEN   6
+
+static void usage(void)
+{
+   printf("Usage: ip link add ... "
+   "[peer ] [mac ] [peer_mac ]\n");
+}
+
+static int veth_parse_opt(struct link_util *lu, int argc, char **argv,
+   struct nlmsghdr *hdr)
+{
+   __u8 mac[ETH_ALEN];
+
+   for (; argc != 0; argv++, argc--) {
+   if (strcmp(*argv, "peer") == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   addattr_l(hdr, 1024, VETH_INFO_PEER,
+   *argv, strlen(*argv));
+
+   continue;
+   }
+
+   if (strcmp(*argv, "mac") == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+   return -1;
+
+   addattr_l(hdr, 1024, VETH_INFO_MAC,
+   mac, ETH_ALEN);
+   continue;
+   }
+
+   if (strcmp(*argv, "peer_mac") == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+   return -1;
+
+   addattr_l(hdr, 1024, VETH_INFO_PEER_MAC,
+   mac, ETH_ALEN);
+   continue;
+   }
+
+   usage();
+   return -1;
+   }
+
+   return 0;
+}
+
+struct link_util veth_link_util = {
+   .id = "veth",
+   .parse_opt = veth_parse_opt,
+};
diff --git a/ip/veth.h b/ip/veth.h
new file mode 100644
index 000..74c8e1e
--- /dev/null
+++ b/ip/veth.h
@@ -0,0 +1,13 @@
+#ifndef __NET_VETH_H__
+#define __NET_VETH_H__
+
+enum {
+   VETH_INFO_UNSPEC,
+   VETH_INFO_MAC,
+   VETH_INFO_PEER,
+   VETH_INFO_PEER_MAC,
+
+   VETH_INFO_MAX
+};
+
+#endif
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Multiqueue network device support.

2007-06-06 Thread Patrick McHardy

jamal wrote:
> There will be no issue if a) multiple APIs would be allowed for driver
> multi-rings[1] and b) you didnt touch the qdiscs.
>  
> Given that #a is not a sensible thing to do since there can only be one
> API and for #b you are not compromising, what do you want me to do?

I haven't followed the entire discussion, but I still don't see a
alternative to touching the qdisc layer - multiple hardware queues
need multiple queue states if you want to avoid a busy hardware
queue stopping the qdisc entirely and thereby preventing the qdisc
to continue feeding packets to other active HW queues. And to make
use of the multiple queue states you need multiple queues.

I would love to see your alternative patches.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Virtual ethernet tunnel

2007-06-06 Thread Pavel Emelianov

Veth stands for Virtual ETHernet. It is a simple tunnel driver
that works at the link layer and looks like a pair of ethernet
devices interconnected with each other.

Mainly it allows to communicate between network namespaces but
it can be used as is as well.

Eric recently sent a similar driver called etun. This
implementation uses another interface - the RTM_NRELINK
message introduced by Patric. The patch fits today netdev
tree with Patrick's patches.

The newlink callback is organized that way to make it easy
to create the peer device in the separate namespace when we
have them in kernel.

The patch for an ip utility is also provided.

Eric, since ethtool interface was from your patch, I add
your Signed-off-by line.

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]>

---

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 7d57f4a..7e144be 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -119,6 +119,12 @@ config TUN
 
  If you don't know what to use this for, you don't need it.
 
+config VETH
+   tristate "Virtual ethernet device"
+   ---help---
+ The device is an ethernet tunnel. Devices are created in pairs. When
+ one end receives the packet it appears on its pair and vice versa.
+
 config NET_SB1000
tristate "General Instruments Surfboard 1000"
depends on PNP
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index a77affa..4764119 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MACMACE) += macmace.o
 obj-$(CONFIG_MAC89x0) += mac89x0.o
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_DL2K) += dl2k.o
 obj-$(CONFIG_R8169) += r8169.o
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
new file mode 100644
index 000..6746c91
--- /dev/null
+++ b/drivers/net/veth.c
@@ -0,0 +1,391 @@
+/*
+ *  drivers/net/veth.c
+ *
+ *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#define DRV_NAME   "veth"
+#define DRV_VERSION"1.0"
+
+struct veth_priv {
+   struct net_device *peer;
+   struct net_device *dev;
+   struct list_head list;
+   struct net_device_stats stats;
+   unsigned ip_summed;
+};
+
+static LIST_HEAD(veth_list);
+
+/*
+ * ethtool interface
+ */
+
+static struct {
+   const char string[ETH_GSTRING_LEN];
+} ethtool_stats_keys[] = {
+   { "peer_ifindex" },
+};
+
+static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+{
+   cmd->supported  = 0;
+   cmd->advertising= 0;
+   cmd->speed  = SPEED_1;
+   cmd->duplex = DUPLEX_FULL;
+   cmd->port   = PORT_TP;
+   cmd->phy_address= 0;
+   cmd->transceiver= XCVR_INTERNAL;
+   cmd->autoneg= AUTONEG_DISABLE;
+   cmd->maxtxpkt   = 0;
+   cmd->maxrxpkt   = 0;
+   return 0;
+}
+
+static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo 
*info)
+{
+   strcpy(info->driver, DRV_NAME);
+   strcpy(info->version, DRV_VERSION);
+   strcpy(info->fw_version, "N/A");
+}
+
+static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+   switch(stringset) {
+   case ETH_SS_STATS:
+   memcpy(buf, ðtool_stats_keys, sizeof(ethtool_stats_keys));
+   break;
+   }
+}
+
+static int veth_get_stats_count(struct net_device *dev)
+{
+   return ARRAY_SIZE(ethtool_stats_keys);
+}
+
+static void veth_get_ethtool_stats(struct net_device *dev,
+   struct ethtool_stats *stats, u64 *data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   data[0] = priv->peer->ifindex;
+}
+
+static u32 veth_get_rx_csum(struct net_device *dev)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   return priv->ip_summed == CHECKSUM_UNNECESSARY;
+}
+
+static int veth_set_rx_csum(struct net_device *dev, u32 data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   priv->ip_summed = data ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE;
+   return 0;
+}
+
+static u32 veth_get_tx_csum(struct net_device *dev)
+{
+   return (dev->features & NETIF_F_NO_CSUM) != 0;
+}
+
+static int veth_set_tx_csum(struct net_device *dev, u32 data)
+{
+   if (data)
+   dev->features |= NETIF_F_NO_CSUM;
+   else
+   dev->features &= ~NETIF_F_NO_CSUM;
+   return 0;
+}
+
+static struct ethtool_ops veth_ethtool_ops = {
+   .get_settings   = veth_get_settings,
+   .get_drvinfo= veth_get_drvinfo,
+   .get_link   = ethtool_op_get_link,
+   .get_rx_csum= veth_get_rx_csum,
+   .set_rx_csum= vet

Re: [PATCH] acenic: SET_NETDEV_DEV is always there these days

2007-06-06 Thread Jes Sorensen

> "Geert" == Geert Uytterhoeven <[EMAIL PROTECTED]> writes:

Geert> acenic: SET_NETDEV_DEV is always there these days

Geert> Signed-off-by: Geert Uytterhoeven <[EMAIL PROTECTED]>

Signed-off-by: Jes Sorensen <[EMAIL PROTECTED]>

Geert> --- Disclaimer: not tested at all

Geert> --- a/drivers/net/acenic.c +++ b/drivers/net/acenic.c @@ -159,10
Geert> +159,6 @@ static struct pci_device_id acenic_pci_t };
Geert> MODULE_DEVICE_TABLE(pci, acenic_pci_tbl);
 
Geert> -#ifndef SET_NETDEV_DEV -#define SET_NETDEV_DEV(net, pdev) do{}
Geert> while(0) -#endif - #define ace_sync_irq(irq) synchronize_irq(irq)
 
Geert>  #ifndef offset_in_page

Geert> Gr{oetje,eeting}s,

Geert>  Geert

Geert> -- Geert Uytterhoeven -- Sony Network and Software Technology
Geert> Center Europe (NSCE) [EMAIL PROTECTED] --- The
Geert> Corporate Village, Da Vincilaan 7-D1 Voice +32-2-7008453 Fax
Geert> +32-2-7008622  B-1935 Zaventem, Belgium
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] acenic: SET_NETDEV_DEV is always there these days

2007-06-06 Thread Geert Uytterhoeven

acenic: SET_NETDEV_DEV is always there these days

Signed-off-by: Geert Uytterhoeven <[EMAIL PROTECTED]>

---
Disclaimer: not tested at all

--- a/drivers/net/acenic.c
+++ b/drivers/net/acenic.c
@@ -159,10 +159,6 @@ static struct pci_device_id acenic_pci_t
 };
 MODULE_DEVICE_TABLE(pci, acenic_pci_tbl);
 
-#ifndef SET_NETDEV_DEV
-#define SET_NETDEV_DEV(net, pdev)  do{} while(0)
-#endif
-
 #define ace_sync_irq(irq)  synchronize_irq(irq)
 
 #ifndef offset_in_page

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- Sony Network and Software Technology Center Europe (NSCE)
[EMAIL PROTECTED] --- The Corporate Village, Da Vincilaan 7-D1
Voice +32-2-7008453 Fax +32-2-7008622  B-1935 Zaventem, Belgium
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[WIP][PATCHES] Network xmit batching

2007-06-06 Thread jamal

Folks,

While Krishna and I have been attempting this on the side, progress has
been rather slow - so this is to solicit for more participation so we
can get this over with faster. Success (myself being conservative when
it comes to performance) requires testing on a wide variety of hardware.

The results look promising - certainly from a pktgen perspective where
performance has been known in some cases to go up over 50%.
Tests by Sridhar on a low number of TCP flows also indicate improved
performance as well as lowered CPU use.
 
I have setup the current state of my patches against Linus tree at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

This is also clean against 2.6.22-rc4. So if you want just a diff that
will work against 2.6.22-rc4 - i can send it to you.
I also have a tree against Daves net-2.6 at 
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-net26.git
but iam abandoning that effort until we get this stable due to the
occasional bug that cropped up(like e1000).

I am attaching a pktgen script. There is one experimental parameter
called "batch_low" - for starters just leave it at 0 in order to reduce
experimental variance. If you have solid results you can muck around
with it.
KK has a netperf script he has been using - if you know netperf your
help will really be appreciated in testing it on your hardware.
KK, can you please post your script?
Testing with forwarding and bridging will also be appreaciated.
Above that, suggestions to changes as long as they are based on
verifiable results or glaringly obvious changes are welcome. My
preference at the moment is to flesh out the patch as is and then
improve on it later if it shows it has some value on a wide variety of
apps. As the subject is indicating this is a WIP and as all eulas
suggest "subject to change without notice". 
If you help out, when you post your results, can you please say what
hardware and setup was?

The only real driver that has been changed is e1000 for now. KK is
working on something infiniband related and i plan (if noone beats me)
to get tg3 working. It would be nice if someone converted some 10G
ethernet driver.

cheers,
jamal


pktgen.batch-1-1
Description: application/shellscript

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Patrick McHardy

Patrick McHardy wrote:
> The following patches contain the rtnetlink link creation API I promised,
> as well as two simple driver conversion to use the API as an example.
> I've also converted VLAN as a more complex example, but these patches
> need some more work and are most likely not interesting to all the CCed
> parties, so I'm sending them seperately.

I've updated the patches to remove the broken VLAN ID change, added back
some consts, renamed IFLA_INFO_NAME to IFLA_INFO_KIND and rebased to
current net-2.6. The current patches and -git trees can be found at
http://people.netfilter.org/kaber/rtnl_link/

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Socket hangs in read() when putting interface down

2007-06-06 Thread chris schlund

>> This may be ok on a major interface like eth0 but it is a problem when
>> using a ppp interface via GSM data or GPRS connection.
>> Killing the pppd while the socket waits in read() produces the same
>> behaviour. Also a broken connection looks the same.
>> It is reproducable on several 2.6 kernel versions including the lastest
>> stable 2.6.21.
>>
>> On my understanding "ifconfig down" should lead into a read() error.
> 
> Why?  Can't traffic for the endpoint arrive via some other interface on
> the system?  I would think that getting out of a blocking read()
> requires the application to either have a timer running (and so get out
> of read() with EINTR) or to have keepalives enabled on the TCP
> connection (I'm assuming TCP for an AF_INET, SOCK_STREAM although I
> suppose it could be SCTP)
> 
> rick jones

Of course, you are right.
I am so focused to my small embedded world, that I simply have forgotten
how IP-traffic works in big systems (to clarify my first post: I'm using
TCP).

Your hint with the self produced signal to get out of read() was very
helpful for me. Thats what I was missing.

I also learned another think, while reading the list FAQ:
PEBCAC (=Problem exists between chair and computer)
Obviously, thats what best described me!

Thank you very much!

Chris Schlund



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Patrick McHardy

Eric W. Biederman wrote:
> Reading through the patches they look usable to me.
> 
> Having to patch iproute to create the more interesting network
> devices sucks, but that problem seems fundamental.  We might
> be able to avoid it if we allowed fields to be reused between
> different types of devices but that makes the error checking
> trickier, and we aren't likely to have that many types of
> devices so there likely isn't much value in generalizing.

You don't really need to patch it, installing a new iplink_XXX.so
file is enough. Generalizing driver specific options more than
what we currently have doesn't look very promising. I think your
driver was simple enough to get along with the generic device
attributes though (IFLA_LINK or IFLA_MASTER).

> I do think we should specify the IFLA_KIND (was: IFLA_NAME) values in
> a header file.  So it is easy to get a list of all of the different
> kinds and so we don't conflict.

I don't think conflicts are going to be a problem (it would be
nice if modpost would warn about duplicate aliases though).
How is listing IFLA_KIND types in a header file going to help
get a list? Presuming the user knows what kind of device he
wants to set up and is not just looking for things to play
around with I also don't see any real value in this.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC RTNETLINK 04/09]: Link creation API

2007-06-06 Thread Patrick McHardy

Stephen Hemminger wrote:
> On Wed, 06 Jun 2007 01:17:11 +0200
> Patrick McHardy <[EMAIL PROTECTED]> wrote:
> 
>>>If you want I'll extend existing bridge netlink to use these.
>>
>>
>>Are you talking about brige-port information or bridge device
>>configuration? So far the API is not suitable for anything that
>>currently uses IFLA_PROTINFO because the sender is not the driver
>>which created the device and doesn't use AF_UNSPEC. For bridge
>>device configuration it would certainly be nice to have, but I'm
>>not sure yet how to handle enslave operations. So far my favourite
>>idea is to add enslave/release operations to rtnl_link_ops and call
>>them when IFLA_MASTER is set (so the netlink message would look like
>>this: ifindex: eth0 master: br0 nlmsg_type: RTM_NETLINK). But I
>>haven't really thought this through yet.
> 
> 
> Was thinking AF_BRIDGE (we have it already so use it), and both
> add/remove bridge, and enslave/unslave device.


I think we should use two seperate families for bridge devices
and bridge ports. I'll think about the enslave operation some
more and try to add it next week.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 5/7] CAN: Add virtual CAN netdevice driver

2007-06-06 Thread Patrick McHardy

Urs Thuermann wrote:
> Patrick McHardy <[EMAIL PROTECTED]> writes:
> 
>>I don't get why you can't directly check the socket option on the
>>TX path.
> 
> 
> We have several types of sockets in the PF_CAN family, two of which
> are GPL'ed and which are in the patch series.  These are CAN_RAW and
> CAN_BCM.  The protocol implementations use can_send() in af_can.c to
> send a CAN frame and indicate to can_send() in an int argument,
> whether this frame should be looped back.  Only the raw protocol has a
> socket option (setsockopt(2)) in struct raw_sock for this, bcm always
> sets this to 1 to have the frame looped back.  There is no option in
> struct bcm_sock for this.  In can_send() and in the driver we don't
> know what type of socket skb->sk points to and can't check that
> option.  Changing this would mean we have to add such an option in the
> same position in all CAN socket types and set it to fixed values in
> some of them (e.g. to 1 for bcm).  While it's doable, I wouldn't like
> that very much.
> 
> Is there anything that prevents can_send() from using skb->pkt_type to
> pass the loopback flag down to the driver?


No, that should be fine, in fact it should set it anyway even if you
can't use it for this purpose.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] NetXen: Add NETXEN prefix to macros

2007-06-06 Thread Mithlesh Thukral

He seems to be the same guy who has dropped the patches from RHEL5 bugzilla 
for now.

--
Mithlesh Thukral

On Tuesday 05 June 2007 00:28, Andy Gospodarek wrote:
> On Sun, Jun 03, 2007 at 11:50:29AM -0400, Jeff Garzik wrote:
> > Mithlesh Thukral wrote:
> > >NetXen: Add NETXEN prefix to a macro
> > >This patch will add the "NETXEN" prefix to "USER_START" macro.
> > >
> > >Signed-off by: Wen Xiong <[EMAIL PROTECTED]>
> > >Signed-off by: Mithlesh Thukral <[EMAIL PROTECTED]>
> > >---
> > >
> > > drivers/net/netxen/netxen_nic.h |4 ++--
> > > drivers/net/netxen/netxen_nic_ethtool.c |2 +-
> > > drivers/net/netxen/netxen_nic_hw.c  |4 ++--
> > > drivers/net/netxen/netxen_nic_init.c|4 ++--
> > > 4 files changed, 7 insertions(+), 7 deletions(-)
> >
> > Your patch description is useless.  Clearly we know -what- it does,
> > simply by reading the patch.
> >
> > But it does not answer the simple question:  why?  why is this needed in
> > a bug fix Release Candidate series?
>
> I can't see why this is needed in an RC branch, but the use of
> generically named macros/variables has been problematic with this driver
> on more 'obscure' arches.
>
> I submitted a bigger cleanup patch a in March that never got taken so
> this is still a problem.  Here is that patch:
>
> Signed-off-by: Andy Gospodarek <[EMAIL PROTECTED]>
> ---
>
>  netxen_nic.h |   47
> --- netxen_nic_ethtool.c |8
> 
>  netxen_nic_hw.c  |   10 +-
>  netxen_nic_init.c|   23 ---
>  4 files changed, 45 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/net/netxen/netxen_nic.h
> b/drivers/net/netxen/netxen_nic.h index dd8ce35..d5f0c06 100644
> --- a/drivers/net/netxen/netxen_nic.h
> +++ b/drivers/net/netxen/netxen_nic.h
> @@ -68,9 +68,10 @@
>  #define _NETXEN_NIC_LINUX_SUBVERSION 3
>  #define NETXEN_NIC_LINUX_VERSIONID  "3.3.3"
>
> -#define NUM_FLASH_SECTORS (64)
> -#define FLASH_SECTOR_SIZE (64 * 1024)
> -#define FLASH_TOTAL_SIZE  (NUM_FLASH_SECTORS * FLASH_SECTOR_SIZE)
> +#define NETXEN_NUM_FLASH_SECTORS (64)
> +#define NETXEN_FLASH_SECTOR_SIZE (64 * 1024)
> +#define NETXEN_FLASH_TOTAL_SIZE  (NETXEN_NUM_FLASH_SECTORS \
> + * NETXEN_FLASH_SECTOR_SIZE)
>
>  #define PHAN_VENDOR_ID 0x4040
>
> @@ -671,28 +672,28 @@ struct netxen_new_user_info {
>
>  /* Flash memory map */
>  typedef enum {
> - CRBINIT_START = 0,  /* Crbinit section */
> - BRDCFG_START = 0x4000,  /* board config */
> - INITCODE_START = 0x6000,/* pegtune code */
> - BOOTLD_START = 0x1, /* bootld */
> - IMAGE_START = 0x43000,  /* compressed image */
> - SECONDARY_START = 0x20, /* backup images */
> - PXE_START = 0x3E,   /* user defined region */
> - USER_START = 0x3E8000,  /* User defined region for new boards */
> - FIXED_START = 0x3F  /* backup of crbinit */
> + NETXEN_CRBINIT_START = 0,   /* Crbinit section */
> + NETXEN_BRDCFG_START = 0x4000,   /* board config */
> + NETXEN_INITCODE_START = 0x6000, /* pegtune code */
> + NETXEN_BOOTLD_START = 0x1,  /* bootld */
> + NETXEN_IMAGE_START = 0x43000,   /* compressed image */
> + NETXEN_SECONDARY_START = 0x20,  /* backup images */
> + NETXEN_PXE_START = 0x3E,/* user defined region */
> + NETXEN_USER_START = 0x3E8000,   /* User defined region for new boards */
> + NETXEN_FIXED_START = 0x3F   /* backup of crbinit */
>  } netxen_flash_map_t;
>
> -#define USER_START_OLD PXE_START /* for backward compatibility */
> -
> -#define FLASH_START  (CRBINIT_START)
> -#define INIT_SECTOR  (0)
> -#define PRIMARY_START(BOOTLD_START)
> -#define FLASH_CRBINIT_SIZE   (0x4000)
> -#define FLASH_BRDCFG_SIZE(sizeof(struct netxen_board_info))
> -#define FLASH_USER_SIZE  (sizeof(struct 
> netxen_user_info)/sizeof(u32))
> -#define FLASH_SECONDARY_SIZE (USER_START-SECONDARY_START)
> -#define NUM_PRIMARY_SECTORS  (0x20)
> -#define NUM_CONFIG_SECTORS   (1)
> +#define NETXEN_USER_START_OLD NETXEN_PXE_START   /* for backward
> compatibility */ +
> +#define NETXEN_FLASH_START   (NETXEN_CRBINIT_START)
> +#define NETXEN_INIT_SECTOR   (0)
> +#define NETXEN_PRIMARY_START (NETXEN_BOOTLD_START)
> +#define NETXEN_FLASH_CRBINIT_SIZE(0x4000)
> +#define NETXEN_FLASH_BRDCFG_SIZE (sizeof(struct netxen_board_info))
> +#define NETXEN_FLASH_USER_SIZE   (sizeof(struct
> netxen_user_info)/sizeof(u32)) +#define NETXEN_FLASH_SECONDARY_SIZE
>   (NETXEN_USER_START-NETXEN_SECONDARY_START) +#define
> NETXEN_NUM_PRIMARY_SECTORS(0x20)
> +#define NETXEN_NUM_CONFIG_SECTORS(1)
>  #define PFX "NetXen: "
>  extern char netxen_nic_driver_name[];
>
> diff --git a/drivers/net/netxen/netxen_nic_ethtool.c
> b/drivers/net/netxen/netxen_nic_ethtool.c index ee1b5a2..4dfa76b 100644
> --- a/drivers/net/

[GENETLINK] get pid from userspace

2007-06-06 Thread Ariane Keller


Hi,

I have a user space program which "connects" to different kernel 
modules with generic netlink. Each module provides its own generic 
netlink family.


For each module to "connect" to I create a socket and bind it:
fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
struct sockaddr_nl addr;
memset(&addr, 0, sizeof(addr));
bind(fd, (struct sockaddr *)&addr, sizeof(addr));

As far as I understand, this lets the generic netlink controller choose 
a unique pid, which shows up in the kernel module in the callback 
function as genl_info.snd_pid.


Is it possible to get this pid in userspace (without sending a message 
to the kernel module)?


Thanks!

Ariane




-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)

2007-06-06 Thread Milton Miller



On Jun 5, 2007, at 9:26 PM, Kok, Auke wrote:


Kok, Auke wrote:


Hmm git-revert seems to do the job right. I checked it with git-show 
| patch -p1 -R and the results look OK. The two patches on top of the 
one we want to revert are unrelated enough to apply (manually it 
shows some fuzz, but otherwise it's OK).
Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` to 
revert the following patch for now:

---
commit  d52df4a35af569071fda3f4eb08e47cc7023f094
Author: Scott Feldman <[EMAIL PROTECTED]>
Date:   Wed Nov 9 02:18:52 2005 -0500
 [netdrvr e100] experiment with doing RX in a similar manner to 
eepro100
 I was going to say that eepro100's speedo_rx_link() does the 
same DMA
 abuse as e100, but then I noticed one little detail: eepro100 
sets  both
 EL (end of list) and S (suspend) bits in the RFD as it chains it 
 to the
 RFD list.  e100 was only setting the EL bit.  Hmmm, that's  
interesting.
 That means that if HW reads a RFD with the S-bit set,  it'll 
process
 that RFD and then suspend the receive unit.  The  receive unit 
will
 resume when SW clears the S-bit.  There is no need  for SW to 
restart
 the receive unit.  Which means a lot of the receive  unit state 
tracking

 code in the driver goes away.
 So here's a patch against 2.6.14.  (Sorry for inlining it; the 
mailer
 I'm using now will mess with the word wrap).  I can't test this 
on
 XScale (unless someone has an e100 module for Gumstix :) .  It 
should
 be doing exactly what eepro100 does with RFDs.  I don't believe 
this
 change will introduce a performance hit because the S-bit and 
EL-bit  go
 hand-in-hand meaning if we're going to suspend because of the S- 
bit,
 we're on the last resource anyway, so we'll have to wait for SW  
to

 replenish.
 (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 
commit)

---


A little bit more is needed to explain why we're reverting it for now. 
Jeff, please insert this into the revert commit.


Auke

--
This patch attempted to fix e100 for non-cache coherent memory 
architectures by using the cb style code that eepro100 had and using 
the EL and s bits from the RFD list. Unfortunately the hardware 
doesn't work exactly like this and therefore this patch actually 
breaks e100 on those systems.


on all systems.  (Both the &| typo and the removed restart logic).

Reverting the change brings it back to the previously known good state 
for 2.6.22. The pending rewrite in progress to this code can then be 
safely merged later.


Signed-off-by: Auke Kok <[EMAIL PROTECTED]>
---


milton

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fix race in AF_UNIX

2007-06-06 Thread David Miller

From: Miklos Szeredi <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 10:08:29 +0200

> > > > Holding a global mutex over recvmsg() calls under AF_UNIX is pretty
> > > > much a non-starter, this will kill performance for multi-threaded
> > > > apps.
> > > 
> > > That's an rwsem held for read.  It's held for write in unix_gc() only
> > > for a short duration, and unix_gc() should only rarely be called.  So
> > > I don't think there's any performance problem here.
> > 
> > It pulls a non-local cacheline into the local thread, that's extremely
> > expensive on SMP.
> 
> OK, here's an updated patch, that uses ->readlock, and passes my
> testing.

Thanks a lot, I'll review this as soon as possible unless
someone else beats me to it :)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fix race in AF_UNIX

2007-06-06 Thread Miklos Szeredi

> > > Holding a global mutex over recvmsg() calls under AF_UNIX is pretty
> > > much a non-starter, this will kill performance for multi-threaded
> > > apps.
> > 
> > That's an rwsem held for read.  It's held for write in unix_gc() only
> > for a short duration, and unix_gc() should only rarely be called.  So
> > I don't think there's any performance problem here.
> 
> It pulls a non-local cacheline into the local thread, that's extremely
> expensive on SMP.

OK, here's an updated patch, that uses ->readlock, and passes my
testing.


From: Miklos Szeredi <[EMAIL PROTECTED]>

There are races involving the garbage collector, that can throw away
perfectly good packets with AF_UNIX sockets in them.

The problems arise when a socket goes from installed to in-flight or
vice versa during garbage collection.  Since gc is done with a
spinlock held, this only shows up on SMP.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux-2.6.22-rc2/net/unix/garbage.c
===
--- linux-2.6.22-rc2.orig/net/unix/garbage.c2007-06-03 23:58:11.0 
+0200
+++ linux-2.6.22-rc2/net/unix/garbage.c 2007-06-06 09:48:36.0 +0200
@@ -186,7 +186,21 @@ void unix_gc(void)
 
forall_unix_sockets(i, s)
{
-   unix_sk(s)->gc_tree = GC_ORPHAN;
+   struct unix_sock *u = unix_sk(s);
+
+   u->gc_tree = GC_ORPHAN;
+
+   /*
+* Hold ->readlock to protect against sockets going from
+* in-flight to installed
+*
+* Can't sleep on this, because
+*   a) we are under spinlock
+*   b) skb_recv_datagram() could be waiting for a packet that
+*  is to be sent by this thread
+*/
+   if (!mutex_trylock(&u->readlock))
+   goto lock_failed;
}
/*
 *  Everything is now marked
@@ -207,8 +221,6 @@ void unix_gc(void)
 
forall_unix_sockets(i, s)
{
-   int open_count = 0;
-
/*
 *  If all instances of the descriptor are not
 *  in flight we are in use.
@@ -218,10 +230,20 @@ void unix_gc(void)
 *  In this case (see unix_create1()) we set artificial
 *  negative inflight counter to close race window.
 *  It is trick of course and dirty one.
+*
+*  Get the inflight counter first, then the open
+*  counter.  This avoids problems if racing with
+*  sendmsg
+*
+*  If just created socket is not yet attached to
+*  a file descriptor, assume open_count of 1
 */
+   int inflight_count = atomic_read(&unix_sk(s)->inflight);
+   int open_count = 1;
+
if (s->sk_socket && s->sk_socket->file)
open_count = file_count(s->sk_socket->file);
-   if (open_count > atomic_read(&unix_sk(s)->inflight))
+   if (open_count > inflight_count)
maybe_unmark_and_push(s);
}
 
@@ -300,6 +322,7 @@ void unix_gc(void)
spin_unlock(&s->sk_receive_queue.lock);
}
u->gc_tree = GC_ORPHAN;
+   mutex_unlock(&u->readlock);
}
spin_unlock(&unix_table_lock);
 
@@ -309,4 +332,19 @@ void unix_gc(void)
 
__skb_queue_purge(&hitlist);
mutex_unlock(&unix_gc_sem);
+   return;
+
+ lock_failed:
+   {
+   struct sock *s1;
+   forall_unix_sockets(i, s1) {
+   if (s1 == s) {
+   spin_unlock(&unix_table_lock);
+   mutex_unlock(&unix_gc_sem);
+   return;
+   }
+   mutex_unlock(&unix_sk(s1)->readlock);
+   }
+   BUG();
+   }
 }

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: sending IPv6 packets via kern_sendmsg

2007-06-06 Thread Roar Bjørgum Rotvik

Anton wrote:
> Hi,
> 
> Following on from a previous (now lost :-)) posting, I have been trying 
> to send out IPv6 packets from a kernel module using the kern_sendmsg() 
> function. Since in theory I need this function to be started in 
> interrupt context (actually, softirq context), but since this is 
> impossible because kern_sendmsg() needs to sleep, I have created a work 
> queue which calls the kern_sendmsg() function separately. The work queue 
> is scheduled from the softirq context (actually, at the moment this 
> happens from a Netfilter hook). The work queue function creates a msghdr 
> structure, fills in a sin6_addr structure, calls sock_create_kern() and 
> then uses this socket to send an IPv6 packet, which consists of a header 
> (struct ipv6hdr *iphdr) and some data following on from this. The above 
> packet is placed in the msghdr structure by setting (after the 
> appropriate initializations):
> msg.msg_iov->iov_base = (char *) ip6hdr;
> msg.msg_iov->iov_len = sizeof( struct ipv6hdr + ntohs( 
> ip6hdr->payload_len ) );

Shouldn't this be:
msg.msg_iov->iov_len = sizeof (struct ipv6hdr) + ntohs (ip6hdr->payload_len);

Seems you use sizeof (nthos (...)), and that seems wrong.

-- 
Roar B. Rotvik
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH][RFC] network splice receive

2007-06-06 Thread Jens Axboe

On Tue, Jun 05 2007, Evgeniy Polyakov wrote:
> On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
> wrote:
> > Here's an implementation of tcp network splice receive support. It's
> > originally based on the patch set that Intel posted some time ago, but
> > has been (close to) 100% reworked.
> > 
> > Now, I'm not a networking guru by any stretch of the imagination, so I'd
> > like some input on the direction of the main patch. Is the approach
> > feasible? Glaring errors? Missing bits?
> 
>   263.709262] [ cut here ]
>   [  263.713932] kernel BUG at include/linux/mm.h:285!
>   [  263.718678] invalid opcode:  [1] PREEMPT SMP 
>   [  263.723561] CPU 0 
>   [  263.725665] Modules linked in: button loop snd_intel8x0
>   snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore psmouse
>   snd_page_alloc k8temp i2c_nforcen
>   [  263.755666] Pid: 2709, comm: splice-fromnet Not tainted
>   2.6.22-rc4-splice #2
>   [  263.762759] RIP: 0010:[]  []
>   skb_splice_bits+0xac/0x1c9
>   [  263.771212] RSP: 0018:81003c79fc88  EFLAGS: 00010246
>   [  263.776564] RAX:  RBX: 05a8 RCX:
>   81003ff04778
>   [  263.783743] RDX: 81003ff04778 RSI: 0ab2 RDI:
>   0003d52d
>   [  263.790925] RBP: 81003c79fdd8 R08:  R09:
>   81003d927b78
>   [  263.798104] R10: 803b0181 R11: 81003c79fde8 R12:
>   81003d52d000
>   [  263.805284] R13: 054e R14: 81003d927b78 R15:
>   81003bbc6ea0
>   [  263.812463] FS:  2ac4089a86d0() GS:804fb000()
>   knlGS:
>   [  263.820611] CS:  0010 DS:  ES:  CR0: 8005003b
>   [  263.826396] CR2: 2ac4088320e0 CR3: 3c987000 CR4:
>   06e0
>   [  263.833578] Process splice-fromnet (pid: 2709, threadinfo
>   81003c79e000, task 81003755c380)
>   [  263.842591] Stack:  81003ff04720 
>   81003755c380 0046
>   [  263.850897]  00c6 0046 81003b0428b8
>   81003d0b5b10
>   [  263.858543]  00c6 81003d0b5b10 81003b0428b8
>   81003d0b5b10
>   [  263.865957] Call Trace:
>   [  263.868683]  [] _read_unlock_irq+0x31/0x4e
>   [  263.874393]  [] tcp_splice_data_recv+0x20/0x22
>   [  263.880447]  [] tcp_read_sock+0xa2/0x1ab
>   [  263.885983]  [] tcp_splice_data_recv+0x0/0x22
>   [  263.891951]  [] tcp_splice_read+0xae/0x1a3
>   [  263.897655]  [] sock_def_readable+0x0/0x6f
>   [  263.903366]  [] sock_splice_read+0x15/0x17
>   [  263.909072]  [] do_splice_to+0x76/0x88
>   [  263.914432]  [] sys_splice+0x1a8/0x232
>   [  263.919795]  [] system_call+0x7e/0x83
>   [  263.925067] 
>   [  263.926606] 
>   [  263.926607] Code: 0f 0b eb fe 44 89 e6 81 e6 ff 0f 00 00 90 ff 42
>   08 48 63 55 
>   [  263.936418] RIP  [] skb_splice_bits+0xac/0x1c9
>   [  263.942516]  RSP 
> 
> This a vm_bug_on in get_page().
> 
> > +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page 
> > *page,
> > +   unsigned int len, unsigned int offset)
> > +{
> > +   struct page *p;
> > +
> > +   if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> > +   return 1;
> > +
> > +#ifdef NET_COPY_SPLICE
> > +   p = alloc_pages(GFP_KERNEL, 0);
> > +   if (!p)
> > +   return 1;
> > +
> > +   memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > +#else
> > +   p = page;
> > +   get_page(p);
> > +#endif
> 
> Some pages have zero reference counter here.
> 
> Is commented NET_COPY_SPLICE part from old implementation?
> It will be always slower than existing approach due to allocation
> overhead.

NET_COPY_SPLICE is not supposed to be enabled, it's just a demonstration
of a copy operation if you wanted to test functionality without just
linking in the skb pages. At least that would allow you to test
correctness of the rest of the code, since I don't know if the skb page
linking is entirely correct yet.

But it's somewhat annoying to get pages with zero reference counts
there, I wonder how that happens. I guess if the skb->data originated
from kmalloc() then we don't really know. The main intent there was just
to ensure the page wasn't going away, but clearly it's not good enough
to ensure that reuse isn't taking place.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: warnings in git-wireless

2007-06-06 Thread Christoph Hellwig

On Tue, Jun 05, 2007 at 01:12:03PM -0700, James Ketrenos wrote:
> Yes, we certainly don't want a driver to be "near mainline" that does 
> things that the rest of the kernel and other drivers are doing.  We should 
> force them to stay out-of-tree until any and everything is resolved.  
> Heaven forbid that the code should be merged, contributed, and improved 
> upon as a community.
> 
> You'd think these guys were trying to enable folks with a driver so they 
> can use the hardware they bought or something.  Its almost laughable.  What 
> idiots.

Maybe you should make sure your companies publishes specs or useable drivers
earlier, and you should stop writing crap code instead of these hypocritical 
comments.  We're a big project and we need to be up to some standards before
we can put code in that we'll have to maintain then.

> 
> James
> -
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
---end quoted text---
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH][RFC] network splice receive

2007-06-06 Thread Jens Axboe

On Tue, Jun 05 2007, jamal wrote:
> On Tue, 2007-05-06 at 14:20 +0200, Jens Axboe wrote:
> 
> > > 
> > >  1800 4ff3 937f e000 6381 7275 0008
> > > 
> > > Perhaps that hex pattern rings a bell with someone intimate with the
> > > networking. The remaining wrong bytes don't seem to have anything in
> > > common.
> > 
> > Ok, the source mac address is 00:18:F3:4F:7F:93 and the destination is
> > 00:E0:81:63:75:72 which are the middle 12 bytes of the 16.
> > 
> 
> It appears you may have endianness issues and perhaps a 16 bit
> mis-offset. the 0008 at the end looks like a swapped 0x800 which
> is the ethernet type for IPV4.

Yeah, it looks like the first part of the ethernet frame. I'm using an
e1000 on the receive side and a sky2 on the sender.

> > Hope that helps someone clue me in as to which network part is reusing
> > the data. Do I need to 'pin' the sk_buff until the pipe data has been
> > consumed?
> 
> I would worry about the driver level first - likely thats where your
> corruption is.

I'd just be a heck-of-a-lot more inclined to suspect my code! Hence I
thought that data reuse for perhaps another packet would be the likely
explanation, especially since it looked like valid network driver data
ending up where it should not.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

typo in via-velocity.c

2007-06-06 Thread Dave Jones

http://bugzilla.kernel.org/show_bug.cgi?id=8160

Signed-off-by: Dave Jones <[EMAIL PROTECTED]>

diff --git a/drivers/net/via-velocity.c b/drivers/net/via-velocity.c
index 25b75b6..b670b97 100644
--- a/drivers/net/via-velocity.c
+++ b/drivers/net/via-velocity.c
@@ -1562,7 +1562,7 @@ static void velocity_print_link_status(struct 
velocity_info *vptr)
if (vptr->mii_status & VELOCITY_LINK_FAIL) {
VELOCITY_PRT(MSG_LEVEL_INFO, KERN_NOTICE "%s: failed to detect 
cable link\n", vptr->dev->name);
} else if (vptr->options.spd_dpx == SPD_DPX_AUTO) {
-   VELOCITY_PRT(MSG_LEVEL_INFO, KERN_NOTICE "%s: Link 
autonegation", vptr->dev->name);
+   VELOCITY_PRT(MSG_LEVEL_INFO, KERN_NOTICE "%s: Link 
auto-negotiation", vptr->dev->name);
 
if (vptr->mii_status & VELOCITY_SPEED_1000)
VELOCITY_PRT(MSG_LEVEL_INFO, " speed 1000M bps");

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Remove incorrect comment from hamradio/scc.

2007-06-06 Thread Dave Jones

scc_rxint doesn't call this function at all.
http://bugzilla.kernel.org/show_bug.cgi?id=8146

Signed-off-by: Dave Jones <[EMAIL PROTECTED]>

diff --git a/drivers/net/hamradio/scc.c b/drivers/net/hamradio/scc.c
index 6fdaad5..30bed2a 100644
--- a/drivers/net/hamradio/scc.c
+++ b/drivers/net/hamradio/scc.c
@@ -1610,7 +1610,7 @@ static int scc_net_close(struct net_device *dev)
return 0;
 }
 
-/* > receive frame, called from scc_rxint() < */
+/* > receive frame < */
 
 static void scc_net_rx(struct scc_channel *scc, struct sk_buff *skb)
 {

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

83 matches

Mail list logo