Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].

2006-05-11 Thread Evgeniy Polyakov
On Wed, May 10, 2006 at 12:58:48PM -0700, David S. Miller ([EMAIL PROTECTED]) 
wrote:
 From: Evgeniy Polyakov [EMAIL PROTECTED]
 Date: Mon, 8 May 2006 16:24:22 +0400
 
  I hope he does not take offence at name shortening :)
 
 Perhaps you are still not convinced how truly expensive the code path
 from netif_receive_skb() to the protocol receive processing really is.

That is why UDP was selected - it itself does not cost anything,
ip_rcv() + netif_receive_skb() will be in any channels, but instead of
searching through unified cache with src/port/dst/port/proto we search
through src/port/dst/port + through proto in ip_rcv().
There are no locks there except disabled preemption, those code paths
_never_ showed up in profiles.
Grand unified cache is of course a good idea, but it will not bring new
performance gain to Linux.
It _is_ much more convenient and code path will be shorter, but only
because route/dst lookup will be hidden in unified cache.

Memory copy and context switch were eliminated in net channel, and those
trash the cache much more than the 50 lines of code which access
parts of skb->data.

 It is absolutely necessary to find ways to get rid of these layering
 costs.  Layering is how you design networking protocols, not how you
 implement them.

If I provide a patch which allows marking a special socket as
no-protocol-and-any-upper-layer-lookup, so that it just processes skb->data
(like copying to userspace, or just allowing recv() to return without any
copy), and performance does not differ from what we have with layers,
will that show that abstract cache trashing and the lookup split into
socket/route are not the problem?

Or have you switched from engineering to researching mode? :)

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][SECMARK 03/08] Add xtables SECMARK target

2006-05-11 Thread Patrick McHardy
James Morris wrote:
 On Wed, 10 May 2006, Patrick McHardy wrote:
 
 
The netfilter parts all look fine to me (just one question,
see below). Shall I add the userspace parts to SVN or do you
want to do it yourself?
 
 
 Might be better if you do it, although I'm still looking into one issue at 
 this stage.

Just tell me when you want me to add it.


Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].

2006-05-11 Thread David S. Miller
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Thu, 11 May 2006 10:40:37 +0400

  It is absolutely necessary to find ways to get rid of these layering
  costs.  Layering is how you design networking protocols, not how you
  implement them.
 
 If I provide a patch which allows marking a special socket as
 no-protocol-and-any-upper-layer-lookup, so that it just processes skb->data
 (like copying to userspace, or just allowing recv() to return without any
 copy), and performance does not differ from what we have with layers,
 will that show that abstract cache trashing and the lookup split into
 socket/route are not the problem?
 
 Or have you switched from engineering to researching mode? :)

You test with single socket and single source ID, what do you expect?
Everything is hot in the cache, as expected.

It is not research, I did put cycle counter sampling all over these
spots on sparc64 a long time ago just to familiarize myself with where
cpu spends most of its time in softint processing when there are lots
of sockets and unique remote addresses.

And most of the time from netif_receive_skb() to the meat of
{udp,tcp}_rcv() is touching the routing cache and socket demux hash
tables.  Add bonus costs to netfilter if that is enabled too.  Once
you are past that point, for TCP, tcp_ack() is the primary cpu cycle
eater.

You can test with single stream, but then you are only testing
in-cache case.  Try several thousand sockets and real load from many
unique source systems, it becomes interesting then.

From profiles of heavily used web server, what shows up is bulk of cpu
being in socket demux and tcp_ack().  Next bubble is routing cache.
I have not seen good profiles from a heavy web server employing any
real use of netfilter, that would be interesting as well.


Re: [RFC PATCH 34/35] Add the Xen virtual network device driver.

2006-05-11 Thread Keir Fraser


On 11 May 2006, at 01:33, Herbert Xu wrote:


But if sampling virtual events for randomness is really unsafe (is it
really?) then native guests in Xen would also get bad random numbers
and this would need to be somehow addressed.


Good point.  I wonder what VMWare does in this situation.


Well, there's not much they can do except maybe jitter interrupt 
delivery. I doubt they do that though.


The original complaint in our case was that we take entropy from 
interrupts caused by other local VMs, as well as external sources. 
There was a feeling that the former was more predictable and could form 
the basis of an attack. I have to say I'm unconvinced: I don't really 
see that it's significantly easier to inject precisely-timed interrupts 
into a local VM. Certainly not to better than +/- a few microseconds. 
As long as you add cycle-counter info to the entropy pool, the least 
significant bits of that will always be noise.


The alternatives are unattractive:
 1. We have no good way to distinguish interrupts caused by packets 
from local VMs versus packets from remote hosts. Both get muxed on the 
same virtual interface.
 2. An entropy front/back is tricky -- how do we decide how much 
entropy to pull from domain0? How much should domain0 be prepared to 
give other domains? How easy is it to DoS domain0 by draining its 
entropy pool? Yuk.


 -- Keir



Re: [RFC PATCH 34/35] Add the Xen virtual network device driver.

2006-05-11 Thread Herbert Xu
On Thu, May 11, 2006 at 08:49:04AM +0100, Keir Fraser wrote:
 
 The alternatives are unattractive:
  1. We have no good way to distinguish interrupts caused by packets 
 from local VMs versus packets from remote hosts. Both get muxed on the 
 same virtual interface.
  2. An entropy front/back is tricky -- how do we decide how much 
 entropy to pull from domain0? How much should domain0 be prepared to 
 give other domains? How easy is it to DoS domain0 by draining its 
 entropy pool? Yuk.

IMHO there just isn't enough real entropy to go around in one physical
machine without a proper HRNG.  So either use urandom in all the guests
or for those that really have to use /dev/random, install a hardware
RNG (or wait for it :).

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] bcm43xx: Fix array overrun in bcm43xx_geo_init

2006-05-11 Thread Michael Buesch
On Thursday 11 May 2006 05:42, you wrote:
 Michael Buesch [EMAIL PROTECTED] wrote:
 
  The problem here is that the bcm43xx driver and the ieee80211
  stack do not agree on what channels are possible for 802.11a.
  The ieee80211 stack only wants channels between 34 and 165, while
  the bcm43xx driver accepts anything from 0 to 200. I made the
  bcm43xx driver comply with the ieee80211 stack expectations, by
  using the proper constants.
  
  Signed-off-by: Jean Delvare [EMAIL PROTECTED]
  
  [mb]: Reduce stack usage by kzalloc-ing ieee80211_geo
  
  Signed-off-by: Michael Buesch [EMAIL PROTECTED]
 
 I find this changelog confusing.  We seem to have two patches, one written
 by Jean and one by yourself, perhaps?  And the fact that the changelog
 didn't start with

I simply added one or two lines of code.

 From: Jean Delvare [EMAIL PROTECTED]
 
 indicates that you are to be considered the primary author?

No, I forgot about that line.

 btw, we seem to have a number of bcm43xx patches banking up.  I don't know
 if John has merged them because we're back in the situation where some of
 John's tree has been merged into Jeff's tree but hasn't gone upstream - so
 my git-wireless.patch generates a massive reject storm against 
 git-netdev.patch
 
 So I suspect that all these bcm43xx might not be making it into 2.6.17.

I think most of the patches should already be merged by John. But I did not
recheck this. It would be bad if they don't make it into 2.6.17, though,
as they are all serious bugfixes that prevent hard oopses for people
with special cards that the developers did not have.

-- 
Greetings Michael.


Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].

2006-05-11 Thread Evgeniy Polyakov
On Thu, May 11, 2006 at 12:07:21AM -0700, David S. Miller ([EMAIL PROTECTED]) 
wrote:
 You can test with single stream, but then you are only testing
 in-cache case.  Try several thousand sockets and real load from many
 unique source systems, it becomes interesting then.

Route lookup is _additional_ cost for the system, but, as far as I
understand, netchannels are supposed to help with data processing, not
with destination point selection. It must have route lookups and
socket (netchannel) lookups, they will just be in another place.

I can test system with large number of streams, but unfortunately only
from small number of different src/dst ip addresses, so I can not
benchmark route lookup performance in layered design.

 From profiles of heavily used web server, what shows up is bulk of cpu
 being in socket demux and tcp_ack().  Next bubble is routing cache.
 I have not seen good profiles from a heavy web server employing any
 real use of netfilter, that would be interesting as well.

I have several oprofiles of a static test web server which does 2.5k
requests/sec with about 3000 sockets created/removed per second. All
connections are very tiny.
The machines are on a LAN, so there are no heavy route lookups, but socket lookup is quite
heavy. The most heavyweight network function is tcp_v4_rcv() (number 15),
the next ones are __alloc_skb() (25th place) and __kfree_skb() (35th place).
netif_receive_skb() is at 63, ip_rcv() at 80th place,
tcp_ack() at 99. No *inet_lookup at all.
I do understand that it is a synthetic benchmark, but it is not such a rare
use case.

-- 
Evgeniy Polyakov


Re: [RFC PATCH 34/35] Add the Xen virtual network device driver.

2006-05-11 Thread Andi Kleen
On Thursday 11 May 2006 09:49, Keir Fraser wrote:
 On 11 May 2006, at 01:33, Herbert Xu wrote:
  But if sampling virtual events for randomness is really unsafe (is it
  really?) then native guests in Xen would also get bad random numbers
  and this would need to be somehow addressed.
 
  Good point.  I wonder what VMWare does in this situation.

 Well, there's not much they can do except maybe jitter interrupt
 delivery. I doubt they do that though.

 The original complaint in our case was that we take entropy from
 interrupts caused by other local VMs, as well as external sources.
 There was a feeling that the former was more predictable and could form
 the basis of an attack. I have to say I'm unconvinced: I don't really
 see that it's significantly easier to inject precisely-timed interrupts
 into a local VM. Certainly not to better than +/- a few microseconds.
 As long as you add cycle-counter info to the entropy pool, the least
 significant bits of that will always be noise.

I think I agree - e.g. I would expect the virtual interrupts to have
enough jitter too. Maybe it would be good if someone could
run a few statistics on the resulting numbers?

OK, the randomness added doesn't consist only of the least significant
bits. Currently it adds jiffies + the full 32 bit cycle count.  I guess if it was
a real problem the code could be changed to leave out the jiffies and
only add maybe an 8 bit word from the low bits. But that would only
help for the para case because the algorithm for native guests
cannot be changed.
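
To make the 8 bit idea above concrete, here is a minimal userspace sketch.
It is not the kernel's random driver code: toy_pool, toy_pool_mix() and
low_bits_sample() are hypothetical names, and the rotate-and-xor mixing is
only a stand-in for a real pool function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy pool; a real entropy pool is far larger and
 * uses a proper mixing function. */
struct toy_pool {
	uint32_t state;
};

/* Keep only the 8 least significant bits of a cycle-counter reading. */
static uint8_t low_bits_sample(uint64_t cycles)
{
	return (uint8_t)(cycles & 0xff);
}

/* Rotate the pool left by 5 bits and xor the sample in.
 * Illustrative only, not cryptographically meaningful. */
static void toy_pool_mix(struct toy_pool *p, uint8_t sample)
{
	p->state = (p->state << 5) | (p->state >> 27);
	p->state ^= sample;
}
```

Only the low byte of the counter reaches the pool here; jiffies and the
upper counter bits are left out, as proposed above.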

   2. An entropy front/back is tricky -- how do we decide how much
 entropy to pull from domain0? How much should domain0 be prepared to
 give other domains? How easy is it to DoS domain0 by draining its
 entropy pool? Yuk.

I claim (without having read any code) that in theory you need to have solved 
that problem already in the vTPM @)

-Andi


Re: [PATCH wireless-dev] d80211: Add support for user space clientMLME

2006-05-11 Thread Johannes Berg
On Wed, 2006-05-10 at 10:17 -0700, Jouni Malinen wrote:

 This is still somewhat open, but at minimum, there needs to be a
 mechanism for receiving and sending management frames from user space.
 d80211 uses a management netdev for this currently (the same one that
 was used before with hostapd for AP mode). In addition to that,
 wpa_supplicant expects to be able to read the list of supported channels and TX
 rates (get_hw_feature_data handler in driver wrapper) and to be able to
 add and remove STA entries (e.g., for TX rate control and association
 status validation in kernel code).

Right.

 Currently, scanning is done simply by setting the channel with
 SIOCSIWFREQ and listening for management frames. However, the goal is to
 provide simple atomic operations that user space programs can request
 the kernel code to do. This would cover not only scanning, but also some
 other needs like IEEE 802.11k radio measurements. These operations could
 be something like stop transmit, move to channel 5, report received
 management frames, record noise level, do this for 5 ms, return to
 operational channel, enable transmit.

Yeah, we talked about that. I suppose softmac will never support the
current interface, we'll just design a new one and use that.

johannes




Re: IPv6 connect() from site-local to global IPv6 address.

2006-05-11 Thread Kazunori Miyazawa

Hi,

I lost some mails on the list because of my network trouble.
This might not be a correct thread to reply. Sorry.

Anyway, I traced the problem.

My test environment is as follows:

The host in my network could reach the global network via NAT-T on IPv4.
But it could not reach the global network on IPv6 because the router did not have
a default route.
The radvd on the router advertised a global scope prefix to the hosts.
The hosts accordingly had a global scope address.
The DNS server returned both AAAA and A records for the target host (server).

The router and the host both run Ubuntu 5.10 with kernel 2.6.16.9.

I tested with these steps:

1. I ran dig to check that the host could get the AAAA address.
2. I ran ping6 to check that the host received a destination unreachable
   (no route) error from the router.
3. I ran evolution and connected to my server, which has both AAAA and A records.

It fell back to IPv4 and could connect to the server via the IPv4 network.
It also fell back on Time exceeded. My evolution version is 2.4.1.
Of course, when I added an IPv6 default route, it connected via the IPv6 network.

I think it works well as long as ICMPv6 error messages can be received.
I did not test in an environment in which ICMPv6 error messages are filtered.

Best regards,

David Woodhouse wrote:

On Mon, 2006-05-08 at 09:44 -0700, Rick Jones wrote:

Or get the applications fixed no?  Kludging around application bugs 
sounds a bit like the Fram Oil Filter commercial where the mechanic is 
grinning while he says You can pay me now, or you can pay me later. As 
in pay for the slightly more expensive oil filter now, or engine repair 
later.



Well, obviously. That's _why_ I want to deploy IPv6 and get it tested.
But I used to be able to do this without actually breaking the network,
and without being told to _stop_ running radvd because it breaks things.


Other than fixing the applications that only take the first response 
(isn't that a generic application bug going back nearly decades now? 
amazing how things stay the same isn't it) Can you run a caching-only 
name server at the edge that filters-out the IPv6 responses so your 
systems never see Global IPV6 responses?



I don't think that kind of answer is going to be sufficient to persuade
Uli to switch back from favouring IPv4 over IPv6. That's done the trick,
admittedly -- by ensuring that we get _no_ testing of IPv6 unless we run
with IPv6-only networking :)



--
Kazunori Miyazawa


[NET_SCHED]: HFSC: fix thinko in hfsc_adjust_levels()

2006-05-11 Thread Patrick McHardy
[NET_SCHED]: HFSC: fix thinko in hfsc_adjust_levels()

When deleting the last child the level of a class should drop to zero.

Noticed by Andreas Mueller [EMAIL PROTECTED]

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit c75053e281212b5ed3990a0aaced865db7e456d2
tree be2c674d4545ea41200fc5c57d53a03cb0672a93
parent 0e44dc383787b472a7f13564c6bd8a44cc07d408
author Patrick McHardy [EMAIL PROTECTED] Thu, 11 May 2006 10:29:30 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 11 May 2006 10:29:30 +0200

 net/sched/sch_hfsc.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 91132f6..f1c7bd2 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -974,10 +974,10 @@ hfsc_adjust_levels(struct hfsc_class *cl
do {
level = 0;
list_for_each_entry(p, &cl->children, siblings) {
-   if (p->level > level)
-   level = p->level;
+   if (p->level >= level)
+   level = p->level + 1;
}
-   cl->level = level + 1;
+   cl->level = level;
} while ((cl = cl->cl_parent) != NULL);
 }
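
The effect of the fix can be sketched outside the kernel. The following is a
minimal userspace model, assuming a hypothetical struct toy_class with a fixed
child array in place of struct hfsc_class and its sibling list; only the level
computation mirrors the patched loop.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4	/* hypothetical limit, for the model only */

struct toy_class {
	struct toy_class *parent;
	struct toy_class *children[TOY_MAX_CHILDREN];
	int nchildren;
	int level;
};

/* Recompute levels from cl up to the root: a class with no children
 * gets level 0, otherwise one more than its deepest child. */
static void toy_adjust_levels(struct toy_class *cl)
{
	do {
		int level = 0;
		int i;

		for (i = 0; i < cl->nchildren; i++) {
			if (cl->children[i]->level >= level)
				level = cl->children[i]->level + 1;
		}
		cl->level = level;
	} while ((cl = cl->parent) != NULL);
}

static void toy_add_child(struct toy_class *parent, struct toy_class *child)
{
	assert(parent->nchildren < TOY_MAX_CHILDREN);
	child->parent = parent;
	parent->children[parent->nchildren++] = child;
	toy_adjust_levels(parent);
}

/* Detach the most recently added child and fix up the levels. */
static void toy_del_last_child(struct toy_class *parent)
{
	assert(parent->nchildren > 0);
	parent->children[--parent->nchildren]->parent = NULL;
	toy_adjust_levels(parent);
}
```

With the pre-patch arithmetic (maximum child level, then + 1 unconditionally),
a class whose last child was deleted would end up at level 1 instead of 0.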
 


[PATCH] rt2x00: fix oops in config_interface

2006-05-11 Thread Jiri Benc
BSSID passed to config_interface callback is NULL in modes other than STA or
IBSS.

Signed-off-by: Jiri Benc [EMAIL PROTECTED]

---

 drivers/net/wireless/d80211/rt2x00/rt2400pci.c |3 ++-
 drivers/net/wireless/d80211/rt2x00/rt2500pci.c |3 ++-
 drivers/net/wireless/d80211/rt2x00/rt2500usb.c |3 ++-
 drivers/net/wireless/d80211/rt2x00/rt61pci.c   |3 ++-
 drivers/net/wireless/d80211/rt2x00/rt73usb.c   |3 ++-
 5 files changed, 10 insertions(+), 5 deletions(-)

--- dscape.orig/drivers/net/wireless/d80211/rt2x00/rt2400pci.c
+++ dscape/drivers/net/wireless/d80211/rt2x00/rt2400pci.c
@@ -1848,7 +1848,8 @@ rt2400pci_config_interface(struct net_de
if (rt2x00pci->type == IEEE80211_IF_TYPE_MNTR)
return 0;
 
-   rt2400pci_config_bssid(rt2x00pci, conf->bssid);
+   if (conf->bssid)
+   rt2400pci_config_bssid(rt2x00pci, conf->bssid);
 
return 0;
 }
--- dscape.orig/drivers/net/wireless/d80211/rt2x00/rt2500pci.c
+++ dscape/drivers/net/wireless/d80211/rt2x00/rt2500pci.c
@@ -1971,7 +1971,8 @@ rt2500pci_config_interface(struct net_de
if (conf->type == IEEE80211_IF_TYPE_MNTR)
return 0;
 
-   rt2500pci_config_bssid(rt2x00pci, conf->bssid);
+   if (conf->bssid)
+   rt2500pci_config_bssid(rt2x00pci, conf->bssid);
 
return 0;
 }
--- dscape.orig/drivers/net/wireless/d80211/rt2x00/rt2500usb.c
+++ dscape/drivers/net/wireless/d80211/rt2x00/rt2500usb.c
@@ -1636,7 +1636,8 @@ rt2500usb_config_interface(struct net_de
if (conf->type == IEEE80211_IF_TYPE_MNTR)
return 0;
 
-   rt2500usb_config_bssid(rt2x00usb, conf->bssid);
+   if (conf->bssid)
+   rt2500usb_config_bssid(rt2x00usb, conf->bssid);
 
return 0;
 }
--- dscape.orig/drivers/net/wireless/d80211/rt2x00/rt61pci.c
+++ dscape/drivers/net/wireless/d80211/rt2x00/rt61pci.c
@@ -2434,7 +2434,8 @@ rt61pci_config_interface(struct net_devi
if (conf->type == IEEE80211_IF_TYPE_MNTR)
return 0;
 
-   rt61pci_config_bssid(rt2x00pci, conf->bssid);
+   if (conf->bssid)
+   rt61pci_config_bssid(rt2x00pci, conf->bssid);
 
return 0;
 }
--- dscape.orig/drivers/net/wireless/d80211/rt2x00/rt73usb.c
+++ dscape/drivers/net/wireless/d80211/rt2x00/rt73usb.c
@@ -1935,7 +1935,8 @@ rt73usb_config_interface(struct net_devi
if (conf->type == IEEE80211_IF_TYPE_MNTR)
return 0;
 
-   rt73usb_config_bssid(rt2x00usb, conf->bssid);
+   if (conf->bssid)
+   rt73usb_config_bssid(rt2x00usb, conf->bssid);
 
return 0;
 }


-- 
Jiri Benc
SUSE Labs


Re: [PATCH wireless-dev] d80211: Don't discriminate against 802.11b drivers

2006-05-11 Thread Jiri Benc
On Thu, 4 May 2006 22:32:35 -0400, Michael Wu wrote:
 This makes the current hack used to prevent 802.11g cards from scanning with 
 802.11b channels not break scanning in 802.11b drivers.

I think this should be better:

Signed-off-by: Jiri Benc [EMAIL PROTECTED]

---

 net/d80211/ieee80211.c |   14 ++
 net/d80211/ieee80211_sta.c |1 -
 2 files changed, 14 insertions(+), 1 deletion(-)

--- dscape.orig/net/d80211/ieee80211.c
+++ dscape/net/d80211/ieee80211.c
@@ -4014,6 +4014,19 @@ static void ieee80211_precalc_rates(stru
}
 }
 
+static inline void ieee80211_apply_modes(struct ieee80211_hw *hw,
+struct ieee80211_local *local)
+{
+   struct ieee80211_hw_modes *mode;
+   int i;
+
+   local->scan_skip_11b = 0;
+   for (i = 0; i < hw->num_modes; i++) {
+   mode = &hw->modes[i];
+   if (mode->mode == MODE_IEEE80211G)
+   local->scan_skip_11b = 1;
+   }
+}
 
 struct net_device *ieee80211_alloc_hw(size_t priv_data_len,
  void (*setup)(struct net_device *))
@@ -4258,6 +4271,7 @@ int ieee80211_update_hw(struct net_devic
return -1;
 
ieee80211_precalc_rates(hw);
+   ieee80211_apply_modes(hw, local);
local->conf.phymode = hw->modes[0].mode;
local->curr_rates = hw->modes[0].rates;
local->num_curr_rates = hw->modes[0].num_rates;
--- dscape.orig/net/d80211/ieee80211_sta.c
+++ dscape/net/d80211/ieee80211_sta.c
@@ -2566,7 +2566,6 @@ int ieee80211_sta_req_scan(struct net_de
memcpy(local->scan_ssid, ssid, ssid_len);
} else
local->scan_ssid_len = 0;
-   local->scan_skip_11b = 1; /* FIX: clear this is 11g is not supported */
local->scan_state = SCAN_SET_CHANNEL;
local->scan_hw_mode_idx = 0;
local->scan_channel_idx = 0;


-- 
Jiri Benc
SUSE Labs


Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].

2006-05-11 Thread Evgeniy Polyakov
On Thu, May 11, 2006 at 12:30:32PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
 On Thu, May 11, 2006 at 12:07:21AM -0700, David S. Miller ([EMAIL PROTECTED]) 
 wrote:
  You can test with single stream, but then you are only testing
  in-cache case.  Try several thousand sockets and real load from many
  unique source systems, it becomes interesting then.

 I can test system with large number of streams, but unfortunately only
 from small number of different src/dst ip addresses, so I can not
 benchmark route lookup performance in layered design.

I've run it with 200 UDP sockets in receive path. There were two load
generator machines with 100 clients on each.
There are no copies of skb->data in recvmsg().
Since I only have a 1Gb link I'm unable to provide each client with high
bandwidth, so they send 4k chunks.
Performance dropped by half, down to 55 MB/sec, and CPU usage increased noticeably
(slow drift from 12 to 8% compared to 2% with one socket),
but I believe it is not because of cache effects,
but due to the highly increased number of syscalls per second.

Here is profile result:
1463625  78.0003  poll_idle
19171 1.0217  _spin_lock_irqsave
15887 0.8467  _read_lock
14712 0.7840  kfree
13370 0.7125  ip_frag_queue
11896 0.6340  delay_pmtmr
11811 0.6294  _spin_lock
11723 0.6247  csum_partial
11399 0.6075  ip_frag_destroy
11063 0.5896  serial_in
10533 0.5613  skb_release_data
10524 0.5609  ip_route_input
10319 0.5499  __alloc_skb
9903  0.5278  ip_defrag
9889  0.5270  _read_unlock
9536  0.5082  _write_unlock
8639  0.4604  _write_lock
7557  0.4027  netif_receive_skb
6748  0.3596  ip_frag_intern
6534  0.3482  preempt_schedule
6220  0.3315  __kmalloc
6005  0.3200  schedule
5924  0.3157  irq_entries_start
5823  0.3103  _spin_unlock_irqrestore
5678  0.3026  ip_rcv
5410  0.2883  __kfree_skb
5056  0.2694  kmem_cache_alloc
5014  0.2672  kfree_skb
4900  0.2611  eth_type_trans
4067  0.2167  kmem_cache_free
3532  0.1882  udp_recvmsg
3531  0.1882  ip_frag_reasm
3331  0.1775  _read_lock_irqsave
3327  0.1773  ipq_kill
3304  0.1761  udp_v4_lookup_longway

I'm going to resurrect the zero-copy sniffer project [1] and create a special
socket option which would allow inserting pages which contain
skb->data into the process VMA using VM remapping tricks. Unfortunately it
requires TLB flushing and probably there will be no significant
performance/CPU gain, if any, but I think it is the only way to provide
zero-copy receive access for hardware which does not support header split.

Another idea, which I will try, if I understood you correctly, is to create
a unified cache.
I think some interesting results can be obtained from the following
approach: in softint we do not process skb->data at all, but only get
src/dst/sport/dport/protocol numbers (it could require at most two cache lines,
or it is not a fast-path packet (but something like ipsec) and can be processed
as usual)
and create some initial cache based on that data; the skb is then queued into that
initial cache entry and recvmsg() in process context later processes
that entry.
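
The approach above can be sketched in userspace. This is a minimal sketch
under stated assumptions, not netchannel code: struct flow_key, flow_table,
flow_enqueue() and flow_dequeue() are hypothetical names, the hash is an
arbitrary mix, and plain pointers stand in for skbs. flow_enqueue() plays the
softint side, flow_dequeue() the recvmsg() side.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FLOW_BUCKETS 256	/* hypothetical table size, power of two */
#define FLOW_QLEN    8		/* hypothetical per-flow queue depth */

struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t  proto;
};

struct flow_entry {
	struct flow_key key;
	int in_use;
	const void *queue[FLOW_QLEN];	/* stand-in for queued skbs */
	int qlen;
};

static struct flow_entry flow_table[FLOW_BUCKETS];

/* Arbitrary mixing of the five fields into a bucket index. */
static uint32_t flow_hash(const struct flow_key *k)
{
	uint32_t h = k->saddr ^ (k->daddr * 2654435761u);

	h ^= ((uint32_t)k->sport << 16) | k->dport;
	h ^= k->proto;
	h ^= h >> 16;
	return h & (FLOW_BUCKETS - 1);
}

static int flow_key_equal(const struct flow_key *a, const struct flow_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport &&
	       a->proto == b->proto;
}

/* "softint" side: find or create the entry (linear probing) and queue
 * the packet without looking at its payload.
 * Returns NULL if the table or the flow's queue is full. */
static struct flow_entry *flow_enqueue(const struct flow_key *k, const void *pkt)
{
	uint32_t idx = flow_hash(k);
	uint32_t i;

	for (i = 0; i < FLOW_BUCKETS; i++) {
		struct flow_entry *e = &flow_table[(idx + i) & (FLOW_BUCKETS - 1)];

		if (!e->in_use) {
			e->key = *k;
			e->in_use = 1;
			e->qlen = 0;
		} else if (!flow_key_equal(&e->key, k)) {
			continue;
		}
		if (e->qlen == FLOW_QLEN)
			return NULL;
		e->queue[e->qlen++] = pkt;
		return e;
	}
	return NULL;
}

/* "recvmsg()" side: take the oldest queued packet, or NULL if empty. */
static const void *flow_dequeue(struct flow_entry *e)
{
	const void *pkt;

	if (e->qlen == 0)
		return NULL;
	pkt = e->queue[0];
	memmove(&e->queue[0], &e->queue[1],
		(size_t)(e->qlen - 1) * sizeof(e->queue[0]));
	e->qlen--;
	return pkt;
}
```

All protocol processing would then happen after flow_dequeue(), in process
context; the sketch leaves out locking, entry expiry and the slow path.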

Back to the drawing board...
Thanks for discussion.

1. zero-copy sniffer
http://tservice.net.ru/~s0mbre/old/?section=projectsitem=af_tlb

-- 
Evgeniy Polyakov


Re: [RFC PATCH 34/35] Add the Xen virtual network device driver.

2006-05-11 Thread Stephen Hemminger
On Thu, 11 May 2006 11:47:52 +0200
Andi Kleen [EMAIL PROTECTED] wrote:

 On Thursday 11 May 2006 09:49, Keir Fraser wrote:
  On 11 May 2006, at 01:33, Herbert Xu wrote:
   But if sampling virtual events for randomness is really unsafe (is it
   really?) then native guests in Xen would also get bad random numbers
   and this would need to be somehow addressed.
  
   Good point.  I wonder what VMWare does in this situation.
 
  Well, there's not much they can do except maybe jitter interrupt
  delivery. I doubt they do that though.
 
  The original complaint in our case was that we take entropy from
  interrupts caused by other local VMs, as well as external sources.
  There was a feeling that the former was more predictable and could form
  the basis of an attack. I have to say I'm unconvinced: I don't really
  see that it's significantly easier to inject precisely-timed interrupts
  into a local VM. Certainly not to better than +/- a few microseconds.
  As long as you add cycle-counter info to the entropy pool, the least
  significant bits of that will always be noise.
 
 I think I agree - e.g. i would expect the virtual interrupts to have
 enough jitter too. Maybe it would be good if someone could
 run a few statistics on the resulting numbers?
 
 Ok the randomness added doesn't consist only of the least significant
 bits. Currently it adds jiffies+full 32bit cycle count.  I guess if it was
 a real problem the code could be changed to leave out the jiffies and 
 only add maybe a 8 bit word from the low bits. But that would only
 help for the para case because the algorithm for native guests
 cannot be changed.
 
2. An entropy front/back is tricky -- how do we decide how much
  entropy to pull from domain0? How much should domain0 be prepared to
  give other domains? How easy is it to DoS domain0 by draining its
  entropy pool? Yuk.
 
 I claim (without having read any code) that in theory you need to have solved 
 that problem already in the vTPM @)
 

The base question under all this is: how good does an entropy source have
to be? And then: what guarantees do we make about the entropy inputs used
by /dev/random? If we can resolve those, then the virtual environment
answer should fall out.

This is an area where the security tin-foil hat types take over, and it
gets really hard to make a good-enough argument. People have built an expectation
that /dev/random has really strong entropy, good enough to generate long term
keys etc.


[PATCH] neighbour.c, pneigh_get_next() skips published entry

2006-05-11 Thread Jari Takkala
The following patch fixes a problem where output from /proc/net/arp
skips a record when the full output does not fit into the user's read()
buffer.

To reproduce: publish a large number of ARP entries (more than 10
required on my system). Run 'dd if=/proc/net/arp of=arp-1024.out
bs=1024'. View the output, one entry will be missing.

Please review and commit if acceptable.

Signed-off-by: Jari Takkala [EMAIL PROTECTED]

--- linux-2.6.16.15.orig/net/core/neighbour.c   2006-05-09
15:53:30.0 -0400
+++ linux-2.6.16.15/net/core/neighbour.c2006-05-10
16:06:40.0 -0400
@@ -2120,6 +2120,11 @@
struct neigh_seq_state *state = seq->private;
struct neigh_table *tbl = state->tbl;

+   if (pos != NULL && *pos == 1 && (pn->next ||
tbl->phash_buckets[state->bucket])) {
+   --(*pos);
+   return pn;
+   }
+
pn = pn->next;
while (!pn) {
if (++state->bucket > PNEIGH_HASHMASK)



Re: [RFC PATCH 34/35] Add the Xen virtual network device driver.

2006-05-11 Thread Rick Jones

From the peanut gallery...

Can remote TCP ISN's be considered a source of entropy these days?  How 
about checksums?


rick


Re: [RFC PATCH 34/35] Add the Xen virtual network device driver.

2006-05-11 Thread Andi Kleen
On Thursday 11 May 2006 18:48, Rick Jones wrote:
  From the peanut gallery...
 
 Can remote TCP ISN's be considered a source of entropy these days?  How 
 about checksums?

Indirectly - we measure how long it takes to compute them.

-Andi


[PATCH] expose simplified skb_checksum_recalc

2006-05-11 Thread Stephen Hemminger
Many users of skb_checksum_help() are just using it to recalculate
outbound checksum, so why not expose the interface in a more useful
way. Suggested by Ingo Oeser.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- linux-2.6.orig/include/linux/skbuff.h	2006-04-27 11:12:53.0 -0700
+++ linux-2.6/include/linux/skbuff.h	2006-05-11 11:17:39.0 -0700
@@ -1343,6 +1343,24 @@
 		__skb_checksum_complete(skb);
 }
 
+extern int skb_checksum_recalc(struct sk_buff *skb);
+/**
+ * skb_checksum_help - recalculate checksum of packet
+ * @skb: packet to process
+ * @inward: direction of flow, nonzero is receiving
+ *
+ * Invalidate hardware checksum when packet is to be mangled on
+ * receive and complete checksum manually on outgoing path.
+ */
+static inline int skb_checksum_help(struct sk_buff *skb, int inward)
+{
+	if (inward) {
+		skb->ip_summed = CHECKSUM_NONE;
+		return 0;
+	}
+	return skb_checksum_recalc(skb);
+}
+
 #ifdef CONFIG_NETFILTER
 static inline void nf_conntrack_put(struct nf_conntrack *nfct)
 {
--- sky2.orig/net/core/dev.c	2006-05-10 10:17:51.0 -0700
+++ sky2/net/core/dev.c 2006-05-11 11:22:27.0 -0700
@@ -1144,39 +1144,6 @@
 EXPORT_SYMBOL(netif_device_attach);
 
 
-/*
- * Invalidate hardware checksum when packet is to be mangled, and
- * complete checksum manually on outgoing path.
- */
-int skb_checksum_help(struct sk_buff *skb, int inward)
-{
-	unsigned int csum;
-	int ret = 0, offset = skb->h.raw - skb->data;
-
-	if (inward) {
-		skb->ip_summed = CHECKSUM_NONE;
-		goto out;
-	}
-
-	if (skb_cloned(skb)) {
-		ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
-		if (ret)
-			goto out;
-	}
-
-	BUG_ON(offset > (int)skb->len);
-	csum = skb_checksum(skb, offset, skb->len-offset, 0);
-
-	offset = skb->tail - skb->h.raw;
-	BUG_ON(offset <= 0);
-	BUG_ON(skb->csum + 2 > offset);
-
-	*(u16*)(skb->h.raw + skb->csum) = csum_fold(csum);
-	skb->ip_summed = CHECKSUM_NONE;
-out:
-	return ret;
-}
-
 /* Take action when hardware reception checksum errors are detected. */
 #ifdef CONFIG_BUG
 void netdev_rx_csum_fault(struct net_device *dev)
@@ -3403,7 +3370,6 @@
 EXPORT_SYMBOL(register_gifconf);
 EXPORT_SYMBOL(register_netdevice);
 EXPORT_SYMBOL(register_netdevice_notifier);
-EXPORT_SYMBOL(skb_checksum_help);
 EXPORT_SYMBOL(synchronize_net);
 EXPORT_SYMBOL(unregister_netdevice);
 EXPORT_SYMBOL(unregister_netdevice_notifier);
--- sky2.orig/net/core/skbuff.c 2006-04-27 11:12:54.0 -0700
+++ sky2/net/core/skbuff.c  2006-05-11 11:23:13.0 -0700
@@ -1334,6 +1334,36 @@
 }
 
 /**
+ * skb_checksum_recalc - force software checksum
+ * @skb: skb to process
+ * Force complete checksum, this is used to force a software checksum
+ * on the outgoing path.
+ */
+int skb_checksum_recalc(struct sk_buff *skb)
+{
+	unsigned int csum;
+	int ret = 0, offset = skb->h.raw - skb->data;
+
+	if (skb_cloned(skb)) {
+		ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+		if (ret)
+			goto out;
+	}
+
+	BUG_ON(offset > (int)skb->len);
+	csum = skb_checksum(skb, offset, skb->len-offset, 0);
+
+	offset = skb->tail - skb->h.raw;
+	BUG_ON(offset <= 0);
+	BUG_ON(skb->csum + 2 > offset);
+
+	*(u16*)(skb->h.raw + skb->csum) = csum_fold(csum);
+	skb->ip_summed = CHECKSUM_NONE;
+out:
+	return ret;
+}
+
+/**
  * skb_dequeue - remove from the head of the queue
  * @list: list to dequeue from
  *
@@ -1854,6 +1884,7 @@
 EXPORT_SYMBOL(pskb_copy);
 EXPORT_SYMBOL(pskb_expand_head);
 EXPORT_SYMBOL(skb_checksum);
+EXPORT_SYMBOL(skb_checksum_recalc);
 EXPORT_SYMBOL(skb_clone);
 EXPORT_SYMBOL(skb_clone_fraglist);
 EXPORT_SYMBOL(skb_copy);
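
For readers unfamiliar with the last step of skb_checksum_recalc(), here is a minimal userspace sketch of the arithmetic it relies on: a 16-bit ones' complement sum that is then folded and complemented. The names inet_sum() and fold16() are hypothetical stand-ins for illustration only, not the kernel's skb_checksum() and csum_fold(), and the sketch assumes a linear, packet-sized buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the buffer as big-endian 16-bit words into a 32-bit accumulator.
 * Hypothetical stand-in for the summing step; packet-sized input only. */
static uint32_t inet_sum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(buf != NULL || len == 0);
	assert(len <= 0x10000);		/* keeps the accumulator from overflowing */
	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)			/* odd trailing byte is padded with zero */
		sum += (uint32_t)buf[len - 1] << 8;
	return sum;
}

/* Fold the carries back into the low 16 bits, then complement */
static uint16_t fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

For the bytes 00 01 f2 03 f4 f5 f6 f7, the running sum is 0x2ddf0 and the folded, complemented result is 0x220d.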


Re: [PATCH] expose simplified skb_checksum_recalc

2006-05-11 Thread Ingo Oeser
Hi Stephen,

Stephen Hemminger wrote:
> Many users of skb_checksum_help() are just using it to recalculate
> outbound checksum, so why not expose the interface in a more useful
> way. Suggested by Ingo Oeser.

You are damn fast Stephen :-)

That's even better and improves a lot on documentation and
code placement for free.


Thanks & Regards

Ingo Oeser


Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].

2006-05-11 Thread David S. Miller
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Thu, 11 May 2006 20:18:15 +0400

> Here is profile result:
> 1463625  78.0003  poll_idle
>   19171   1.0217  _spin_lock_irqsave
>   15887   0.8467  _read_lock
>   14712   0.7840  kfree
>   13370   0.7125  ip_frag_queue
>   11896   0.6340  delay_pmtmr
>   11811   0.6294  _spin_lock
>   11723   0.6247  csum_partial
>   11399   0.6075  ip_frag_destroy
>   11063   0.5896  serial_in
>   10533   0.5613  skb_release_data
>   10524   0.5609  ip_route_input
>   10319   0.5499  __alloc_skb

Too bad spinlocks are not inlined any longer, this makes oprofile
output so much less useful.

Also, since you test UDP with MTU sized sends, you add fragmentation
into the mix, yet another variable that you won't see with TCP :-)

BTW you make another massively critical error in your analysis of TCP
profiles.

You mention that tcp_v4_rcv() shows up in your profiles and not
__inet_lookup().  This __inet_lookup() is inlined, and thus its cost
shows up as tcp_v4_rcv().  I find such an oversight amazing for someone
as careful about details as you are :-)

I would suggest looking at instruction-level profile hits; it makes
such mistakes in analysis almost impossible :-)


Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].

2006-05-11 Thread Rick Jones

David S. Miller wrote:

> From: Evgeniy Polyakov [EMAIL PROTECTED]
> Date: Thu, 11 May 2006 20:18:15 +0400
>
> > Here is profile result:
> > 1463625  78.0003  poll_idle
> >   19171   1.0217  _spin_lock_irqsave
> >   15887   0.8467  _read_lock
> >   14712   0.7840  kfree
> >   13370   0.7125  ip_frag_queue
> >   11896   0.6340  delay_pmtmr
> >   11811   0.6294  _spin_lock
> >   11723   0.6247  csum_partial
> >   11399   0.6075  ip_frag_destroy
> >   11063   0.5896  serial_in
> >   10533   0.5613  skb_release_data
> >   10524   0.5609  ip_route_input
> >   10319   0.5499  __alloc_skb
>
> Too bad spinlocks are not inlined any longer, this makes oprofile
> output so much less useful.


But it is nice to see how much time is being spent in locking
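
As a rough illustration of that point, the three lock entries in the quoted profile can simply be added up. This is only a sketch: the function names below are made up for the example, and since only the top entries of the profile were posted, the result is a lower bound on lock time.

```c
#include <assert.h>
#include <stddef.h>

/* Percentages copied from the quoted oprofile output */
static const double lock_pct[] = {
	1.0217,		/* _spin_lock_irqsave */
	0.8467,		/* _read_lock */
	0.6294,		/* _spin_lock */
};
static const double idle_pct = 78.0003;	/* poll_idle */

/* Share of all samples spent in the listed lock functions */
static double lock_share_total(void)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < sizeof(lock_pct) / sizeof(lock_pct[0]); i++)
		sum += lock_pct[i];
	return sum;
}

/* The same share, relative to non-idle samples only */
static double lock_share_busy(void)
{
	assert(idle_pct < 100.0);
	return 100.0 * lock_share_total() / (100.0 - idle_pct);
}
```

With the posted numbers this comes to about 2.5% of all samples, or roughly 11% of the non-idle samples.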

rick jones



Re: [PATCH wireless-dev] d80211: Don't discriminate against 802.11b drivers

2006-05-11 Thread Michael Buesch
On Thursday 11 May 2006 17:54, you wrote:
> On Thu, 4 May 2006 22:32:35 -0400, Michael Wu wrote:
> > This makes the current hack used to prevent 802.11g cards from scanning with
> > 802.11b channels not break scanning in 802.11b drivers.
>
> I think this should be better:
>
> Signed-off-by: Jiri Benc [EMAIL PROTECTED]
>
> ---
>
>  net/d80211/ieee80211.c     |   14 ++
>  net/d80211/ieee80211_sta.c |    1 -
>  2 files changed, 14 insertions(+), 1 deletion(-)
>
> --- dscape.orig/net/d80211/ieee80211.c
> +++ dscape/net/d80211/ieee80211.c
> @@ -4014,6 +4014,19 @@ static void ieee80211_precalc_rates(stru
>  	}
>  }
>  
> +static inline void ieee80211_apply_modes(struct ieee80211_hw *hw,
> +					 struct ieee80211_local *local)

Just a minor nitpick, but please remove the inline.
It is a candidate for binary bloat if the function is later called from
another place as well, and modern compilers will inline it anyway if it is
only used once.
Additionally, it is not a hot path.

> +{
> +	struct ieee80211_hw_modes *mode;
> +	int i;
> +
> +	local->scan_skip_11b = 0;
> +	for (i = 0; i < hw->num_modes; i++) {
> +		mode = &hw->modes[i];
> +		if (mode->mode == MODE_IEEE80211G)
> +			local->scan_skip_11b = 1;
> +	}
> +}
>  
>  struct net_device *ieee80211_alloc_hw(size_t priv_data_len,
>  				      void (*setup)(struct net_device *))
> @@ -4258,6 +4271,7 @@ int ieee80211_update_hw(struct net_devic
>  		return -1;
>  
>  	ieee80211_precalc_rates(hw);
> +	ieee80211_apply_modes(hw, local);
>  	local->conf.phymode = hw->modes[0].mode;
>  	local->curr_rates = hw->modes[0].rates;
>  	local->num_curr_rates = hw->modes[0].num_rates;
> --- dscape.orig/net/d80211/ieee80211_sta.c
> +++ dscape/net/d80211/ieee80211_sta.c
> @@ -2566,7 +2566,6 @@ int ieee80211_sta_req_scan(struct net_de
>  		memcpy(local->scan_ssid, ssid, ssid_len);
>  	} else
>  		local->scan_ssid_len = 0;
> -	local->scan_skip_11b = 1; /* FIX: clear this is 11g is not supported */
>  	local->scan_state = SCAN_SET_CHANNEL;
>  	local->scan_hw_mode_idx = 0;
>  	local->scan_channel_idx = 0;
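
To make the suggestion concrete, here is a small userspace sketch of the same loop written as a plain static function. The structures are cut-down, hypothetical stand-ins for the d80211 ones, the mode constants are placeholder values, and the early break is an addition of this sketch rather than part of the posted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder values; the real constants live in the d80211 headers */
enum { MODE_IEEE80211B = 1, MODE_IEEE80211G = 2 };

/* Hypothetical, cut-down stand-ins for the d80211 structures */
struct hw_mode { int mode; };
struct hw { int num_modes; const struct hw_mode *modes; };
struct local_state { int scan_skip_11b; };

/* No "inline" keyword: the compiler may still inline a single-use helper */
static void apply_modes(const struct hw *hw, struct local_state *local)
{
	int i;

	assert(hw != NULL && local != NULL);
	local->scan_skip_11b = 0;
	for (i = 0; i < hw->num_modes; i++) {
		if (hw->modes[i].mode == MODE_IEEE80211G) {
			local->scan_skip_11b = 1;
			break;		/* one 11g mode is enough */
		}
	}
}
```

A device that only advertises 802.11b leaves scan_skip_11b at 0, so its channels are still scanned.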
 
 

-- 
Greetings Michael.


Re: [PATCH wireless-dev] d80211: Don't discriminate against 802.11b drivers

2006-05-11 Thread Michael Wu
On Thursday 11 May 2006 11:54, Jiri Benc wrote:
> On Thu, 4 May 2006 22:32:35 -0400, Michael Wu wrote:
> > This makes the current hack used to prevent 802.11g cards from scanning
> > with 802.11b channels not break scanning in 802.11b drivers.
>
> I think this should be better:

I think this is overkill to fix a hack. IMHO, scan_skip_11b shouldn't exist in
the first place. One alternative would be to modify 802.11g drivers to not
set IEEE80211_CHAN_W_SCAN on 802.11b channels when there are equivalent
802.11g channels. Another would be to set the local->hw_modes bitfield
correctly during driver initialization instead of relying on userspace to set
it, so the existing logic for avoiding 802.11b channels when 802.11g is
supported actually works.

Hmm... that ioctl for changing the hw_modes bitfield doesn't seem too good.
There is no validity checking at all, but then again, the current value of
hw_modes isn't valid to begin with. It seems like hw_modes is more useful for
saying what modes shouldn't be used than saying what modes are supported by
the hardware and should be used.

-Michael Wu


[PATCH] sky2: prevent dual port receiver problems

2006-05-11 Thread Stephen Hemminger
When both ports are receiving simultaneously, the receive logic gets confused
and may pass up a packet before it is full. This causes hangs, and IP will see
lots of garbage packets. There is even the potential for data corruption if 
a later arriving packet DMA's into freed memory. 

It looks like a hardware bug because status arrives for a packet but no
data is there. Until this bug is worked out, block the user from bringing
up both ports at once.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- sky2.orig/drivers/net/sky2.c
+++ sky2/drivers/net/sky2.c
@@ -1020,8 +1020,19 @@ static int sky2_up(struct net_device *de
 	struct sky2_hw *hw = sky2->hw;
 	unsigned port = sky2->port;
 	u32 ramsize, rxspace, imask;
-	int err = -ENOMEM;
+	int err;
+	struct net_device *otherdev = hw->dev[sky2->port^1];
 
+	/* Block bringing up both ports at the same time on a dual port card.
+	 * There is an unfixed bug where receiver gets confused and picks up
+	 * packets out of order. Until this is fixed, prevent data corruption.
+	 */
+	if (otherdev && netif_running(otherdev)) {
+		printk(KERN_INFO PFX "dual port support is disabled.\n");
+		return -EBUSY;
+	}
+
+	err = -ENOMEM;
 	if (netif_msg_ifup(sky2))
 		printk(KERN_INFO PFX "%s: enabling interface\n", dev->name);
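
The guard in the patch hinges on `port ^ 1` selecting the sibling port. A minimal userspace model of that check, with hypothetical structure and function names (this is not the sky2 code itself), might look like this:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, minimal model of a two-port adapter */
struct port_dev { int running; };
struct adapter { struct port_dev *dev[2]; };

/* Refuse to bring up `port` while its sibling is already running.
 * port ^ 1 maps 0 -> 1 and 1 -> 0, i.e. "the other port". */
static int port_up(struct adapter *hw, unsigned port)
{
	struct port_dev *other;

	assert(hw != NULL && port < 2 && hw->dev[port] != NULL);
	other = hw->dev[port ^ 1];
	if (other && other->running)
		return -EBUSY;	/* single-port cards have other == NULL */

	hw->dev[port]->running = 1;
	return 0;
}
```

On a single-port card the sibling slot is NULL, so the check never triggers there.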
 


Re: [PATCH 4/6] myri10ge - First half of the driver

2006-05-11 Thread Brice Goglin
Roland Dreier wrote:
> > +#define myri10ge_pio_copy(to,from,size) __iowrite64_copy(to,from,size/8)
>
> Why do you need this wrapper?  Why not just call __iowrite64_copy()
> without the obfuscation?  Anyone reading the code will just have to
> search back to this define and mentally translate the size back and
> forth all the time.

I know that extra abstraction layers are frowned upon. But in this case I
really think that a name like myri10ge_pio_copy(size) is far less obfuscating
than __iowrite64_copy(size/8).
I will change it if it really matters.
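
For context, the disagreement is about units: __iowrite64_copy() takes a count of 64-bit quantities, while the macro takes a size in bytes. A userspace sketch of a middle ground, a small typed wrapper that documents the conversion, is shown below. copy64() is only a memcpy-based stand-in for __iowrite64_copy(), and pio_copy_bytes() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for __iowrite64_copy(): copies `count` 64-bit
 * quantities (not bytes). A real driver writes to __iomem space instead. */
static void copy64(void *to, const void *from, size_t count)
{
	memcpy(to, from, count * sizeof(uint64_t));
}

/* Hypothetical byte-sized wrapper, written as a function rather than a
 * macro so the unit conversion is type-checked and lives in one place. */
static void pio_copy_bytes(void *to, const void *from, size_t bytes)
{
	assert(bytes % sizeof(uint64_t) == 0);	/* whole 64-bit words only */
	copy64(to, from, bytes / sizeof(uint64_t));
}
```

Whether this is clearer than calling __iowrite64_copy() directly is exactly the judgment call being debated above.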


> > +int myri10ge_hyper_msi_cap_on(struct pci_dev *pdev)
> > +{
> > +	uint8_t cap_off;
> > +	int nbcap = 0;
> > +
> > +	cap_off = PCI_CAPABILITY_LIST - 1;
> > +	/* go through all caps looking for a hypertransport msi mapping */
>
> This looks like something that should be fixed up in the general PCI
> quirk handling rather than in every driver...
>
> > +static int
> > +myri10ge_use_msi(struct pci_dev *pdev)
> > +{
> > +	if (myri10ge_msi == 1 || myri10ge_msi == 0)
> > +		return myri10ge_msi;
> > +
> > +	/*  find root complex for our device */
> > +	while (pdev->bus && pdev->bus->self) {
> > +		pdev = pdev->bus->self;
> > +	}
>
> Similarly looks like generic PCI code (if it's needed at all).  If I
> understand correctly you're trying to check if MSI has a chance at
> working on the system, but a network device driver has no business
> walking up the PCI hierarchy.
   

Right, I will look at moving all this to the core PCI code.


Thanks for all the comments.

Brice



Re: [PATCH 5/6] myri10ge - Second half of the driver

2006-05-11 Thread Brice Goglin
Stephen Hemminger wrote:
> +
> +static int
> +myri10ge_open(struct net_device *dev)
>
> It is preferred to put function declarations on one line.
>
> static int myri10ge_open(struct net_device *dev)

Well, I have seen several threads about this in the archive, with some
people against and some people in favor. I personally like grepping for
the declaration of a function using ^name.
If this coding style is really required, I will follow it.

> I would prefer to just have driver always do NAPI.  It's a 10G driver, it
> really needs to be NAPI to prevent machine starvation.

When TSO is disabled, we see performance drop by about 300 Mb/s when
NAPI is enabled. But we'll probably enable TSO by default (see below), so
we'll probably drop the non-NAPI path.

> +	myri10ge_close(mgp->dev);
> +	status = myri10ge_load_firmware(mgp);
> +	if (status != 0) {
> +		printk(KERN_ERR "myri10ge: %s: failed to load firmware\n",
> +		       mgp->dev->name);
> +		return;
> +	}
> +	myri10ge_open(mgp->dev);
> +}
>
> Watchdog's are a sign of buggy hardware and drivers!

Well... the watchdog is supposed to help detect memory parity errors
in the NIC. They are rare, but they do happen with cosmic rays. The recovery
part still needs some work anyway. So we might drop the watchdog for now
and come back when recovery is ready.

Additionally, we are using our own watchdog because the Linux netdev
watchdog does not seem to work well for devices with large hardware
transmit queues.  If there is a hardware problem, a single TCP stream (or
even a handful) will not back up into the hardware queue in a
timely fashion, leading to a long delay before the netdev watchdog
routine is called.

> +#if 0
> +	/* TSO can be enabled via ethtool -K eth1 tso on */
> +#ifdef NETIF_F_TSO
> +	netdev->features |= NETIF_F_TSO;
> +#endif
> +#endif
>
> If it works enable it, if it doesn't take the code out.

It works. We did not enable it by default because there were some
problems in older kernels. They seem to be fixed in recent kernels. So
we'll enable TSO by default and have people disable it if it causes
problems.


> [PATCH 3/6] myri10ge - Driver header files
>
> myri10ge driver header files.
> myri10ge_mcp.h is the generic header, while myri10ge_mcp_gen_header.h
> is automatically generated from our firmware image.
>
> Then clean it up after the auto generation.
> Auto generated code still gets maintained by humans.

Oops sorry, I forgot to apply my cleaning script before sending.


> +#define MYRI10GE_MCP_MAJOR  1
> +#define MYRI10GE_MCP_MINOR  4
> +
>
> Major/Minor for what. You don't have a character device.

That's the firmware version, we'll find better names.



Thanks a lot for all the comments.

Brice



Re: [PATCH 4/6] myri10ge - First half of the driver

2006-05-11 Thread Brice Goglin
Francois Romieu wrote:

> +	spin_lock(&mgp->cmd_lock);
> +	response->result = 0x;
> +	mb();
> +	myri10ge_pio_copy((void __iomem *) cmd_addr, buf, sizeof (*buf));
> +
> +	/* wait up to 2 seconds */
>
> You must not hold a spinlock for up to 2 seconds.

We are working on reducing the delay to about 15ms. It only occurs when
the driver is loaded or the link brought up.

> +	for (sleep_total = 0; sleep_total < (2 * 1000); sleep_total += 10) {
> +		mb();
> +		if (response->result != 0x) {
> +			if (response->result == 0) {
> +				data->data0 = ntohl(response->data);
> +				spin_unlock(&mgp->cmd_lock);
> +				return 0;
> +			} else {
> +				dev_err(&mgp->pdev->dev,
> +					"command %d failed, result = %d\n",
> +					cmd, ntohl(response->result));
> +				spin_unlock(&mgp->cmd_lock);
> +				return -ENXIO;
>
> Return in a middle of a spinlock-intensive function. :o(

What do you mean?
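
One possible reading of the remark, offered here only as a guess, is that every early return has to repeat the unlock, which is easy to get wrong when the function changes later. A userspace sketch of the same polling loop with a single unlock at the end follows. The lock is modelled as a counter so that balance can be checked, and all names as well as the NO_RESPONSE marker are hypothetical. It does not address the separate point about how long the lock is held.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NO_RESPONSE 0xffffffffu		/* placeholder "not answered yet" marker */

/* Hypothetical lock model: a counter, so a test can check balance */
struct fake_lock { int depth; };
static void lock(struct fake_lock *l)   { l->depth++; }
static void unlock(struct fake_lock *l) { l->depth--; }

/* Poll *result up to `tries` times. Every path falls through to the one
 * unlock() at the bottom, so the lock is released in exactly one place. */
static int wait_for_response(struct fake_lock *l,
			     const volatile uint32_t *result, int tries)
{
	int ret = -EAGAIN;		/* timed out unless proven otherwise */
	int i;

	assert(l != NULL && result != NULL);
	lock(l);
	for (i = 0; i < tries; i++) {
		if (*result == NO_RESPONSE)
			continue;	/* a driver would delay or sleep here */
		ret = (*result == 0) ? 0 : -ENXIO;
		break;
	}
	unlock(l);
	return ret;
}
```

The success, failure and timeout cases all leave the lock counter at zero, which is the property the single exit is meant to guarantee.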

   
> +{
> +	struct sk_buff *skb;
> +	unsigned long data, roundup;
> +
> +	skb = dev_alloc_skb(bytes + 4096 + MYRI10GE_MCP_ETHER_PAD);
> +	if (skb == NULL)
> +		return NULL;
>
> Imho you will want to work directly with pages shortly.

We had thought about doing this, but were a little nervous since we did
not know of any other drivers that worked directly with pages.  If this
is an official direction to work directly with pages, we will. But the
existing approach is well tested through our beta cycle, and we would
prefer to leave it as is and update to a pages based approach in the
future.


Thanks a lot for all the comments.

Brice



Re: [PATCH 5/6] myri10ge - Second half of the driver

2006-05-11 Thread Herbert Xu
Brice Goglin [EMAIL PROTECTED] wrote:

> > It is preferred to put function declarations on one line.
> >
> > static int myri10ge_open(struct net_device *dev)
>
> Well, I have seen several threads about this in the archive, with some
> people against and some people pro. I personally like grepping for the
> declaration of function using ^name.
> If this codingstyle is really required, I will do.

Yes this is the standard coding style used in Linux so please do.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


skge driver oops

2006-05-11 Thread David Arnold
i've been getting semi-regular lockups on my machine over 2.6.16
series.  I recently attached a serial console in an attempt to capture
an OOPS.

i got one yesterday.  it's copied manually from the console, but
hopefully the values are all accurate.  there was more that had scrolled
off screen above this too (sorry).

oops, lspci, uname -a, .config and dmesg below.

any suggestions for further debugging would be great,

thanks,



d

  skb_over_panic: text:c11a5f4c len:1244 put:1127 head:f7514600 data:f7514602 tail:f7514ade end:f75146c0 dev:eth1
  ---[ cut here ]
  kernel BUG at net/core/skbuff.c:94!
  invalid opcode:  [#1]
  SMP
  Modules linked in: hci_usb i2c_amd756
  CPU: 0
  EIP: 0060:[c126f0b3] Not tainted VLI
  EFLAGS: 00210292 (2.6.16.14da #18)
  EIP is at skb_over_panic+0x63/0x70
  eax: 83
  ebx: f7d96ac0
  ecx: c13fbe64
  edx: 200296
  esi: 0
  edi: 6b560467
  ebp: f71d4810
  esp: c14aff44
  ds: 7b
  es: 7b
  ss: 68
  Process swapper (pid: 0, threadinfo=c14af000 task=c13f9100)
  Stack: <0>c13d9230 c11a5f4c 04dc 0467 f7514600 f7514602 f7514ade f75146c0
 f7d96800 c11a5f58 f72456e0 0467 c11a5f4c c1d1ae24 c10222c0 0600
 0467 0040 c1e565a0 f7d96ae8  f72456e0 0467cc2b f7d968c0
  Call Trace:
   [c11a5f4c] skge_poll+0x4bc/0x500
   [c11a5f58] skge_poll+0x4c8/0x500
   [c11a5f4c] skge_poll+0x4bc/0x500
   [c10222c0] it_real_fn+0x0/0x60
   [c1274f6f] net_rx_action+0x76/0xe0
   [c1023f26] __do_softirq+0x76/0xe0
   [c10058bb] do_softirq+0x5b/0x60
   ==
   [c1005839] do_IRQ+0x49/0x70
   [c1003a6a] common_interrupt+0x1a/0x20
   [c1001ca0] default_idle+0x0/0x60
   [c1001ccc] default_idle+0x2c/0x60
   [c1001d64] cpu_idle+0x64/0x80
   [c146f50f] start_kernel+0x2df/0x390
   [c146f5c0] unknown_bootoption+0x0/0x260
  Code: 84 00 00 00 89 44 24 10 8b 44 24 2c 89 44 24 2c 89 44 24 0c 8b 41 60 c7 04 24 30 92 3d c1 89 44 24 08 8b 44 24 30 89 44
  <0>Kernel panic - not syncing: Fatal exception in interrupt
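
The numbers in the skb_over_panic line can be checked against each other with a little arithmetic. The sketch below is a userspace illustration with hypothetical names. It only uses the values printed in the report and assumes the usual meaning of the fields: tail is the end of the data after the put, end is the end of the buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the skb_over_panic line of the report */
struct skb_dump {
	uint32_t head, data, tail, end;	/* addresses after the failed put */
	uint32_t len, put;		/* length after the put, bytes requested */
};

static const struct skb_dump report = {
	.head = 0xf7514600, .data = 0xf7514602,
	.tail = 0xf7514ade, .end  = 0xf75146c0,
	.len  = 1244,       .put  = 1127,
};

/* Bytes by which tail ran past end; 0 means the put fitted */
static uint32_t overrun(const struct skb_dump *s)
{
	return s->tail > s->end ? s->tail - s->end : 0;
}

/* Tailroom that was available before the put was attempted */
static uint32_t tailroom_before(const struct skb_dump *s)
{
	uint32_t old_tail = s->tail - s->put;

	assert(old_tail <= s->end);
	return s->end - old_tail;
}
```

With the reported values, the buffer spans 192 bytes (end - head), 73 bytes of tailroom were left, and the 1127-byte put ran 1054 bytes past the end. The dump alone does not say why the length was that large.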
  

[EMAIL PROTECTED]  lspci
:00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11)
:00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge
:00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
:00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04)
:00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
:00:09.0 Ethernet controller: D-Link System Inc Gigabit Ethernet Adapter (rev 11)
:00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05)
:01:05.0 VGA compatible controller: ATI Technologies Inc RV280 [Radeon 9200 PRO] (rev 01)
:01:05.1 Display controller: ATI Technologies Inc: Unknown device 5940 (rev 01)
:02:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-768 [Opus] USB (rev 07)
:02:06.0 Multimedia audio controller: Ensoniq 5880 AudioPCI (rev 02)
:02:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
[EMAIL PROTECTED]  

[EMAIL PROTECTED]  uname -a
Linux d 2.6.16.14da #18 SMP Tue May 9 09:57:15 EST 2006 i686 GNU/Linux
[EMAIL PROTECTED]  

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.16.14
# Tue May  9 09:47:17 2006
#
CONFIG_X86_32=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=da
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
CONFIG_SYSCTL=y
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_CPUSETS is not set
CONFIG_INITRAMFS_SOURCE=
CONFIG_UID16=y
CONFIG_VM86=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
# CONFIG_EMBEDDED is not set
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_CC_ALIGN_FUNCTIONS=0
CONFIG_CC_ALIGN_LABELS=0
CONFIG_CC_ALIGN_LOOPS=0
CONFIG_CC_ALIGN_JUMPS=0
CONFIG_SLAB=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_OBSOLETE_MODPARM=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_LBD=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED=anticipatory

#
# Processor type and features
#
CONFIG_X86_PC=y
# 

Re: [Bugme-new] [Bug 6530] New: MAINLINE

2006-05-11 Thread Paul Mackerras
Andy Gay writes:

> How does the serial driver know it has to call ppp_asynctty_wakeup()?

The serial driver is supposed to call the line discipline's wakeup
function when it has room in the output buffer and the
TTY_DO_WRITE_WAKEUP bit is set in tty->flags.  When the serial port is
set to the ppp line discipline, then it uses the functions defined in
the ppp_ldisc structure in drivers/net/ppp_async.c, and the
write_wakeup field in that structure points to ppp_asynctty_wakeup.
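
The mechanism described above can be modelled in a few lines of userspace C. This is only a sketch of the callback pattern: the structures, the flag bit value and the function names are hypothetical, not the kernel's tty or line discipline definitions.

```c
#include <assert.h>
#include <stddef.h>

#define DO_WRITE_WAKEUP (1u << 0)	/* placeholder bit, not the kernel's value */

struct tty;

/* Hypothetical line discipline: only the wakeup hook is modelled */
struct ldisc { void (*write_wakeup)(struct tty *tty); };

struct tty {
	unsigned flags;
	const struct ldisc *ldisc;
	int wakeups;			/* bookkeeping for the example */
};

/* What a PPP-like discipline would install as its write_wakeup hook */
static void ppp_like_wakeup(struct tty *tty)
{
	tty->wakeups++;
}

/* Called by the (modelled) serial driver once output space is available */
static void driver_tx_room_available(struct tty *tty)
{
	assert(tty != NULL);
	if ((tty->flags & DO_WRITE_WAKEUP) && tty->ldisc &&
	    tty->ldisc->write_wakeup)
		tty->ldisc->write_wakeup(tty);
}
```

If the driver never makes this call, or the flag is not set when room becomes available, the discipline is never told to resume sending, which would look like a stalled link.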

> There were a bunch of changes to the serial drivers between 2.6.15 and
> 2.6.16, maybe that's where this problem was introduced. Do we know which
> serial driver is involved in the original report?

Apparently it's the pty driver.

Paul.


Re: [Bugme-new] [Bug 6530] New: MAINLINE

2006-05-11 Thread Andy Gay
On Fri, 2006-05-12 at 11:59 +1000, Paul Mackerras wrote:
> Andy Gay writes:
>
> > How does the serial driver know it has to call ppp_asynctty_wakeup()?
>
> The serial driver is supposed to call the line discipline's wakeup
> function when it has room in the output buffer and the
> TTY_DO_WRITE_WAKEUP bit is set in tty->flags.  When the serial port is
> set to the ppp line discipline, then it uses the functions defined in
> the ppp_ldisc structure in drivers/net/ppp_async.c, and the
> write_wakeup field in that structure points to ppp_asynctty_wakeup.

OK, thanks for the explanation. I'll pay special attention to that stuff
in my driver!

> > There were a bunch of changes to the serial drivers between 2.6.15 and
> > 2.6.16, maybe that's where this problem was introduced. Do we know which
> > serial driver is involved in the original report?
>
> Apparently it's the pty driver.

So I heard. Hopefully the maintainer of that driver will see this.

> Paul.
