Re: [BUG] stacktrace from skb_checksum_help() and skb_gso_segment()

2006-07-25 Thread Rolf Eike Beer
Am Mittwoch, 26. Juli 2006 08:44 schrieb Evgeniy Polyakov:
> On Wed, Jul 26, 2006 at 08:41:48AM +0200, Rolf Eike Beer 
([EMAIL PROTECTED]) wrote:
> > linux-2.6 git tree from yesterday.
> >
> > Before this the sky2 network driver was working. After a pseudo hotplug
> > of the device it was working again (at least if you receive this mail
> > *g*).
> >
> > What next?
>
> It is not a bug, but remind to update NAT helper function.
> It has nothing with device drivers and is only printed once.

But why was my sky2 down afterwards? There's nothing in the log but this.

Eike


pgpWGCOuw3dEj.pgp
Description: PGP signature


Re: [BUG] stacktrace from skb_checksum_help() and skb_gso_segment()

2006-07-25 Thread Evgeniy Polyakov
On Wed, Jul 26, 2006 at 08:41:48AM +0200, Rolf Eike Beer ([EMAIL PROTECTED]) 
wrote:
> linux-2.6 git tree from yesterday.
> 
> Before this the sky2 network driver was working. After a pseudo hotplug of 
> the 
> device it was working again (at least if you receive this mail *g*).
> 
> What next?

It is not a bug, but remind to update NAT helper function.
It has nothing with device drivers and is only printed once.

> Eike

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[BUG] stacktrace from skb_checksum_help() and skb_gso_segment()

2006-07-25 Thread Rolf Eike Beer
linux-2.6 git tree from yesterday.

Before this the sky2 network driver was working. After a pseudo hotplug of the 
device it was working again (at least if you receive this mail *g*).

What next?

Eike

Jul 26 08:22:51 siso-eb-i34d kernel: BUG: warning 
at /home/beer/repos/linux-2.6/net/core/dev.c:1171/skb_checksum_help()
Jul 26 08:22:51 siso-eb-i34d kernel:  [] 
show_trace_log_lvl+0x54/0xfd
Jul 26 08:22:51 siso-eb-i34d kernel:  [] show_trace+0xd/0x10
Jul 26 08:22:51 siso-eb-i34d kernel:  [] dump_stack+0x17/0x1c
Jul 26 08:22:51 siso-eb-i34d kernel:  [] 
skb_checksum_help+0x55/0x108
Jul 26 08:22:51 siso-eb-i34d kernel:  [] ip_nat_fn+0x43/0x183 
[iptable_nat]
Jul 26 08:22:51 siso-eb-i34d kernel:  [] ip_nat_local_fn+0x36/0xb5 
[iptable_nat]
Jul 26 08:22:51 siso-eb-i34d kernel:  [] nf_iterate+0x2e/0x61
Jul 26 08:22:51 siso-eb-i34d kernel:  [] nf_hook_slow+0x37/0x95
Jul 26 08:22:51 siso-eb-i34d kernel:  [] ip_queue_xmit+0x362/0x3b2
Jul 26 08:22:51 siso-eb-i34d kernel:  [] 
tcp_transmit_skb+0x5f3/0x615
Jul 26 08:22:51 siso-eb-i34d kernel:  [] tcp_push_one+0xb7/0xda
Jul 26 08:22:51 siso-eb-i34d kernel:  [] tcp_sendmsg+0x786/0x990
Jul 26 08:22:51 siso-eb-i34d kernel:  [] inet_sendmsg+0x39/0x46
Jul 26 08:22:51 siso-eb-i34d kernel:  [] do_sock_write+0x93/0x9b
Jul 26 08:22:51 siso-eb-i34d kernel:  [] sock_aio_write+0x56/0x64
Jul 26 08:22:51 siso-eb-i34d kernel:  [] do_sync_write+0xaf/0xe4
Jul 26 08:22:51 siso-eb-i34d kernel:  [] vfs_write+0x94/0xe8
Jul 26 08:22:51 siso-eb-i34d kernel:  [] sys_write+0x3b/0x60
Jul 26 08:22:51 siso-eb-i34d kernel:  [] sysenter_past_esp+0x56/0x8d
Jul 26 08:22:51 siso-eb-i34d kernel: BUG: warning 
at /home/beer/repos/linux-2.6/net/core/dev.c:1225/skb_gso_segment()
Jul 26 08:22:51 siso-eb-i34d kernel:  [] 
show_trace_log_lvl+0x54/0xfd
Jul 26 08:22:51 siso-eb-i34d kernel:  [] show_trace+0xd/0x10
Jul 26 08:22:51 siso-eb-i34d kernel:  [] dump_stack+0x17/0x1c
Jul 26 08:22:51 siso-eb-i34d kernel:  [] skb_gso_segment+0x84/0x171
Jul 26 08:22:51 siso-eb-i34d kernel:  [] 
dev_hard_start_xmit+0x175/0x202
Jul 26 08:22:51 siso-eb-i34d kernel:  [] __qdisc_run+0xde/0x1a5
Jul 26 08:22:51 siso-eb-i34d kernel:  [] dev_queue_xmit+0x13b/0x24c
Jul 26 08:22:51 siso-eb-i34d kernel:  [] 
neigh_resolve_output+0x1cf/0x1fb
Jul 26 08:22:51 siso-eb-i34d kernel:  [] ip_output+0x1c4/0x1ff
Jul 26 08:22:51 siso-eb-i34d kernel:  [] ip_queue_xmit+0x373/0x3b2
Jul 26 08:22:51 siso-eb-i34d kernel:  [] 
tcp_transmit_skb+0x5f3/0x615
Jul 26 08:22:51 siso-eb-i34d kernel:  [] tcp_push_one+0xb7/0xda
Jul 26 08:22:51 siso-eb-i34d kernel:  [] tcp_sendmsg+0x786/0x990
Jul 26 08:22:51 siso-eb-i34d kernel:  [] inet_sendmsg+0x39/0x46
Jul 26 08:22:51 siso-eb-i34d kernel:  [] do_sock_write+0x93/0x9b
Jul 26 08:22:51 siso-eb-i34d kernel:  [] sock_aio_write+0x56/0x64
Jul 26 08:22:51 siso-eb-i34d kernel:  [] do_sync_write+0xaf/0xe4
Jul 26 08:22:51 siso-eb-i34d kernel:  [] vfs_write+0x94/0xe8
Jul 26 08:22:51 siso-eb-i34d kernel:  [] sys_write+0x3b/0x60
Jul 26 08:22:51 siso-eb-i34d kernel:  [] sysenter_past_esp+0x56/0x8d


pgpYKUjf2m2Ju.pgp
Description: PGP signature


Re: async network I/O, event channels, etc

2006-07-25 Thread Evgeniy Polyakov
On Tue, Jul 25, 2006 at 03:01:22PM -0700, David Miller ([EMAIL PROTECTED]) 
wrote:
> From: Ulrich Drepper <[EMAIL PROTECTED]>
> Date: Tue, 25 Jul 2006 12:23:53 -0700
> 
> > I was very much surprised by the reactions I got after my OLS talk.
> > Lots of people declared interest and even agreed with the approach and
> > asked me to do further ahead with all this.  For those who missed it,
> > the paper and the slides are available on my home page:
> > 
> > http://people.redhat.com/drepper/
> > 
> > As for the next steps I see a number of possible ways.  The discussions
> > can be held on the usual mailing lists (i.e., lkml and netdev) but due
> > to the raw nature of the current proposal I would imagine that would be
> > mainly perceived as noise.
> 
> Since I gave a big thumbs up for Evgivny's kevent work yesterday
> on linux-kernel, you might want to start by comparing your work
> to his.  Because his has the advantage that 1) we have code now
> and 2) he has written many test applications and performed many
> benchmarks against his code which has flushed out most of the
> major implementation issues.
> 
> I think most of the people who have encouraged your work are unaware
> of Evgivny's kevent stuff, which is extremely unfortunate, the two
> works are more similar than they are different.
> 
> I do not think discussing all of this on netdev would be perceived
> as noise. :)

Hello David, Ulrich.

Here is brief description of what is kevent and how it works.

Kevent subsystem incorporates several AIO/kqueue design notes and ideas.
Kevent can be used both for edge and level notifications. It supports
socket notifications (accept, send, recv), network AIO (aio_send(),
aio_recv() and aio_sendfile()), inode notifications (create/remove),
generic poll()/select() notifications and timer notifications.

There are several object in the kevent system:
storage - each source of events (socket, inode, timer, aio, anything) has
structure kevent_storage incorporated into it, which is basically a list
of registered interests for this source of events.
user - it is abstraction which holds all requested kevents. It is
similar to FreeBSD's kqueue.
kevent - set of interests for given source of events or storage.

When kevent is queued into storage, it will live there until removed by
kevent_dequeue(). When some activity is noticed in given storage, it
scans it's kevent_storage->list for kevents which match activity event.
If kevents are found and they are not already in the
kevent_user->ready_list, they will be added there at the end.

ioctl(WAIT) (or appropriate syscall) will wait until either requested
number of kevents are ready or timeout elapsed or at least one kevent is
ready, it's behaviour depends on parameters.

It is possible to have one-shot kevents, which are automatically removed
when are ready.

Any event can be added/removed/modified by ioctl or special controlling
syscall.

Network AIO is based on kevent and works as usual kevent storage on top
of inode.
When new socket is created it is associated with that inode and when
some activity is detected appropriate notifications are generated and
kevent_naio_callback() is called.
When new kevent is being registered, network AIO ->enqueue() callback
simply marks itself like usual socket event watcher. It also locks
physical userspace pages in memory and stores appropriate pointers in
private kevent structure. I have not created additional DMA memory
allocation methods, like Ulrich described in his article, so I handle it
inside NAIO which has some overhead (I posted get_user_pages()
sclability graph some time ago).
Network AIO callback gets pointers to userspace pages and tries to copy
data from receiving skb queue into them using protocol specific
callback. This callback is very similar to ->recvmsg(), so they could
share a lot in future (as far as I recall it worked only with hardware
capable to do checksumming, I'm a bit lazy).

Both network and aio implementation work on top of hooks inside
appropriate state machines, but not as repeated call design (curect AIO) 
or special thread (SGI AIO). AIO work was stopped, since I was unable to 
achieve the same speed as synchronous read 
(maximum speeds were 2Gb/sec vs. 2.1 GB/sec for aio and sync IO accordingly
when reading data from the cache).
Network aio_sendfile() works lazily - it asynchronously populates pages
into the VFS cache (which can be used for various tricks with adaptive
readahead) and then uses usual ->sendfile() callback.

I have not created an interface for userspace events (like Solaris), 
since right now I do not see it's usefullness, but if there is
requirements for that it is quite easy with kevents.

I'm preparing set of kevent patches resend (with cleanups mentioned in
previous e-mails), which will be ready in a couple of moments.

1. kevent homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

2. network aio homepage.
http://tservice.net.ru/~s0mbre/

Re: [PATCH] kthread: airo.c

2006-07-25 Thread Andrew Morton
On Mon, 24 Jul 2006 11:13:09 -0700
Sukadev Bhattiprolu <[EMAIL PROTECTED]> wrote:

> Sukadev Bhattiprolu [EMAIL PROTECTED] wrote:
> 
> | Andrew,
> | 
> | Javier Achirica, one of the major contributors to 
> drivers/net/wireless/airo.c
> | took a look at this patch, and doesn't have any problems with it. It doesn't
> | fix any bugs and is just a cleanup, so it certainly isn't a candidate
> | for this mainline cycle
> 
> Here is the same patch, merged up to 2.6.18-rc2. Christoph's patch (see
> http://lkml.org/lkml/2006/7/13/332) still applies cleanly on top of this.
> 
> -
> The airo driver is currently caching a pid for later use, but with the
> implementation of containers, pids themselves do not uniquely identify
> a task. The driver is also using kernel_thread() which is deprecated in
> drivers.
> 
> This patch essentially replaces the kernel_thread() with kthread_create().
> It also stores the task_struct of the airo_thread rather than its pid.
> Since this introduces a second task_struct in struct airo_info, the patch
> renames airo_info.task to airo_info.list_bss_task.
> 
> As an extension of these changes, the patch further:
> 
>  - replaces kill_proc() with kthread_stop()
>  - replaces signal_pending() with kthread_should_stop()
>- removes thread completion synchronisation which is handled by
>  kthread_stop().
> 
> ..
>
> @@ -1736,9 +1736,9 @@ static int readBSSListRid(struct airo_in
>   issuecommand(ai, &cmd, &rsp);
>   up(&ai->sem);
>   /* Let the command take effect */
> - ai->task = current;
> + ai->list_bss_task = current;
>   ssleep(3);
> - ai->task = NULL;
> + ai->list_bss_task = NULL;

This looks a little racy to me.  It's relatively benign - a race will cause
us to sleep for too long.  But it's easy to fix:


--- a/drivers/net/wireless/airo.c~kthread-airoc-race-fix
+++ a/drivers/net/wireless/airo.c
@@ -1733,10 +1733,10 @@ static int readBSSListRid(struct airo_in
cmd.cmd=CMD_LISTBSS;
if (down_interruptible(&ai->sem))
return -ERESTARTSYS;
+   ai->list_bss_task = current;
issuecommand(ai, &cmd, &rsp);
up(&ai->sem);
/* Let the command take effect */
-   ai->list_bss_task = current;
ssleep(3);
ai->list_bss_task = NULL;
}
_



Actually, ssleep() ends up doing

while (timeout)
timeout = schedule_timeout_uninterruptible(timeout);

so if the intent of this code is to terminate the sleep early, when the
interrupt has happened then it isn't working right.  A fix would be to
convert the ssleep(3) into schedule_timeout_uninterruptible(3 * HZ).
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2

2006-07-25 Thread Herbert Xu
On Wed, Jul 26, 2006 at 06:01:40AM +0200, Patrick McHardy wrote:
> 
> Please send it, I'll update my patch based on that. Thanks.

Here it is, it sits on top of

commit ca6bb5d7ab22ac79f608fe6cbc6b12de6a5a19f0
Author: David Woodhouse <[EMAIL PROTECTED]>
Date:   Thu Jun 22 16:07:52 2006 -0700

[NET]: Require CAP_NET_ADMIN to create tuntap devices.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
61a015eb86469404587e910e9b852fc35ce436b8
diff --git a/drivers/atm/he.c b/drivers/atm/he.c
index fde9334..601e7ee 100644
--- a/drivers/atm/he.c
+++ b/drivers/atm/he.c
@@ -1913,7 +1913,7 @@ #endif
skb->tail = skb->data + skb->len;
 #ifdef USE_CHECKSUM_HW
if (vcc->vpi == 0 && vcc->vci >= 
ATM_NOT_RSV_VCI) {
-   skb->ip_summed = CHECKSUM_HW;
+   skb->ip_summed = CHECKSUM_COMPLETE;
skb->csum = TCP_CKSUM(skb->data,
he_vcc->pdu_len);
}
diff --git a/drivers/net/3c59x.c b/drivers/net/3c59x.c
index e277789..15dcd4e 100644
--- a/drivers/net/3c59x.c
+++ b/drivers/net/3c59x.c
@@ -2243,7 +2243,7 @@ boomerang_start_xmit(struct sk_buff *skb
 
vp->tx_ring[entry].next = 0;
 #if DO_ZEROCOPY
-   if (skb->ip_summed != CHECKSUM_HW)
+   if (skb->ip_summed != CHECKSUM_PARTIAL)
vp->tx_ring[entry].status = cpu_to_le32(skb->len | 
TxIntrUploaded);
else
vp->tx_ring[entry].status = cpu_to_le32(skb->len | 
TxIntrUploaded | AddTCPChksum | AddUDPChksum);
diff --git a/drivers/net/8139cp.c b/drivers/net/8139cp.c
index ad0c8c3..4f566d8 100644
--- a/drivers/net/8139cp.c
+++ b/drivers/net/8139cp.c
@@ -809,7 +809,7 @@ #endif
 
if (mss)
flags |= LargeSend | ((mss & MSSMask) << MSSShift);
-   else if (skb->ip_summed == CHECKSUM_HW) {
+   else if (skb->ip_summed == CHECKSUM_PARTIAL) {
const struct iphdr *ip = skb->nh.iph;
if (ip->protocol == IPPROTO_TCP)
flags |= IPCS | TCPCS;
@@ -863,7 +863,7 @@ #endif
if (mss)
ctrl |= LargeSend |
((mss & MSSMask) << MSSShift);
-   else if (skb->ip_summed == CHECKSUM_HW) {
+   else if (skb->ip_summed == CHECKSUM_PARTIAL) {
if (ip->protocol == IPPROTO_TCP)
ctrl |= IPCS | TCPCS;
else if (ip->protocol == IPPROTO_UDP)
@@ -894,7 +894,7 @@ #endif
txd->addr = cpu_to_le64(first_mapping);
wmb();
 
-   if (skb->ip_summed == CHECKSUM_HW) {
+   if (skb->ip_summed == CHECKSUM_PARTIAL) {
if (ip->protocol == IPPROTO_TCP)
txd->opts1 = cpu_to_le32(first_eor | first_len |
 FirstFrag | DescOwn |
diff --git a/drivers/net/acenic.c b/drivers/net/acenic.c
index 23ff22b..3ab0e76 100644
--- a/drivers/net/acenic.c
+++ b/drivers/net/acenic.c
@@ -2041,7 +2041,7 @@ static void ace_rx_int(struct net_device
 */
if (bd_flags & BD_FLG_TCP_UDP_SUM) {
skb->csum = htons(csum);
-   skb->ip_summed = CHECKSUM_HW;
+   skb->ip_summed = CHECKSUM_COMPLETE;
} else {
skb->ip_summed = CHECKSUM_NONE;
}
@@ -2512,7 +2512,7 @@ restart:
 
mapping = ace_map_tx_skb(ap, skb, skb, idx);
flagsize = (skb->len << 16) | (BD_FLG_END);
-   if (skb->ip_summed == CHECKSUM_HW)
+   if (skb->ip_summed == CHECKSUM_PARTIAL)
flagsize |= BD_FLG_TCP_UDP_SUM;
 #if ACENIC_DO_VLAN
if (vlan_tx_tag_present(skb)) {
@@ -2535,7 +2535,7 @@ #endif
 
mapping = ace_map_tx_skb(ap, skb, NULL, idx);
flagsize = (skb_headlen(skb) << 16);
-   if (skb->ip_summed == CHECKSUM_HW)
+   if (skb->ip_summed == CHECKSUM_PARTIAL)
flagsize |= BD_FLG_TCP_UDP_SUM;
 #if ACENIC_DO_VLAN
if (vlan_tx_tag_present(skb)) {
@@ -2561,7 +2561,7 @@ #endif
   PCI_DMA_TODEVICE);
 
flagsize = (frag->size << 16);
-   if (skb->ip_summed == CHECKSUM_HW)
+   if (skb->ip_summed == CHECKSUM_PARTIAL)
flagsize |= BD_FLG_TCP_UDP_SUM;
  

Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2

2006-07-25 Thread Patrick McHardy
Herbert Xu wrote:
> Hi Patrick:
> 
> On Wed, Jul 26, 2006 at 05:38:07AM +0200, Patrick McHardy wrote:
> 
>>I have a patch which changes netfilter to do incremental checksumming.
>>The hook number is passed to all functions doing this so they know
>>how to update the checksum. Could you explain how
>>CHECKSUM_COMPLETE/CHECKSUM_PARTIAL are going to be used? I assume
>>they're meant to avoid passing hook numbers around everywhere?
> 
> 
> Yes the hook number is another way to solve the same problem.  However,
> it can only be used within netfilter.  CHECKSUM_COMPLETE/CHECKSUM_PARTIAL
> on the other hand are valid throughout the stack.  With Xen feeding Linux
> packets into the stack the netfilter hook is also no longer sufficient to
> distinguish between these two cases as partial checksum packets can now
> appear on receive.
> 
> The problem is that you need to do different incremental updates depending
> on whether the checksum is complete (i.e., CHECKSUM_HW on receive), or
> partial (i.e., CHECKSUM_HW on transmit).
> 
> With complete checksums the current update code in netfilter can be used
> as is.  With partial checksums you need to exclude bits which weren't
> used when computing the partial checksums (e.g., TCP port numbers need
> to be excluded, but the IP address needs to be included for NAT).

That does sound better than the hook number approach.

> I have a patch that adds CHECKSUM_COMPLETE/CHECKSUM_PARTIAL if you want
> something to work from.  Let me know if you want this and I'll bounce it
> to you.

Please send it, I'll update my patch based on that. Thanks.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2

2006-07-25 Thread Herbert Xu
Hi Patrick:

On Wed, Jul 26, 2006 at 05:38:07AM +0200, Patrick McHardy wrote:
> 
> I have a patch which changes netfilter to do incremental checksumming.
> The hook number is passed to all functions doing this so they know
> how to update the checksum. Could you explain how
> CHECKSUM_COMPLETE/CHECKSUM_PARTIAL are going to be used? I assume
> they're meant to avoid passing hook numbers around everywhere?

Yes the hook number is another way to solve the same problem.  However,
it can only be used within netfilter.  CHECKSUM_COMPLETE/CHECKSUM_PARTIAL
on the other hand are valid throughout the stack.  With Xen feeding Linux
packets into the stack the netfilter hook is also no longer sufficient to
distinguish between these two cases as partial checksum packets can now
appear on receive.

The problem is that you need to do different incremental updates depending
on whether the checksum is complete (i.e., CHECKSUM_HW on receive), or
partial (i.e., CHECKSUM_HW on transmit).

With complete checksums the current update code in netfilter can be used
as is.  With partial checksums you need to exclude bits which weren't
used when computing the partial checksums (e.g., TCP port numbers need
to be excluded, but the IP address needs to be included for NAT).

I have a patch that adds CHECKSUM_COMPLETE/CHECKSUM_PARTIAL if you want
something to work from.  Let me know if you want this and I'll bounce it
to you.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Is the qla3xxx driver in the mainline?

2006-07-25 Thread Albert Lee
Hi Ron,

The qla3xxx driver is in the -mm tree.

For the mainline inclusion, Jeff seems ready to accept the driver:
http://marc.theaimsgroup.com/?l=linux-netdev&m=115334172810775&w=2

Just wondering if the qla3xxx driver is in the mainline yet?

(I am curious to have some performance comparison of
qla3xxx + open iscsi v.s. qla4xxx + on board TOE/iscsi.)

--
albert

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH Round 4 2/3] Core network changes to support network event notification.

2006-07-25 Thread Herbert Xu
On Tue, Jul 25, 2006 at 10:05:40AM -0500, Steve Wise wrote:
> 
> But they really are seeing a delete followed by an add.  That's what the
> kernel is doing.

Actually that's the other thing I don't really like.  The user-space
monitor may perceive that a route was actually deleted and replaced
by a new one even though this isn't what's happening at all.

In fact the problem here is that you're sending route notifications
when it's really the dst_entry that's changing.  User-space as it
stands only get notifications about fib changes which is quite different
from changes to the transient dst_entry objects which only exist in the
route cache.

Is anyone actually going to use the user-space interface of this? If not
perhaps we should wait until someone really needs it before adding the
netlink part of the patch.

We can change the kernel interface at will so if we make a mistake with
netevent it can be easily corrected.  For user-space though the rules
are totally different.  I'd really hate to be stuck with an interface
which turns out to not be the one that people actually want to have.

> The rdma driver needs to update all established rdma connections that
> are using the next-hop information of the existing route and make them
> use the next-hop information of the new route.  In addition, the rdma
> driver might have a reference to the old dst entry.  So it can release
> that ref and add a ref to the new dst entry.

Do you really need the old route for the user-space part of your patch?

> I have to admit I'm a little fuzzy on the routing stuff.  The main
> netevents I've utilized in the the rdma driver I'm writing is the
> neighbour update event and the redirect event.  Route add/del was added
> for completeness of "routing" netevents.   

So you mean you aren't going to use the route notifications? In that case
we should probably just drop them and add them when someone actually needs
it.  At that point they can tell us what semantics they want from it :)

> Can you expand further or point me to code where the IP stack "flushes
> its tables" when routes are changed?

Grep for rt_cache_flush in net/ipv4/fib_hash.c.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2

2006-07-25 Thread Patrick McHardy
Herbert Xu wrote:
> Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> 
>>It is not a bug, but remind to update nat helper function.
> 
> 
> Yes, I need to add CHECKSUM_COMPLETE vs. CHECKSUM_PARTIAL first so that
> we actually know which is which in NAT.


I have a patch which changes netfilter to do incremental checksumming.
The hook number is passed to all functions doing this so they know
how to update the checksum. Could you explain how
CHECKSUM_COMPLETE/CHECKSUM_PARTIAL are going to be used? I assume
they're meant to avoid passing hook numbers around everywhere?

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2

2006-07-25 Thread Herbert Xu
Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> On Tue, Jul 25, 2006 at 05:15:04PM +0200, Tomasz Torcz ([EMAIL PROTECTED]) 
> wrote:
>> BUG: warning at net/core/dev.c:1171/skb_checksum_help()
>>  [] show_trace_log_lvl+0x51/0xe6
>>  [] show_trace+0xa/0xc
>>  [] dump_stack+0x13/0x15
>>  [] skb_checksum_help+0x4d/0xeb
>>  [] ip_nat_fn+0x47/0x19a
> 
> It is not a bug, but remind to update nat helper function.

Yes, I need to add CHECKSUM_COMPLETE vs. CHECKSUM_PARTIAL first so that
we actually know which is which in NAT.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Herbert Xu
Alexey Kuznetsov <[EMAIL PROTECTED]> wrote:
> 
> I think you mean this.
> 
> Note, it is real skb_clone(), not alloc_skb(). Equeued skb contains
> the whole half-prepared netlink message plus room for the rest.
> It could be also skb_copy(), if we want to be puristic about mangling
> cloned data, but original copy is really not going to be used.  

I like this.  However, since the cloned skb is either discarded in case
of error, or queued in which case the caller discards its reference right
away, wouldn't it be simpler to just do this?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index ba33f86..0a2af08 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1593,6 +1593,7 @@ int ipmr_get_route(struct sk_buff *skb, 
read_unlock(&mrt_lock);
return -ENODEV;
}
+   skb_get(skb);
skb->nh.raw = skb_push(skb, sizeof(struct iphdr));
skb->nh.iph->ihl = sizeof(struct iphdr)>>2;
skb->nh.iph->saddr = rt->rt_src;
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MTU probing bug?

2006-07-25 Thread John Heffner

David Miller wrote:

I find it interesting that black hole detection is handled
different from a normal probe failure.  I guess here we are
dealing with a more significant failure, so we should start
at the thing which is most guarenteed to work.


Black hole detection is substantially different than probe failure. 
It's an indication that your current effective MTU is too large, and 
must be reduced.  A black hole basically means that your current path 
information is no good, either because your initial values are wrong, or 
the path changed.  So, you really want to reset everything to 
conservative values.  After a failed probe, your effective MTU is still 
fine, you've just gotten some additional information about the path 
(that it doesn't support the higher MTU).


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Netchannel: default, find, add to socket

2006-07-25 Thread David Miller
From: Kelly Daly <[EMAIL PROTECTED]>
Date: Tue, 11 Jul 2006 16:17:56 +1000

> Implement finding of correct netchannel for buffer, default netchannel and 
> attach a netchannel to a socket
> 
> Signed-off-by:  Kelly Daly <[EMAIL PROTECTED]>

Kelly, I want to apply this, but your email client has busted up the
patch by placing new lines all over.  It also has turned tabs into
spaces, which also corrupts the patch.

Can you configure KMail not to corrupt the patch like this and
resubmit?

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][IPv4/IPv6] Setting 0 for unused port field.

2006-07-25 Thread David Miller

Tetsuo-san can you please use a correct "From:" field in your patch
postings?  Thank you.

This will allow me to form a correct attribution when I apply your
patches in the future.

This time I had to perform a lengthy web search to find who is behind
these strange [EMAIL PROTECTED] email addresses
:-((  Using these per-listname email addresses really makes things
painful for other developers, even though it might simplify things
for you.

Thanks again.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][IPv4/IPv6] Setting 0 for unused port field. (fwd)

2006-07-25 Thread YOSHIFUJI Hideaki / 吉藤英明
In article <[EMAIL PROTECTED]> (at Tue, 25 Jul 2006 16:54:36 -0700 (PDT)), 
David Miller <[EMAIL PROTECTED]> says:

> > Well, instead, should it be initalized to protocol number, shouldn't it?
> 
> Initially, this was my reaction too.  But, aparently it is defined
> to be zero, from TCP/IP Illustrated Volume 2, page 1055, which is
:

Okay, Acked.

Acked-by: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>

--yoshfuji
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][IPv4/IPv6] Setting 0 for unused port field. (fwd)

2006-07-25 Thread David Miller
From: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>
Date: Tue, 25 Jul 2006 23:53:40 +0900 (JST)

> In article <[EMAIL PROTECTED]> (at Tue, 25 Jul 2006 10:45:51 -0400 (EDT)), 
> James Morris <[EMAIL PROTECTED]> says:
> 
> > The recvmsg() for raw socket seems to return random u16 value
> > from the kernel stack memory since port field is not initialized.
> > But I'm not sure this patch is correct.
> > Does raw socket return any information stored in port field?
> > 
> > -- Start of patch --
> > diff -ur before/net/ipv4/raw.c after/net/ipv4/raw.c
> > --- before/net/ipv4/raw.c   2006-06-18 10:49:35.0 +0900
> > +++ after/net/ipv4/raw.c2006-07-25 16:15:26.0 +0900
> > @@ -609,6 +609,7 @@
> > if (sin) {
> > sin->sin_family = AF_INET;
> > sin->sin_addr.s_addr = skb->nh.iph->saddr;
> > +   sin->sin_port = 0;
> > memset(&sin->sin_zero, 0, sizeof(sin->sin_zero));
> > }
> > if (inet->cmsg_flags)
> 
> Well, instead, should it be initalized to protocol number, shouldn't it?

Initially, this was my reaction too.  But, aparently it is defined
to be zero, from TCP/IP Illustrated Volume 2, page 1055, which is
discussing rip_input() in the BSD stack:

Unlike UDP, there is no concept of a port number in raw IP, so
the sin_port field in the sockaddr_in structures is always 0.

The BSD code does not explicitly set it to zero.  But it doesn't have
to because it uses "ripsrc" which is a static variable in the BSS
section of the BSD kernel image.

So this patch appears correct and I will apply it.  Thanks everyone.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread David Miller
From: Alexey Kuznetsov <[EMAIL PROTECTED]>
Date: Wed, 26 Jul 2006 01:17:25 +0400

> Hello!
> 
> > Wouldn't it be better to have a consistent interface (skb always freed),
> > and clone the skb if needed for deferred processing?
> 
> I think you mean this.
> 
> Note, it is real skb_clone(), not alloc_skb(). Equeued skb contains
> the whole half-prepared netlink message plus room for the rest.
> It could be also skb_copy(), if we want to be puristic about mangling
> cloned data, but original copy is really not going to be used.  

Applied, thanks Alexey.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] tg3: Update version and reldate

2006-07-25 Thread David Miller
From: "Michael Chan" <[EMAIL PROTECTED]>
Date: Tue, 25 Jul 2006 15:56:26 -0700

> Update version to 3.63.
> 
> Signed-off-by: Michael Chan <[EMAIL PROTECTED]>

Also applied, thanks again.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] tg3: Handle tg3_init_rings() failures

2006-07-25 Thread David Miller
From: "Michael Chan" <[EMAIL PROTECTED]>
Date: Tue, 25 Jul 2006 15:56:20 -0700

> Handle dev_alloc_skb() failures when initializing the RX rings.
> Without proper handling, the driver will crash when using a partial
> ring.
> 
> Thanks to Stephane Doyon <[EMAIL PROTECTED]> for reporting the bug and
> providing the initial patch.
> 
> Howie Xu <[EMAIL PROTECTED]> also reported the same issue.
> 
> Signed-off-by: Michael Chan <[EMAIL PROTECTED]>

Applied, thanks a lot.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] tg3: Add tg3_restart_hw()

2006-07-25 Thread David Miller
From: "Michael Chan" <[EMAIL PROTECTED]>
Date: Tue, 25 Jul 2006 15:56:13 -0700

> Add tg3_restart_hw() to handle failures when re-initializing the
> device.
> 
> Signed-off-by: Michael Chan <[EMAIL PROTECTED]>

Applied, thanks Michael.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] tg3: Update version and reldate

2006-07-25 Thread Michael Chan
Update version to 3.63.

Signed-off-by: Michael Chan <[EMAIL PROTECTED]>


diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index d66b06f..1b8138f 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -68,8 +68,8 @@
 
 #define DRV_MODULE_NAME"tg3"
 #define PFX DRV_MODULE_NAME": "
-#define DRV_MODULE_VERSION "3.62"
-#define DRV_MODULE_RELDATE "June 30, 2006"
+#define DRV_MODULE_VERSION "3.63"
+#define DRV_MODULE_RELDATE "July 25, 2006"
 
 #define TG3_DEF_MAC_MODE   0
 #define TG3_DEF_RX_MODE0


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] tg3: Add tg3_restart_hw()

2006-07-25 Thread Michael Chan
Add tg3_restart_hw() to handle failures when re-initializing the
device.

Signed-off-by: Michael Chan <[EMAIL PROTECTED]>


diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index ce6f3be..1253cec 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3590,6 +3590,28 @@ static irqreturn_t tg3_test_isr(int irq,
 static int tg3_init_hw(struct tg3 *, int);
 static int tg3_halt(struct tg3 *, int, int);
 
+/* Restart hardware after configuration changes, self-test, etc.
+ * Invoked with tp->lock held.
+ */
+static int tg3_restart_hw(struct tg3 *tp, int reset_phy)
+{
+   int err;
+
+   err = tg3_init_hw(tp, reset_phy);
+   if (err) {
+   printk(KERN_ERR PFX "%s: Failed to re-initialize device, "
+  "aborting.\n", tp->dev->name);
+   tg3_halt(tp, RESET_KIND_SHUTDOWN, 1);
+   tg3_full_unlock(tp);
+   del_timer_sync(&tp->timer);
+   tp->irq_sync = 0;
+   netif_poll_enable(tp->dev);
+   dev_close(tp->dev);
+   tg3_full_lock(tp, 0);
+   }
+   return err;
+}
+
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void tg3_poll_controller(struct net_device *dev)
 {
@@ -3630,13 +3652,15 @@ static void tg3_reset_task(void *_data)
}
 
tg3_halt(tp, RESET_KIND_SHUTDOWN, 0);
-   tg3_init_hw(tp, 1);
+   if (tg3_init_hw(tp, 1))
+   goto out;
 
tg3_netif_start(tp);
 
if (restart_timer)
mod_timer(&tp->timer, jiffies + 1);
 
+out:
tp->tg3_flags &= ~TG3_FLAG_IN_RESET_TASK;
 
tg3_full_unlock(tp);
@@ -4124,6 +4148,7 @@ static inline void tg3_set_mtu(struct ne
 static int tg3_change_mtu(struct net_device *dev, int new_mtu)
 {
struct tg3 *tp = netdev_priv(dev);
+   int err;
 
if (new_mtu < TG3_MIN_MTU || new_mtu > TG3_MAX_MTU(tp))
return -EINVAL;
@@ -4144,13 +4169,14 @@ static int tg3_change_mtu(struct net_dev
 
tg3_set_mtu(dev, tp, new_mtu);
 
-   tg3_init_hw(tp, 0);
+   err = tg3_restart_hw(tp, 0);
 
-   tg3_netif_start(tp);
+   if (!err)
+   tg3_netif_start(tp);
 
tg3_full_unlock(tp);
 
-   return 0;
+   return err;
 }
 
 /* Free up pending packets in all rx/tx rings.
@@ -5815,6 +5841,7 @@ static int tg3_set_mac_addr(struct net_d
 {
struct tg3 *tp = netdev_priv(dev);
struct sockaddr *addr = p;
+   int err = 0;
 
if (!is_valid_ether_addr(addr->sa_data))
return -EINVAL;
@@ -5832,9 +5859,9 @@ static int tg3_set_mac_addr(struct net_d
tg3_full_lock(tp, 1);
 
tg3_halt(tp, RESET_KIND_SHUTDOWN, 1);
-   tg3_init_hw(tp, 0);
-
-   tg3_netif_start(tp);
+   err = tg3_restart_hw(tp, 0);
+   if (!err)
+   tg3_netif_start(tp);
tg3_full_unlock(tp);
} else {
spin_lock_bh(&tp->lock);
@@ -5842,7 +5869,7 @@ static int tg3_set_mac_addr(struct net_d
spin_unlock_bh(&tp->lock);
}
 
-   return 0;
+   return err;
 }
 
 /* tp->lock is held. */
@@ -7956,7 +7983,7 @@ static void tg3_get_ringparam(struct net
 static int tg3_set_ringparam(struct net_device *dev, struct ethtool_ringparam 
*ering)
 {
struct tg3 *tp = netdev_priv(dev);
-   int irq_sync = 0;
+   int irq_sync = 0, err = 0;
   
if ((ering->rx_pending > TG3_RX_RING_SIZE - 1) ||
(ering->rx_jumbo_pending > TG3_RX_JUMBO_RING_SIZE - 1) ||
@@ -7980,13 +8007,14 @@ static int tg3_set_ringparam(struct net_
 
if (netif_running(dev)) {
tg3_halt(tp, RESET_KIND_SHUTDOWN, 1);
-   tg3_init_hw(tp, 1);
-   tg3_netif_start(tp);
+   err = tg3_restart_hw(tp, 1);
+   if (!err)
+   tg3_netif_start(tp);
}
 
tg3_full_unlock(tp);
   
-   return 0;
+   return err;
 }
   
 static void tg3_get_pauseparam(struct net_device *dev, struct 
ethtool_pauseparam *epause)
@@ -8001,7 +8029,7 @@ static void tg3_get_pauseparam(struct ne
 static int tg3_set_pauseparam(struct net_device *dev, struct 
ethtool_pauseparam *epause)
 {
struct tg3 *tp = netdev_priv(dev);
-   int irq_sync = 0;
+   int irq_sync = 0, err = 0;
   
if (netif_running(dev)) {
tg3_netif_stop(tp);
@@ -8025,13 +8053,14 @@ static int tg3_set_pauseparam(struct net
 
if (netif_running(dev)) {
tg3_halt(tp, RESET_KIND_SHUTDOWN, 1);
-   tg3_init_hw(tp, 1);
-   tg3_netif_start(tp);
+   err = tg3_restart_hw(tp, 1);
+   if (!err)
+   tg3_netif_start(tp);
}
 
tg3_full_unlock(tp);
   
-   return 0;
+   return err;
 }
   
 static u32 tg3_get_rx_csum(struct net_device *dev)
@@ -8666,7 +8695,9 @@ static int tg3_test_loopback(struct tg3 
if (!netif_running(tp->dev))
  

Re: async network I/O, event channels, etc

2006-07-25 Thread Nicholas Miell
On Tue, 2006-07-25 at 15:01 -0700, David Miller wrote:
> From: Ulrich Drepper <[EMAIL PROTECTED]>
> Date: Tue, 25 Jul 2006 12:23:53 -0700
> 
> > I was very much surprised by the reactions I got after my OLS talk.
> > Lots of people declared interest and even agreed with the approach and
> > asked me to do further ahead with all this.  For those who missed it,
> > the paper and the slides are available on my home page:
> > 
> > http://people.redhat.com/drepper/
> > 
> > As for the next steps I see a number of possible ways.  The discussions
> > can be held on the usual mailing lists (i.e., lkml and netdev) but due
> > to the raw nature of the current proposal I would imagine that would be
> > mainly perceived as noise.
> 
> Since I gave a big thumbs up for Evgivny's kevent work yesterday
> on linux-kernel, you might want to start by comparing your work
> to his.  Because his has the advantage that 1) we have code now
> and 2) he has written many test applications and performed many
> benchmarks against his code which has flushed out most of the
> major implementation issues.
> 
> I think most of the people who have encouraged your work are unaware
> of Evgivny's kevent stuff, which is extremely unfortunate, the two
> works are more similar than they are different.
> 
> I do not think discussing all of this on netdev would be perceived
> as noise. :)

While the comparing is going on, how does this compare to Solaris's
ports interface? It's documented at
http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrir?a=view

Also, since we're on the subject, why a whole new interface for event
queuing instead of extending the existing io_getevents(2) and friends?

-- 
Nicholas Miell <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] tg3: Handle tg3_init_rings() failures

2006-07-25 Thread Michael Chan
Handle dev_alloc_skb() failures when initializing the RX rings.
Without proper handling, the driver will crash when using a partial
ring.

Thanks to Stephane Doyon <[EMAIL PROTECTED]> for reporting the bug and
providing the initial patch.

Howie Xu <[EMAIL PROTECTED]> also reported the same issue.

Signed-off-by: Michael Chan <[EMAIL PROTECTED]>


diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 1253cec..d66b06f 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -4258,7 +4258,7 @@ static void tg3_free_rings(struct tg3 *t
  * end up in the driver.  tp->{tx,}lock are held and thus
  * we may not sleep.
  */
-static void tg3_init_rings(struct tg3 *tp)
+static int tg3_init_rings(struct tg3 *tp)
 {
u32 i;
 
@@ -4307,18 +4307,38 @@ static void tg3_init_rings(struct tg3 *t
 
/* Now allocate fresh SKBs for each rx ring. */
for (i = 0; i < tp->rx_pending; i++) {
-   if (tg3_alloc_rx_skb(tp, RXD_OPAQUE_RING_STD,
--1, i) < 0)
+   if (tg3_alloc_rx_skb(tp, RXD_OPAQUE_RING_STD, -1, i) < 0) {
+   printk(KERN_WARNING PFX
+  "%s: Using a smaller RX standard ring, "
+  "only %d out of %d buffers were allocated "
+  "successfully.\n",
+  tp->dev->name, i, tp->rx_pending);
+   if (i == 0)
+   return -ENOMEM;
+   tp->rx_pending = i;
break;
+   }
}
 
if (tp->tg3_flags & TG3_FLAG_JUMBO_RING_ENABLE) {
for (i = 0; i < tp->rx_jumbo_pending; i++) {
if (tg3_alloc_rx_skb(tp, RXD_OPAQUE_RING_JUMBO,
--1, i) < 0)
+-1, i) < 0) {
+   printk(KERN_WARNING PFX
+  "%s: Using a smaller RX jumbo ring, "
+  "only %d out of %d buffers were "
+  "allocated successfully.\n",
+  tp->dev->name, i, tp->rx_jumbo_pending);
+   if (i == 0) {
+   tg3_free_rings(tp);
+   return -ENOMEM;
+   }
+   tp->rx_jumbo_pending = i;
break;
+   }
}
}
+   return 0;
 }
 
 /*
@@ -5969,7 +5989,9 @@ static int tg3_reset_hw(struct tg3 *tp, 
 * can only do this after the hardware has been
 * successfully reset.
 */
-   tg3_init_rings(tp);
+   err = tg3_init_rings(tp);
+   if (err)
+   return err;
 
/* This value is determined during the probe time DMA
 * engine test, tg3_test_dma.


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: async network I/O, event channels, etc

2006-07-25 Thread David Miller
From: Ulrich Drepper <[EMAIL PROTECTED]>
Date: Tue, 25 Jul 2006 12:23:53 -0700

> I was very much surprised by the reactions I got after my OLS talk.
> Lots of people declared interest and even agreed with the approach and
> asked me to do further ahead with all this.  For those who missed it,
> the paper and the slides are available on my home page:
> 
> http://people.redhat.com/drepper/
> 
> As for the next steps I see a number of possible ways.  The discussions
> can be held on the usual mailing lists (i.e., lkml and netdev) but due
> to the raw nature of the current proposal I would imagine that would be
> mainly perceived as noise.

Since I gave a big thumbs up for Evgivny's kevent work yesterday
on linux-kernel, you might want to start by comparing your work
to his.  Because his has the advantage that 1) we have code now
and 2) he has written many test applications and performed many
benchmarks against his code which has flushed out most of the
major implementation issues.

I think most of the people who have encouraged your work are unaware
of Evgivny's kevent stuff, which is extremely unfortunate, the two
works are more similar than they are different.

I do not think discussing all of this on netdev would be perceived
as noise. :)


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can we have GET_NETDEV_DEV?

2006-07-25 Thread Pavel Roskin
On Tue, 2006-07-25 at 14:26 -0700, Stephen Hemminger wrote:
> On Tue, 25 Jul 2006 17:08:11 -0400
> Pavel Roskin <[EMAIL PROTECTED]> wrote:
> > Considering the drivers that are already in the kernel, you may prefer
> > to have a more high-level function that would clone the network device
> > by copying most of the net_device structure.  I think netdev_get_pdev()
> > would be mostly used for such cloning if implemented.
> > 
> 
> What is the wireless tree using for this?

Suppose there is a wireless device that can be used on different buses,
e.g. PCI and PCMCIA.  The driver consists of the common part that works
with the chip and the bus specific parts that allocate the resources.

The bus specific parts call SET_NETDEV_DEV to associate the network
device with the hardware device.

Now, suppose the user requests creating a WDS interface.  WDS interfaces
use 4-address headers, which allows bridging.  As it stands now, the way
to support a WDS interface is to make is a separate network device.  I
wish it wouldn't be necessary, but we would need to change quite a few
code inside and outside the kernel to understand WDS addressing (I hope
to be wrong about that).

There is nothing hardware specific about WDS interfaces.  They are just
extensions of the main interface with a fixed "receive address" and
4-address headers enabled.  So it makes sense to create WDS interface in
the common (i.e. bus-agnostic) part of the driver.

In the (possibly misguided) attempt to make WDS network devices look
like real network interfaces, the "class_dev" field is copied from the
master network device, because it's something that can be done in the
bus-agnostic way:

SET_NETDEV_DEV(dev, mdev->class_dev.dev);

This is how it's done in HostAP (in the kernel) and MadWifi (outside the
kernel).

Sure, this value can be stored in the private part of the driver, so
it's a minor issue.

(Actually, the main (AP or STA) interface is also made a virtual network
device, so that the the master interface appears as the parent of the
main and the WDS devices and as the single point for monitoring of all
wireless traffic.)

-- 
Regards,
Pavel Roskin


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MTU probing bug?

2006-07-25 Thread David Miller
From: John Heffner <[EMAIL PROTECTED]>
Date: Tue, 25 Jul 2006 09:56:38 -0400

> David Miller wrote:
> > John, have a look at this code in tcp_write_timeout():
> > 
> > mss = min(sysctl_tcp_base_mss,
> >   tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)/2);
> > mss = max(mss, 68 - tp->tcp_header_len);
> > 
> > That first line looks like it should be a max() instead
> > of a min().
> > 
> > tcp_base_mss is the smallest MSS we should use, therefore
> > we should make sure the "mss" is at least that large.
> > 
> > It's also possible that I misread the intention of this code :) From
> > what I read, it is trying to half the MSS in use and adjust the MTU
> > search low point to be based upon this new value.
> 
> No, the min() is what's intended here.  The base_mss is where you want 
> to start searching from.  So, on black hole detection, you drop 
> immediatly down to the base.  The base is configurable, because making 
> it higher can make searching faster.  If it's still too high for some 
> links, you halve it again on successive timeouts.

Thanks for the clarification John.

I find it interesting that black hole detection is handled
different from a normal probe failure.  I guess here we are
dealing with a more significant failure, so we should start
at the thing which is most guarenteed to work.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can we have GET_NETDEV_DEV?

2006-07-25 Thread Greg KH
On Tue, Jul 25, 2006 at 10:20:05AM -0700, Stephen Hemminger wrote:
> On Tue, 25 Jul 2006 00:52:25 -0400
> Pavel Roskin <[EMAIL PROTECTED]> wrote:
> 
> > Hello!
> > 
> > gregkh-driver-network-class_device-to-device.patch, which briefly
> > appeared in Linux 2.6.18-rc1-mm1 broke MadWifi, which is copying the
> > physical device information from the master network device to the
> > virtual network devices:
> > 
> > SET_NETDEV_DEV(dev, mdev->class_dev.dev);
> >
> 
> I would rather see SET_NETDEV_DEV go away. It was done for source
> compatibility between 2.4 and 2.6 network device drivers. This is
> no longer really important.

It did make my change much simpler to do, which I appreciated :)

> It would be better to either access the network device directly.
> BUT, there are plans to get rid of class_dev so that would mean
> source changes to all the drivers again...
> 
> So how about these wrappers.

They look good to me.

BTW, I'll be getting the patch back into -mm that does the class_device
-> device conversion after I fix the SuSE 10.0 release that doesn't like
the change and get that pushed out to users.  After that happens, there
shouldn't be any userspace problems with this patch.

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can we have GET_NETDEV_DEV?

2006-07-25 Thread Stephen Hemminger
On Tue, 25 Jul 2006 17:08:11 -0400
Pavel Roskin <[EMAIL PROTECTED]> wrote:

> Hello, Stephen!
> 
> On Tue, 2006-07-25 at 10:20 -0700, Stephen Hemminger wrote:
> 
> > So how about these wrappers.
> 
> > +static inline void netdev_set_pdev(struct net_device *dev, struct device 
> > *pdev)
> > +{
> > +   dev->class_dev.dev = pdev;
> > +}
> > +
> > +static inline struct device *netdev_get_pdev(struct net_device *dev)
> > +{
> > +   return dev->class_dev.dev;
> > +}
> 
> The positive effect of a macro is that if would allow MadWifi to prepare
> in advance by testing if GET_NETDEV_DEV is defined.
> 
> The functions may or may not be useful for the code in the kernel, but
> I'll not be able to prepare MadWifi unless I know the kernel version in
> which they will appear.
> 
> Considering the drivers that are already in the kernel, you may prefer
> to have a more high-level function that would clone the network device
> by copying most of the net_device structure.  I think netdev_get_pdev()
> would be mostly used for such cloning if implemented.
> 

What is the wireless tree using for this?
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Stephen Hemminger
On Wed, 26 Jul 2006 01:17:25 +0400
Alexey Kuznetsov <[EMAIL PROTECTED]> wrote:

> Hello!
> 
> > Wouldn't it be better to have a consistent interface (skb always freed),
> > and clone the skb if needed for deferred processing?
> 
> I think you mean this.
> 
> Note, it is real skb_clone(), not alloc_skb(). Equeued skb contains
> the whole half-prepared netlink message plus room for the rest.
> It could be also skb_copy(), if we want to be puristic about mangling
> cloned data, but original copy is really not going to be used.  
> 
> Alexey
> 

That is what I was thinking of. Doesn't matter copy or clone, I prefer
clone.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello!

> Wouldn't it be better to have a consistent interface (skb always freed),
> and clone the skb if needed for deferred processing?

I think you mean this.

Note, it is real skb_clone(), not alloc_skb(). Equeued skb contains
the whole half-prepared netlink message plus room for the rest.
It could be also skb_copy(), if we want to be puristic about mangling
cloned data, but original copy is really not going to be used.  

Alexey


diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 9ccacf5..85893ee 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1578,6 +1578,7 @@ int ipmr_get_route(struct sk_buff *skb, 
cache = ipmr_cache_find(rt->rt_src, rt->rt_dst);
 
if (cache==NULL) {
+   struct sk_buff *skb2;
struct net_device *dev;
int vif;
 
@@ -1591,12 +1592,18 @@ int ipmr_get_route(struct sk_buff *skb, 
read_unlock(&mrt_lock);
return -ENODEV;
}
-   skb->nh.raw = skb_push(skb, sizeof(struct iphdr));
-   skb->nh.iph->ihl = sizeof(struct iphdr)>>2;
-   skb->nh.iph->saddr = rt->rt_src;
-   skb->nh.iph->daddr = rt->rt_dst;
-   skb->nh.iph->version = 0;
-   err = ipmr_cache_unresolved(vif, skb);
+   skb2 = skb_clone(skb, GFP_ATOMIC);
+   if (!skb2) {
+   read_unlock(&mrt_lock);
+   return -ENOMEM;
+   }
+
+   skb2->nh.raw = skb_push(skb2, sizeof(struct iphdr));
+   skb2->nh.iph->ihl = sizeof(struct iphdr)>>2;
+   skb2->nh.iph->saddr = rt->rt_src;
+   skb2->nh.iph->daddr = rt->rt_dst;
+   skb2->nh.iph->version = 0;
+   err = ipmr_cache_unresolved(vif, skb2);
read_unlock(&mrt_lock);
return err;
}
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can we have GET_NETDEV_DEV?

2006-07-25 Thread Pavel Roskin
Hello, Stephen!

On Tue, 2006-07-25 at 10:20 -0700, Stephen Hemminger wrote:

> So how about these wrappers.

> +static inline void netdev_set_pdev(struct net_device *dev, struct device 
> *pdev)
> +{
> + dev->class_dev.dev = pdev;
> +}
> +
> +static inline struct device *netdev_get_pdev(struct net_device *dev)
> +{
> + return dev->class_dev.dev;
> +}

The positive effect of a macro is that if would allow MadWifi to prepare
in advance by testing if GET_NETDEV_DEV is defined.

The functions may or may not be useful for the code in the kernel, but
I'll not be able to prepare MadWifi unless I know the kernel version in
which they will appear.

Considering the drivers that are already in the kernel, you may prefer
to have a more high-level function that would clone the network device
by copying most of the net_device structure.  I think netdev_get_pdev()
would be mostly used for such cloning if implemented.

-- 
Regards,
Pavel Roskin


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Stop calling phy_stop_interrupts() twice

2006-07-25 Thread Sergei Shtylylov
Prevent phylib from freeing PHY IRQ twice on closing an eth device:
phy_disconnect() first calls phy_stop_interrupts(), then it calls 
phy_stop_machine() which in turn calls phy_stop_interrupts() making the 
kernel complain on each bootup...

Signed-off-by: Sergei Shtylyov <[EMAIL PROTECTED]>

Index: linux-2.6/drivers/net/phy/phy.c
===
--- linux-2.6.orig/drivers/net/phy/phy.c
+++ linux-2.6/drivers/net/phy/phy.c
@@ -419,9 +419,8 @@ void phy_start_machine(struct phy_device
 
 /* phy_stop_machine
  *
- * description: Stops the state machine timer, sets the state to
- *   UP (unless it wasn't up yet), and then frees the interrupt,
- *   if it is in use. This function must be called BEFORE
+ * description: Stops the state machine timer, sets the state to UP
+ *   (unless it wasn't up yet). This function must be called BEFORE
  *   phy_detach.
  */
 void phy_stop_machine(struct phy_device *phydev)
@@ -433,9 +432,6 @@ void phy_stop_machine(struct phy_device 
phydev->state = PHY_UP;
spin_unlock(&phydev->lock);
 
-   if (phydev->irq != PHY_POLL)
-   phy_stop_interrupts(phydev);
-
phydev->adjust_state = NULL;
 }
 

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bandwidth limitation help

2006-07-25 Thread Ian McDonald

On 7/26/06, Piotrowski, Ted P. <[EMAIL PROTECTED]> wrote:

Hi,

I am new to the mailing list so I'm not sure if anybody reads these, but
here goes nothing. I recently read: Linux Advanced Routing & Traffic
Control HOWTO and have been trying to test my applications using
bandwidth limitation. All the examples described in the HOWTO do not
simulate the conditions I need to test my software. What I would like is
for my bandwidth limitation to empty my UDP buffer at a given rate. I
have tried using a simple TBF to do this, but all that happens is that
my application floods the TBF buffer at link speed and the TBF buffer
quickly overflows and drops packets. I want the packets to actually stay
in the UDP buffer and be emptied at a given rate without modifying my
application.

I don't know if any of you are familiar with netem, but it can be used
in conjuction with tc to add delay to a link. Surprisingly, packets
delayed by netem appear to remain in the UDP buffer until it is time for
them to be sent. I would like this same behavior of keeping the packets
in the UDP buffer, but with bandwidth limitation on the rate at which
the buffer empties, not just packet delay. Has anybody ever done
anything like this or can point me to some resources?


Have a look at:
http://linux-net.osdl.org/index.php/Netem

I have written my own test scenarios using examples from this website
but I can also send you my small scripts if you want.

Ian
--
Ian McDonald
Web: http://wand.net.nz/~iam4
Blog: http://imcdnzl.blogspot.com
WAND Network Research Group
Department of Computer Science
University of Waikato
New Zealand
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


bandwidth limitation help

2006-07-25 Thread Piotrowski, Ted P.
Hi,
 
I am new to the mailing list so I'm not sure if anybody reads these, but
here goes nothing. I recently read: Linux Advanced Routing & Traffic
Control HOWTO and have been trying to test my applications using
bandwidth limitation. All the examples described in the HOWTO do not
simulate the conditions I need to test my software. What I would like is
for my bandwidth limitation to empty my UDP buffer at a given rate. I
have tried using a simple TBF to do this, but all that happens is that
my application floods the TBF buffer at link speed and the TBF buffer
quickly overflows and drops packets. I want the packets to actually stay
in the UDP buffer and be emptied at a given rate without modifying my
application.
 
I don't know if any of you are familiar with netem, but it can be used
in conjuction with tc to add delay to a link. Surprisingly, packets
delayed by netem appear to remain in the UDP buffer until it is time for
them to be sent. I would like this same behavior of keeping the packets
in the UDP buffer, but with bandwidth limitation on the rate at which
the buffer empties, not just packet delay. Has anybody ever done
anything like this or can point me to some resources?
 
Thank you,
Ted
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Who maintains the website ?

2006-07-25 Thread Ian McDonald

On 7/26/06, Christophe Devriese <[EMAIL PROTECTED]> wrote:

The http://linux-net.osdl.org/index.php/Main_Page website I mean.


It's a Wiki so anybody can alter content on the website. The exception
to this is that particular page - the main page. If you want something
altered on that particular page send email to one of the sysops.
--
Ian McDonald
Web: http://wand.net.nz/~iam4
Blog: http://imcdnzl.blogspot.com
WAND Network Research Group
Department of Computer Science
University of Waikato
New Zealand
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


bandwidth limitation help

2006-07-25 Thread Piotrowski, Ted P.
Hi,
 
I am new to the mailing list so I'm not sure if anybody reads these, but
here goes nothing. I recently read: Linux Advanced Routing & Traffic
Control HOWTO and have been trying to test my applications using
bandwidth limitation. All the examples described in the HOWTO do not
simulate the conditions I need to test my software. What I would like is
for my bandwidth limitation to empty my UDP buffer at a given rate. I
have tried using a simple TBF to do this, but all that happens is that
my application floods the TBF buffer at link speed and the TBF buffer
quickly overflows and drops packets. I want the packets to actually stay
in the UDP buffer and be emptied at a given rate without modifying my
application.
 
I don't know if any of you are familiar with netem, but it can be used
in conjuction with tc to add delay to a link. Surprisingly, packets
delayed by netem appear to remain in the UDP buffer until it is time for
them to be sent. I would like this same behavior of keeping the packets
in the UDP buffer, but with bandwidth limitation on the rate at which
the buffer empties, not just packet delay. Has anybody ever done
anything like this or can point me to some resources?
 
Thank you,
Ted P.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-25 Thread Tom Tucker
On Mon, 2006-07-24 at 15:23 -0700, David Miller wrote: 
> From: Tom Tucker <[EMAIL PROTECTED]>
> Date: Wed, 05 Jul 2006 12:09:42 -0500
> 
> > "A TOE net stack is closed source firmware. Linux engineers have no way
> > to fix security issues that arise. As a result, only non-TOE users will
> > receive security updates, leaving random windows of vulnerability for
> > each TOE NIC's users."
> > 
> > - A Linux security update may or may not be relevant to a vendors
> > implementation. 
> > 
> > - If a vendor's implementation has a security issue then the customer
> > must rely on the vendor to fix it. This is no less true for iWARP than
> > for any adapter.
> 
> This isn't how things actually work.
> 
> Users have a computer, and they can rightly expect the community
> to help them solve problems that occur in the upstream kernel.
> 
> When a bug is found and the person is using NIC X, we don't
> necessarily forward the bug report to the vendor of NIC X.
> Instead we try to fix the bug.  Many chip drivers are maintained
> by people who do not work for the company that makes the chip,
> and this works just fine.
> 
> If only the chip vendor can fix a security problem, this makes Linux
> less agile to fix.  Even aspect of a problem on a Linux system that
> cannot be fixed entirely by the community is a net negative for Linux.
> 

All true. What I meant to say was that this is "no less true than for
any deep adapter". It is incontrovertible that a deep adapter is less
flexible, and more difficult to support than a shallow adapter.

> > - iWARP needs to do protocol processing in order to validate and
> > evaluate TCP payload in advance of direct data placement. This
> > requirement is independent of CPU speed. 
> 
> Yet, RDMA itself is just an optimization meant to deal with
> limitations of cpu and memory speed.  You can rephrase the
> situation in whatever way suits your argument, but it does not
> make the core issue go away :)

Yep.

> 
> > - I suspect that connection rates for RDMA adapters fall well-below the
> > rates attainable with a dumb device. That said, all of the RDMA
> > applications that I know of are not connection intensive. Even for TOE,
> > the later HTTP versions makes connection rates less of an issue.
> 
> This is a very naive evaluation of the situation.  Yes, newer
> versions of protocols such as HTTP make the per-client connection
> burdon lower, but the number of clients will increase in time to
> more than makeup for whatever gains are seen due to this.

Naive is being kind, my HTTP comment is irrelevant :).  

> And then you have protocols which by design are connection heavy,
> and rightly so, such as bittorrent.
> 
> Being able to handle enormous numbers of connections, with extreme
> scalability and low latency, is an absolute requirement of any modern
> day serious TCP stack.  And this requirement is not going away.
> Wishing this requirement away due to HTTP persistent connections is
> very unrealistic, at best.
> 
> > - This is the problem we're trying to solve...incrementally and
> > responsibly.
> 
> You can't.  See my email to Roland about why even VJ net channels
> are found to be impractical.  To support netfilter properly, you
> must traverse the whole netfilter stack, because NAT can rewrite
> packets, yet still make them destined for the local system, and
> thus they will have a different identity for connection demux
> by the time the TCP stack sees the packet.
> 

I'm not claiming that all the problems can be solved, I'm suggesting
that integration could be better and that partial integration is better
than none. 

> All of these tranformations occur between normal packet receive
> and the TCP stack.  You would therefore need to put your card
> between netfilter and TCP in the packet input path, and at that
> point why bother with the stateful card at all?
> 
> The fact is that stateless approaches will always be better than
> stateful things because you cannot replicate the functionality we
> have in the Linux stack without replicating 10 years of work into
> your chip's firmware.  At that point you should just run Linux
> on your NIC since that is what you are effectively doing :)
> 

I wish...I'd have a better stack. 

> In conversations such as these, it helps us a lot if you can be frank
> and honest about the true absolute limitations of your technology.  

I'm trying ... classifying these limitations as "core can't fix" and
"fixable with integration" is where we're getting crosswise. 

> I
> can see that your viewpoint is tainted when I hear things such as HTTP
> persistent connections being used as a reason why high TCP connection
> rates won't matter in the future.  Such assertions are understood to
> be patently false by anyone who understands TCP and how it is used in
> the real world.

Partial "Fixable with Integration" Summary

- ARP Resolution
- ICMP Redirect
- Path MTU Change
- Route Update
- Colliding TCP Port Spaces

Partial "Can't Fix" Issues Summary:

Who maintains the website ?

2006-07-25 Thread Christophe Devriese
The http://linux-net.osdl.org/index.php/Main_Page website I mean.

-- 
--
Christophe Devriese  EURiD
Network Adminstrator / Developer
[EMAIL PROTECTED]
--- http://www.eth1.org --

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello!

> Wouldn't it be better to have a consistent interface (skb always freed),
> and clone the skb if needed for deferred processing?

I am sorry, I misunderstood you. I absolutely agree. It is much better,
the variant which I suggested is a good sample of bad programming. :-)

Alexey

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can we have GET_NETDEV_DEV?

2006-07-25 Thread Stephen Hemminger
On Tue, 25 Jul 2006 00:52:25 -0400
Pavel Roskin <[EMAIL PROTECTED]> wrote:

> Hello!
> 
> gregkh-driver-network-class_device-to-device.patch, which briefly
> appeared in Linux 2.6.18-rc1-mm1 broke MadWifi, which is copying the
> physical device information from the master network device to the
> virtual network devices:
> 
> SET_NETDEV_DEV(dev, mdev->class_dev.dev);
>

I would rather see SET_NETDEV_DEV go away. It was done for source
compatibility between 2.4 and 2.6 network device drivers. This is
no longer really important.

It would be better to either access the network device directly.
BUT, there are plans to get rid of class_dev so that would mean
source changes to all the drivers again...

So how about these wrappers.

--- a/include/linux/netdevice.h 2006-07-24 10:56:59.0 -0700
+++ b/include/linux/netdevice.h 2006-07-25 10:18:12.0 -0700
@@ -535,10 +535,23 @@
 }
 
 #define SET_MODULE_OWNER(dev) do { } while (0)
+
 /* Set the sysfs physical device reference for the network logical device
  * if set prior to registration will cause a symlink during initialization.
  */
-#define SET_NETDEV_DEV(net, pdev)  ((net)->class_dev.dev = (pdev))
+static inline void netdev_set_pdev(struct net_device *dev, struct device *pdev)
+{
+   dev->class_dev.dev = pdev;
+}
+
+static inline struct device *netdev_get_pdev(struct net_device *dev)
+{
+   return dev->class_dev.dev;
+}
+
+
+/* old style macro for compatiablity */
+#define SET_NETDEV_DEV(net, pdev)  netdev_set_pdev(net, pdev)
 
 struct packet_type {
__be16  type;   /* This is really htons(ether_type). */
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Runtime power management for network interfaces

2006-07-25 Thread David Brownell
On Tuesday 25 July 2006 8:59 am, Alan Stern wrote:
> During a Power Management session at the Ottawa Linux Symposium, it was
> generally agreed that network interface drivers ought to automatically
> suspend their devices (if possible) whenever:
> 
> (1) The interface is ifconfig'ed down, or
> 
> (2) No link is available.
> 
> Presumably (1) should be easy enough to implement.

I can't imagine that there are many network drivers that couldn't
put the hardware in some kind of low power mode at that point.

All PCI drivers can certainly set_pci_power_state(pdev, PCI_D3hot)
and save a certain amount of power.  Of course, that means coping
with a possible device reset when going to PCI_D0, and I can imagine
some devices might prefer to use PCI_D2 rather than reload firmware.

Embedded chips probably have an easier time with this; just turn
off all the clocks going to that controller and its PHY.  


> (2) might or might not 
> be feasible, depending on how much WOL support is available.

A network adapter without PHY support in the WOL hardware wouldn't be
able to implement (2).  And in today's systems, where we can't rely on
ACPI+PCI to morph PME# to pci_driver.resume() during runtime suspend,
not all platforms can implement (2).  Embedded chips probably have an
easier time here ... I know for a fact that at91_ether can do that sort
of thing easily, at least on boards where the PHY IRQ is hooked up, so
that the driver doesn't need to poll link status.

The USB analogy here is that _some_ external transceivers (PHY) can
report link state changes via IRQs, supporting a mode where everything
else is powered off until the link can be active, at which point the
controller can come out of its low power state.  Those are of course
transceivers intended for use in systems with itty bitty batteries.


(I keep thinking wakeup event handling is pretty weak in Linux today.
As you know, we're still shaking the USB HCDs into shape there, since
some system configs have issues.  Network adapters seem to be another
major use case for wakeup events in Linux, and I get the impression
they're not as widely used -- or functional -- as would be good.)


> (It might 
> not be feasible at all for wireless networking.)  Still, there can be no
> question that it would be a Good Thing for laptops to power-down their
> ethernet controllers when the network cable is unplugged.
> 
> Has any progress been made in this direction?  If not, a natural approach 
> would be to start with a reference implementation in one driver which 
> could then be copied to other drivers.
> 
> Any suggestions?

My initial thought was that network drivers could be refactored so
that they have ifsuspend() and ifresume() methods to put the hardware
into the low power state.

The remove(), open() and resume() methods would call ifresume(); probe(),
close() and suspend() would call ifsuspend() ... thatid give (1).  And
for hardware supporting (2) there'd be some housekeeping for the WOL support
during suspend() and resume(), in addition to netif_device_{de,at}tach().

For hardware where a PHY can report link state without requiring an active
Ethernet controller (e.g. at91_ether for sure) some events could trigger
ifresume()/ifsuspend() when the interface is active.

- Dave

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello!

> checking tools because the skb lifetime depends on the return value.
> Wouldn't it be better to have a consistent interface (skb always freed),
> and clone the skb if needed for deferred processing?

But skb is not always freed in any case.
Normally it is submitted to netlink_unicast(). It is freed only in error case.
So, following this logic should not you clone skb to feed to netlink_unicast()?

Alexey

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Runtime power management for network interfaces

2006-07-25 Thread Auke Kok

Alan Stern wrote:

During a Power Management session at the Ottawa Linux Symposium, it was
generally agreed that network interface drivers ought to automatically
suspend their devices (if possible) whenever:

(1) The interface is ifconfig'ed down, or

(2) No link is available.

Presumably (1) should be easy enough to implement.  (2) might or might not
be feasible, depending on how much WOL support is available.  (It might
not be feasible at all for wireless networking.)  Still, there can be no
question that it would be a Good Thing for laptops to power-down their
ethernet controllers when the network cable is unplugged.

Has any progress been made in this direction?  If not, a natural approach 
would be to start with a reference implementation in one driver which 
could then be copied to other drivers.



Intel's newer e1000's (ich7 onboard e1000 and newer versions for instance) 
already support this feature partially - the MAC stays on but the PHY can be 
powered off when no link is present.


In order to enable this feature you will need to turn it on explicitly at load 
time:


modprobe e1000 SmartPowerDownEnable=1

Allthough not the entire NIC is powered off, it still saves a significant part 
of the power consumed by the NIC. All bits help. The power automatically 
restores once a cable is plugged in.



Cheers,

Auke
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Stephen Hemminger
On Tue, 25 Jul 2006 20:20:01 +0400
Alexey Kuznetsov <[EMAIL PROTECTED]> wrote:

> Hello!
> 
> > Code was reusing an skb which could lead to use after free or double free.
> 
> No, this does not help. The bug is not here.
> 
> I was so ashamed of this that could not touch the thing. :-)
> It startled me a lot, how is it possible that the thing was in production
> for several years and such bad bug never was noticed?
> 
> Now it is clear. skbs leaked sometimes before BK changeset 1.889.26.53,
> subj: [IPV4]: Fix skb leak in inet_rtm_getroute. But after this fix,
> which introduced new bug (goto out_free in the enclosed patch),
> the bug showed on surface.
> 
> Please, review this.
> 
> - ipmr_cache_unresolved() does not free skb. Caller does.
> - goto out_free in route.c in the case, when skb is enqueued
>   is returned to goto out.
> - some cosmetic cleanup in rt_fill_info() to make it more readable.

This looks correct, but it may lead to false reports from automated
checking tools because the skb lifetime depends on the return value.
Wouldn't it be better to have a consistent interface (skb always freed),
and clone the skb if needed for deferred processing?
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-25 Thread Andi Kleen

> It may be a hardware interpretation, but doesn't it have non-trivial system 
> implications - where one runs threads/processes etc?

Only if you do process context RX processing. If you chose not to it doesn't 
have much influence.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-25 Thread Rick Jones

David Miller wrote:

From: Rick Jones <[EMAIL PROTECTED]>
Date: Mon, 24 Jul 2006 17:55:24 -0700


Even enough bits for 1024 or 2048 CPUs in the single system image?  I have seen 
1024 touted by SGI, and with things going so multi-core, perhaps 16384 while 
sounding initially bizzare would be in the realm of theoretically possible 
before to long.



Read the RSS NDIS documents from Microsoft. 


I'll see about hunting them down.


You aren't going to want
to demux to more than, say, 256 cpus for single network adapter even
on the largest machines.


I suppose, it just seems to tweak _small_ alarms in my intuition - maybe because 
it still sounds like networking telling the scheduler where to run threads of 
execution, and even though I'm a networking guy I seem to have the notion that 
it should be the other way 'round.



That would cover TCP, are there similarly fungible fields in SCTP or
other ULPs?  And if we were to want to get HW support for the thing,
getting it adopted in a de jure standards body would probably be in
order :)



Microsoft never does this, neither do we.  LRO came out of our own
design, the network folks found it reasonable and thus they have
started to implement it.  The same is true for Microsofts RSS stuff.

It's a hardware interpretation, therefore it belongs in a driver API
specification, nowhere else.


It may be a hardware interpretation, but doesn't it have non-trivial system 
implications - where one runs threads/processes etc?


rick jones
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ip multicast route bug fix (rev2)

2006-07-25 Thread Stephen Hemminger
IP multicast route code was reusing an skb which causes
use after free and double free.

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
---
 net/ipv4/ipmr.c |   22 --
 1 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index ba33f86..f5d3d96 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1580,6 +1580,7 @@ int ipmr_get_route(struct sk_buff *skb, 
cache = ipmr_cache_find(rt->rt_src, rt->rt_dst);
 
if (cache==NULL) {
+   struct sk_buff *skb2;
struct net_device *dev;
int vif;
 
@@ -1593,12 +1594,21 @@ int ipmr_get_route(struct sk_buff *skb, 
read_unlock(&mrt_lock);
return -ENODEV;
}
-   skb->nh.raw = skb_push(skb, sizeof(struct iphdr));
-   skb->nh.iph->ihl = sizeof(struct iphdr)>>2;
-   skb->nh.iph->saddr = rt->rt_src;
-   skb->nh.iph->daddr = rt->rt_dst;
-   skb->nh.iph->version = 0;
-   err = ipmr_cache_unresolved(vif, skb);
+   
+   skb2 = alloc_skb(sizeof(struct iphdr) + sizeof(struct nlmsghdr),
+GFP_ATOMIC);
+   if (!skb2) {
+   read_unlock(&mrt_lock);
+   return -ENOMEM;
+   }
+
+   memset(skb2->data, 0, sizeof(struct iphdr));
+   skb2->nh.raw = skb2->data;
+   skb2->nh.iph->ihl = sizeof(struct iphdr)>>2;
+   skb2->nh.iph->saddr = rt->rt_src;
+   skb2->nh.iph->daddr = rt->rt_dst;
+
+   err = ipmr_cache_unresolved(vif, skb2);
read_unlock(&mrt_lock);
return err;
}
-- 
1.4.0

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello!

> Code was reusing an skb which could lead to use after free or double free.

No, this does not help. The bug is not here.

I was so ashamed of this that could not touch the thing. :-)
It startled me a lot, how is it possible that the thing was in production
for several years and such bad bug never was noticed?

Now it is clear. skbs leaked sometimes before BK changeset 1.889.26.53,
subj: [IPV4]: Fix skb leak in inet_rtm_getroute. But after this fix,
which introduced new bug (goto out_free in the enclosed patch),
the bug showed on surface.

Please, review this.

- ipmr_cache_unresolved() does not free skb. Caller does.
- goto out_free in route.c in the case, when skb is enqueued
  is returned to goto out.
- some cosmetic cleanup in rt_fill_info() to make it more readable.

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 9ccacf5..1788c04 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -640,8 +640,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc
if (atomic_read(&cache_resolve_queue_len)>=10 ||
(c=ipmr_cache_alloc_unres())==NULL) {
spin_unlock_bh(&mfc_unres_lock);
-
-   kfree_skb(skb);
return -ENOBUFS;
}
 
@@ -662,7 +660,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc
spin_unlock_bh(&mfc_unres_lock);
 
kmem_cache_free(mrt_cachep, c);
-   kfree_skb(skb);
return err;
}
 
@@ -677,7 +674,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc
 *  See if we can append the packet
 */
if (c->mfc_un.unres.unresolved.qlen>3) {
-   kfree_skb(skb);
err = -ENOBUFS;
} else {
skb_queue_tail(&c->mfc_un.unres.unresolved,skb);
@@ -1375,6 +1371,7 @@ int ip_mr_input(struct sk_buff *skb)
 */
if (cache==NULL) {
int vif;
+   int err;
 
if (local) {
struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
@@ -1387,15 +1384,13 @@ int ip_mr_input(struct sk_buff *skb)
}
 
vif = ipmr_find_vif(skb->dev);
-   if (vif >= 0) {
-   int err = ipmr_cache_unresolved(vif, skb);
-   read_unlock(&mrt_lock);
-
-   return err;
-   }
+   err = -ENODEV;
+   if (vif >= 0)
+   err = ipmr_cache_unresolved(vif, skb);
read_unlock(&mrt_lock);
-   kfree_skb(skb);
-   return -ENODEV;
+   if (err)
+   kfree_skb(skb);
+   return err;
}
 
ip_mr_forward(skb, cache, local);
@@ -1568,6 +1563,17 @@ rtattr_failure:
return -EMSGSIZE;
 }
 
+/**
+ * ipmr_get_route - return mrouting information or queue for resolution
+ * @skb - netlink skb with partially complete rtnelink information
+ * @rtm - pointer to head of rtmsg
+ * @nowait - if mroute is not in cache, do not resolve, fail with EAGAIN
+ *
+ * Return:
+ *   1 - if mrouting information is succesfully recorded in skb
+ *   0 - if information is not available, but skb is queued for resolution
+ *   < 0 - error
+ */
 int ipmr_get_route(struct sk_buff *skb, struct rtmsg *rtm, int nowait)
 {
int err;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 2dc6dbb..62f53e9 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2703,16 +2703,12 @@ static int rt_fill_info(struct sk_buff *
if (MULTICAST(dst) && !LOCAL_MCAST(dst) &&
ipv4_devconf.mc_forwarding) {
int err = ipmr_get_route(skb, r, nowait);
-   if (err <= 0) {
-   if (!nowait) {
-   if (err == 0)
-   return 0;
+   if (err == 0)
+   return 0;
+   if (err < 0) {
+   if (err == -EMSGSIZE)
goto nlmsg_failure;
-   } else {
-   if (err == -EMSGSIZE)
-   goto nlmsg_failure;
-   ((struct 
rta_cacheinfo*)RTA_DATA(eptr))->rta_error = err;
-   }
+   ((struct 
rta_cacheinfo*)RTA_DATA(eptr))->rta_error = err;
}
} else
 #endif
@@ -2794,7 +2790,7 @@ int inet_rtm_getroute(struct sk_buff *in
err = rt_fill_info(skb, NETLINK_CB(in_skb).pid, nlh->nlmsg_seq,
RTM_NEWROUTE, 0, 0);
if (!err)
-   goto out_free;
+   goto out;
if (err < 0) {
err = -EMSGSIZE;
goto out_fre

Runtime power management for network interfaces

2006-07-25 Thread Alan Stern
During a Power Management session at the Ottawa Linux Symposium, it was
generally agreed that network interface drivers ought to automatically
suspend their devices (if possible) whenever:

(1) The interface is ifconfig'ed down, or

(2) No link is available.

Presumably (1) should be easy enough to implement.  (2) might or might not
be feasible, depending on how much WOL support is available.  (It might
not be feasible at all for wireless networking.)  Still, there can be no
question that it would be a Good Thing for laptops to power-down their
ethernet controllers when the network cable is unplugged.

Has any progress been made in this direction?  If not, a natural approach 
would be to start with a reference implementation in one driver which 
could then be copied to other drivers.

Any suggestions?

Alan Stern

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2

2006-07-25 Thread Evgeniy Polyakov
On Tue, Jul 25, 2006 at 05:15:04PM +0200, Tomasz Torcz ([EMAIL PROTECTED]) 
wrote:
> BUG: warning at net/core/dev.c:1171/skb_checksum_help()
>  [] show_trace_log_lvl+0x51/0xe6
>  [] show_trace+0xa/0xc
>  [] dump_stack+0x13/0x15
>  [] skb_checksum_help+0x4d/0xeb
>  [] ip_nat_fn+0x47/0x19a

It is not a bug, but remind to update nat helper function.

> -- 
> Tomasz Torcz"Funeral in the morning, IDE hacking
> [EMAIL PROTECTED]in the afternoon and evening." - Alan Cox

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2

2006-07-25 Thread Tomasz Torcz

  Hi,

I just noticed in my logs following messages:

BUG: warning at net/core/dev.c:1171/skb_checksum_help()
 [] show_trace_log_lvl+0x51/0xe6
 [] show_trace+0xa/0xc
 [] dump_stack+0x13/0x15
 [] skb_checksum_help+0x4d/0xeb
 [] ip_nat_fn+0x47/0x19a
 [] ip_nat_local_fn+0x3c/0xba
 [] nf_iterate+0x40/0x60
 [] nf_hook_slow+0x42/0xa2
 [] ip_queue_xmit+0x396/0x3e2
 [] tcp_transmit_skb+0x387/0x3a4
 [] tcp_push_one+0xaf/0xd3
 [] tcp_sendmsg+0x7ce/0x9b3
 [] inet_sendmsg+0x35/0x3f
 [] sock_sendmsg+0xbf/0xd8
 [] sys_sendto+0xe5/0x106
 [] sys_send+0x19/0x1d
 [] sys_socketcall+0xf2/0x19b
 [] syscall_call+0x7/0xb
 [<47bb967e>] 0x47bb967e
BUG: warning at net/core/dev.c:1225/skb_gso_segment()
 [] show_trace_log_lvl+0x51/0xe6
 [] show_trace+0xa/0xc
 [] dump_stack+0x13/0x15
 [] skb_gso_segment+0x88/0x174
 [] dev_gso_segment+0x5c/0x82
 [] dev_hard_start_xmit+0x49/0xaf
 [] __qdisc_run+0x91/0x117
 [] dev_queue_xmit+0x121/0x1ce
 [] ip_output+0x1a8/0x1de

 System is fairly standard x86 with e1000 card. 2.6.18-rc2.
# ethtool -k ep0
Offload parameters for ep0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on


-- 
Tomasz Torcz"Funeral in the morning, IDE hacking
[EMAIL PROTECTED]in the afternoon and evening." - Alan Cox



pgpA2Yg9c5ukp.pgp
Description: PGP signature


Re: [PATCH Round 4 2/3] Core network changes to support network event notification.

2006-07-25 Thread Steve Wise
On Tue, 2006-07-25 at 17:39 +1000, Herbert Xu wrote:
> Steve Wise <[EMAIL PROTECTED]> wrote:
> > 
> > Routing redirect events are broadcast as a pair of rtmsgs, RTM_DELROUTE
> > and RTM_NEWROUTE.
> 
> This may confuse existing rtnetlink users since you're generating an
> RTM_DELROUTE message that's identical to one triggered by something
> like 'ip route del'.
> 

Yea, I didn't really want to create a REDIRECT rtmsg, so I punted. :-)  

But they really are seeing a delete followed by an add.  That's what the
kernel is doing.

> As you're introducing a completely new RTM_ROUTEUPD type, it might
> be better to attach any information from the existing route that you
> need to the ROUTEUPD message.

Yea, the main change is the next hop ip address or gateway field.  

> 
> Actually, what was the reason you need the existing route here?
> 

The rdma driver needs to update all established rdma connections that
are using the next-hop information of the existing route and make them
use the next-hop information of the new route.  In addition, the rdma
driver might have a reference to the old dst entry.  So it can release
that ref and add a ref to the new dst entry.

> > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> > index 5f87533..33d8a83 100644
> > --- a/net/ipv4/fib_semantics.c
> > +++ b/net/ipv4/fib_semantics.c
> > @@ -44,6 +44,7 @@ #include 
> > #include 
> > #include 
> > #include 
> > +#include 
> > 
> > #include "fib_lookup.h"
> > 
> > @@ -279,6 +280,14 @@ void rtmsg_fib(int event, u32 key, struc
> >struct sk_buff *skb;
> >u32 pid = req ? req->pid : n->nlmsg_pid;
> >int size = NLMSG_SPACE(sizeof(struct rtmsg)+256);
> > +   struct netevent_route_info nri;
> > +   int netevent;
> > +
> > +   nri.family = AF_INET;
> > +   nri.data = &fa->fa_info;
> > +   netevent = event == RTM_NEWROUTE ? NETEVENT_ROUTE_ADD 
> > +: NETEVENT_ROUTE_DEL;
> > +   call_netevent_notifiers(netevent, &nri);
> 
> Hmm, this is broken.  These route events are meaningless without the
> corresponding IP rule events.  Are you sure you really want to make
> your hardware/driver grok multiple routing tables?
> 
> Perhaps you should simply stick to dst entries and flush all your
> tables when the routes are changed.  This is what the Linux IP stack
> does.
> 

I have to admit I'm a little fuzzy on the routing stuff.  The main
netevents I've utilized in the the rdma driver I'm writing is the
neighbour update event and the redirect event.  Route add/del was added
for completeness of "routing" netevents.   

Can you expand further or point me to code where the IP stack "flushes
its tables" when routes are changed?

>From my experience, all the rdma driver needs is the dst entry.   It
using the routing table to determine the dst_entry at connection
establish time.  And it needs to know if the next-hop or PMTU ever
changes.



> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 2dc6dbb..18879e6 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1117,6 +1120,52 @@ static void rt_del(unsigned hash, struct
> >spin_unlock_bh(rt_hash_lock_addr(hash));
> > }
> > 
> > +static void rtm_redirect(struct rtable *old, struct rtable *new)
> > +{
> > +   struct netevent_redirect netevent;
> > +   struct sk_buff *skb;
> > +   int err;
> > +
> > +   netevent.old = &old->u.dst;
> > +   netevent.new = &new->u.dst;
> > +
> > +   /* notify netevent subscribers */
> > +   call_netevent_notifiers(NETEVENT_REDIRECT, &netevent);
> > +
> > +   /* Post NETLINK messages:  RTM_DELROUTE for old route, 
> > +  RTM_NEWROUTE for new route */
> > +   skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
> 
> Please use a better size estimate rather than NLMSG_GOODSIZE here since
> you're doing GFP_ATOMIC.
> 

ok

> > @@ -1442,6 +1493,32 @@ unsigned short ip_rt_frag_needed(struct 
> >return est_mtu ? : new_mtu;
> > }
> > 
> > +static void rtm_pmtu_update(struct rtable *rt)
> > +{
> > +   struct sk_buff *skb;
> > +   int err;
> > +
> > +   call_netevent_notifiers(NETEVENT_PMTU_UPDATE, &rt->u.dst);
> > +
> > +   skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
> 
> Ditto.
> 

ok


Thanks,

Steve.


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][IPv4/IPv6] Setting 0 for unused port field. (fwd)

2006-07-25 Thread YOSHIFUJI Hideaki / 吉藤英明
In article <[EMAIL PROTECTED]> (at Tue, 25 Jul 2006 10:45:51 -0400 (EDT)), 
James Morris <[EMAIL PROTECTED]> says:

> The recvmsg() for raw socket seems to return random u16 value
> from the kernel stack memory since port field is not initialized.
> But I'm not sure this patch is correct.
> Does raw socket return any information stored in port field?
> 
> -- Start of patch --
> diff -ur before/net/ipv4/raw.c after/net/ipv4/raw.c
> --- before/net/ipv4/raw.c   2006-06-18 10:49:35.0 +0900
> +++ after/net/ipv4/raw.c2006-07-25 16:15:26.0 +0900
> @@ -609,6 +609,7 @@
> if (sin) {
> sin->sin_family = AF_INET;
> sin->sin_addr.s_addr = skb->nh.iph->saddr;
> +   sin->sin_port = 0;
> memset(&sin->sin_zero, 0, sizeof(sin->sin_zero));
> }
> if (inet->cmsg_flags)

Well, instead, should it be initalized to protocol number, shouldn't it?

--yoshfuji
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][IPv4/IPv6] Setting 0 for unused port field. (fwd)

2006-07-25 Thread James Morris

-- Forwarded message --
Date: Tue, 25 Jul 2006 16:38:05 +0900
From: [EMAIL PROTECTED]
To: linux-kernel@vger.kernel.org
Subject: [PATCH][IPv4/IPv6] Setting 0 for unused port field.

Hello.

The recvmsg() for raw socket seems to return random u16 value
from the kernel stack memory since port field is not initialized.
But I'm not sure this patch is correct.
Does raw socket return any information stored in port field?

-- Start of patch --
diff -ur before/net/ipv4/raw.c after/net/ipv4/raw.c
--- before/net/ipv4/raw.c   2006-06-18 10:49:35.0 +0900
+++ after/net/ipv4/raw.c2006-07-25 16:15:26.0 +0900
@@ -609,6 +609,7 @@
if (sin) {
sin->sin_family = AF_INET;
sin->sin_addr.s_addr = skb->nh.iph->saddr;
+   sin->sin_port = 0;
memset(&sin->sin_zero, 0, sizeof(sin->sin_zero));
}
if (inet->cmsg_flags)
diff -ur before/net/ipv6/raw.c after/net/ipv6/raw.c
--- before/net/ipv6/raw.c   2006-06-18 10:49:35.0 +0900
+++ after/net/ipv6/raw.c2006-07-25 16:16:52.0 +0900
@@ -411,6 +411,7 @@
/* Copy the address. */
if (sin6) {
sin6->sin6_family = AF_INET6;
+   sin6->sin6_port = 0;
ipv6_addr_copy(&sin6->sin6_addr, &skb->nh.ipv6h->saddr);
sin6->sin6_flowinfo = 0;
sin6->sin6_scope_id = 0;
-- End of patch --

Regards.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MTU probing bug?

2006-07-25 Thread John Heffner

David Miller wrote:

John, have a look at this code in tcp_write_timeout():

mss = min(sysctl_tcp_base_mss,
  tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)/2);
mss = max(mss, 68 - tp->tcp_header_len);

That first line looks like it should be a max() instead
of a min().

tcp_base_mss is the smallest MSS we should use, therefore
we should make sure the "mss" is at least that large.

It's also possible that I misread the intention of this code :) From
what I read, it is trying to half the MSS in use and adjust the MTU
search low point to be based upon this new value.


No, the min() is what's intended here.  The base_mss is where you want 
to start searching from.  So, on black hole detection, you drop 
immediatly down to the base.  The base is configurable, because making 
it higher can make searching faster.  If it's still too high for some 
links, you halve it again on successive timeouts.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Networking lock problem. general protection fault: 0000 [#1]

2006-07-25 Thread Ian Stirling
This happened several seconds after plugging in my wireless network card 
- zd1211 - with the driver from http://sourceforge.net/projects/zd1211/


However.
It was after the module was loaded, when my setup script was being run 
for the device - the relevant portion:


ifconfig $device:1 $alternate
route add $router $device:1
route add default gw router

I'm not sure which route command triggered it.
Hope this is the right place.

general protection fault:  [#1]
PREEMPT
Modules linked in: zd1211 toshiba_acpi aes_i586 rd uhci_hcd 
snd_usb_audio snd_usb_lib snd_hwdep usbhid ehci_hcd ohci_hcd pcmcia 
firmware_class yenta_socket rsrc_nonstatic pcmcia_core eepro100 twofish 
tea sha512 sha256 sha1 serpent michael_mic md5 md4 khazad des deflate 
zlib_deflate zlib_inflate crypto_null cast6 cast5 blowfish arc4 
cryptoloop loop sd_mod usb_storage libusual usbcore snd_es1968 
snd_ac97_codec snd_ac97_bus snd_mpu401_uart snd_rawmidi

CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00010286   (2.6.17laptop-3110-firstcut #7)
EIP is at fib_find_info+0x4d/0xed
eax: 0015   ebx:    ecx: c6c75aa0   edx: 
esi: c72421bc   edi: c6c75a9c   ebp: c7242160   esp: c21a9dac
ds: 007b   es: 007b   ss: 0068
Process route (pid: 1856, threadinfo=c21a8000 task=c5a02540)
Stack: 0001 0001    0003  
c21a9ed8
   c72421e0 c7242160 c02f1182 c7242160   c7242160 

    c21a9ed8 bffada60 c7120ca0 c02f219d c21a9e68 c21a9ed8 
c21a9e58

Call Trace:
  fib_create_info+0x232/0x3ec   
fn_hash_insert+0x94/0x3f7

  ip_rt_ioctl+0xb5/0x113   sock_attach_fd+0x72/0xd2
  inet_ioctl+0x34/0x63   sock_ioctl+0x96/0x1b6
  do_ioctl+0x55/0x68   vfs_ioctl+0x59/0x191
  sys_ioctl+0x2b/0x46   syscall_call+0x7/0xb
Code: 55 24 31 d0 89 54 24 10 8b 55 28 31 d0 89 54 24 0c 89 c2 c1 ea 07 
31 c2 c1 e8 0c 31 c2 a1 c0 1e 42 c0 21 ca 8b 1c 90 85 db 74 20 <8b> 03 
89 44 24 08 8d 74 26 00 8b 53 5c 8b 04 24 89 54 24 04 39

EIP: [] fib_find_info+0x4d/0xed SS:ESP 0068:c21a9dac
 BUG: route/1856, lock held at task exit time!
 [c03a04a0] {rtnl_mutex}
.. held by: route: 1856 [c5a02540, 112]
... acquired at:   ip_rt_ioctl+0x5d/0x113
d[2774]: Caught signal 15, un-registering and exiting.

Config is at http://www.mauve.plus.com/config.gz
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[no subject]

2006-07-25 Thread Kunio Murasawa

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to handle kernel paging request, another 2.6.16.25 server reboots

2006-07-25 Thread Chuck Ebbert
In-Reply-To: <[EMAIL PROTECTED]>

On Mon, 24 Jul 2006 12:01:48 +0400, Jim Klimov wrote:

>  I recently wrote about problems with a fileserver rebooting
>  frequently. Another similar server got under NFS load today
>  and rebooted at least twice in the past few hours.
>
>  This server has a similar motherboard (Supermicro X5DP8-G2),
>  dual [EMAIL PROTECTED], two older 3Ware controllers (7506+8506) and
>  a reiserfs v3 archive.
>
>  The server reported last week has two 3Ware 9550 controllers,
>  ext3 archives and primarily a Samba usage.

I decoded your oops.  It's in netfilter:

Unable to handle kernel paging request at virtual address f9445d43
printing eip:
 c0392bba
*pde = 32e59067
Oops:  [#1]
 SMP 
Modules linked in: w83781d hwmon_vid hwmon i2c_isa i2c_core w83627hf_wdt
CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00010282   (2.6.16.25 #3) 
EIP is at ipt_do_table+0xae/0x385
eax: 0003   ebx:    ecx: cbf4b8d8   edx: f944a2c8
esi: e2262940   edi: f9445cf0   ebp: 8000   esp: f700fac8
ds: 007b   es: 007b   ss: 0068
Process nfsd (pid: 10685, threadinfo=f700e000 task=f36a7ab0)
Stack: f9446b10 0282 c33e2180 f36cda80  c047deec f944a2c8 f9418000 
   f7788800 c0530cd4  cbf4b8d8  0003 f700fba0  
   f700fba0 0003 c052f0d8 8000 c03947e7 f7788800 c047dec0  
Call Trace:
 [] ipt_local_out_hook+0x72/0x77
 [] nf_iterate+0x69/0x83
 [] dst_output+0x0/0x7
 [] dst_output+0x0/0x7
 [] nf_hook_slow+0x5d/0xea
 [] dst_output+0x0/0x7
 [] ip_queue_xmit+0x3d4/0x4f5
 [] dst_output+0x0/0x7
 [] __rcu_process_callbacks+0x7d/0xc5
 [] activate_task+0x99/0xa5
 [] try_to_wake_up+0x29c/0x33b
 [] tcp_v4_send_check+0x4a/0xdc
 [] tcp_transmit_skb+0x2e6/0x45a
 [] tcp_push_one+0x97/0x104
 [] tcp_sendmsg+0x36b/0xb4d
 [] nf_iterate+0x69/0x83
 [] tcp_v4_rcv+0x4e6/0x81f
 [] inet_sendmsg+0x47/0x5f
 [] sock_sendmsg+0xc9/0xe3
 [] ip_rcv+0x2bc/0x56f
 [] netif_receive_skb+0x227/0x2d7
 [] __kfree_skb+0x3a/0xc3
 [] autoremove_wake_function+0x0/0x43
 [] update_wall_time_one_tick+0x6/0x7e
 [] update_wall_time+0x8/0x35
 [] timer_interrupt+0x5b/0x86
 [] handle_IRQ_event+0x26/0x59
 [] kernel_sendmsg+0x2e/0x3c
 [] sock_no_sendpage+0x80/0x9f
 [] tcp_sendpage+0x49/0x85
 [] svc_sendto+0x134/0x250
 [] net_rx_action+0x88/0x15f
 [] do_IRQ+0x1e/0x24
 [] common_interrupt+0x1a/0x20
 [] svc_tcp_sendto+0x4d/0x99
 [] _atomic_dec_and_lock+0x33/0x4c
 [] svc_send+0xaa/0xed
 [] fh_put+0x133/0x17d
 [] svcauth_unix_release+0x43/0x45
 [] nfs3svc_release_fhandle+0x0/0xe
 [] svc_process+0x1b1/0x619
 [] default_wake_function+0x0/0xc
 [] nfsd+0x178/0x301
 [] nfsd+0x0/0x301
 [] kernel_thread_helper+0x5/0xb

   6:   8b 40 10  mov0x10(%eax),%eax
   9:   8b 44 86 34   mov0x34(%esi,%eax,4),%eax
   d:   89 44 24 1c   mov%eax,0x1c(%esp)
  11:   89 c7 mov%eax,%edi
  13:   8b 44 24 34   mov0x34(%esp),%eax
  17:   8b 54 24 1c   mov0x1c(%esp),%edx
  1b:   03 7c 86 0c   add0xc(%esi,%eax,4),%edi
  1f:   03 54 86 20   add0x20(%esi,%eax,4),%edx
  23:   89 5c 24 10   mov%ebx,0x10(%esp)
  27:   89 54 24 18   mov%edx,0x18(%esp)
   0:   0f b6 5f 53   movzbl 0x53(%edi),%ebx   <=
   4:   89 d8 mov%ebx,%eax
   6:   24 08 and$0x8,%al
   8:   84 c0 test   %al,%al
   a:   0f 84 b4 02 00 00 je 2c4 <_EIP+0x2c4>
  10:   8b 47 08  mov0x8(%edi),%eax

This is in net/ipv4/netfiler/ip_tables.c::ipt_do_table():

table_base = (void *)private->entries[smp_processor_id()];
e = get_entry(table_base, private->hook_entry[hook]);

/* For return from builtin chain */
back = get_entry(table_base, private->underflow[hook]);

do {
IP_NF_ASSERT(e);
IP_NF_ASSERT(back);
===>if (ip_packet_match(ip, indev, outdev, &e->ip, offset)) {

'e' is an invalid pointer. (ip_packet_match() was inlined.)
hook == 3


The call trace seems to show that svc_tcp_sendto() was interrupted by an
IRQ for an incoming packet, or maybe the timer interrupt?

Can you build with CONFIG_FRAME_POINTERS and see if you can get a cleaner
trace?

-- 
Chuck
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IPROUTE]: Add support for multipath route realms

2006-07-25 Thread Patrick McHardy
Herbert Xu wrote:
> Patrick McHardy <[EMAIL PROTECTED]> wrote:
> 
>>[IPROUTE]: Add support for multipath route realms
>>
>>Routing realms exist per nexthop, but iproute currently only allows to send
>>a single route realm, which is refused by the kernel for multipath routes.
>>Add support for specifying per nexthop realms. Old kernels only return the
>>first realm back to userspace when dumping, so the others can't be displayed,
>>besides that it will also behave correctly on old kernels.
>>
>>old kernel:
>>
>>1.2.3.4 realm 1
>>   nexthop dev dummy0 weight 1
>>   nexthop dev dummy1 weight 1
>>   nexthop dev dummy2 weight 1
>>   nexthop dev dummy3 weight 1
>>
>>new kernel:
>>
>>1.2.3.4
>>   nexthop realm 1 dev dummy0 weight 1
>>   nexthop realm 2 dev dummy1 weight 1
>>   nexthop realm 3 dev dummy2 weight 1
>>   nexthop realm 4 dev dummy3 weight 1
> 
> 
> This really looks like papering over fundamental brokenness of
> IP_ROUTE_MULTIPATH_CACHED since you wouldn't otherwise get these
> entries in the routing cache.
> 
> This reminds me that I better revisit the reasons that people gave
> for actually using IP_ROUTE_MULTIPATH_CACHED the last time we tried
> to get rid of it.


Actually it has nothing to do with MULTIPATH_CACHED, it didn't even
use it in these tests unless I seriously mixed something up. The
realm can be used by netfilter to find out which nexthop was chosen
and bind the connection to this nexthop manually using CONNMARK
and MARK.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IPROUTE]: Add support for multipath route realms

2006-07-25 Thread Herbert Xu
On Tue, Jul 25, 2006 at 06:19:33PM +1000, Herbert Xu wrote:
>
> > new kernel:
> > 
> > 1.2.3.4
> >nexthop realm 1 dev dummy0 weight 1
> >nexthop realm 2 dev dummy1 weight 1
> >nexthop realm 3 dev dummy2 weight 1
> >nexthop realm 4 dev dummy3 weight 1
> 
> This really looks like papering over fundamental brokenness of
> IP_ROUTE_MULTIPATH_CACHED since you wouldn't otherwise get these
> entries in the routing cache.

Nevermind, I misread your changelog.  Your patch is obviously
not related to IP_ROUTE_MULTIPATH_CACHED :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IPROUTE]: Add support for multipath route realms

2006-07-25 Thread Herbert Xu
Patrick McHardy <[EMAIL PROTECTED]> wrote:
> 
> [IPROUTE]: Add support for multipath route realms
> 
> Routing realms exist per nexthop, but iproute currently only allows to send
> a single route realm, which is refused by the kernel for multipath routes.
> Add support for specifying per nexthop realms. Old kernels only return the
> first realm back to userspace when dumping, so the others can't be displayed,
> besides that it will also behave correctly on old kernels.
> 
> old kernel:
> 
> 1.2.3.4 realm 1
>nexthop dev dummy0 weight 1
>nexthop dev dummy1 weight 1
>nexthop dev dummy2 weight 1
>nexthop dev dummy3 weight 1
> 
> new kernel:
> 
> 1.2.3.4
>nexthop realm 1 dev dummy0 weight 1
>nexthop realm 2 dev dummy1 weight 1
>nexthop realm 3 dev dummy2 weight 1
>nexthop realm 4 dev dummy3 weight 1

This really looks like papering over fundamental brokenness of
IP_ROUTE_MULTIPATH_CACHED since you wouldn't otherwise get these
entries in the routing cache.

This reminds me that I better revisit the reasons that people gave
for actually using IP_ROUTE_MULTIPATH_CACHED the last time we tried
to get rid of it.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Herbert Xu
Stephen Hemminger <[EMAIL PROTECTED]> wrote:
>
> @@ -1593,12 +1594,19 @@ int ipmr_get_route(struct sk_buff *skb, 
>read_unlock(&mrt_lock);
>return -ENODEV;
>}
> -   skb->nh.raw = skb_push(skb, sizeof(struct iphdr));
> -   skb->nh.iph->ihl = sizeof(struct iphdr)>>2;
> -   skb->nh.iph->saddr = rt->rt_src;
> -   skb->nh.iph->daddr = rt->rt_dst;
> -   skb->nh.iph->version = 0;
> -   err = ipmr_cache_unresolved(vif, skb);
> +   
> +   iskb = alloc_skb(sizeof(struct iphdr), GFP_KERNEL);
> +   if (!iskb) {
> +   read_unlock(&mrt_lock);
> +   return -ENOMEM;
> +   }
> +   memset(iskb->data, 0, sizeof(struct iphdr));
> +   iskb->nh.raw = iskb->data;
> +   iskb->nh.iph->ihl = sizeof(struct iphdr)>>2;
> +   iskb->nh.iph->saddr = rt->rt_src;
> +   iskb->nh.iph->daddr = rt->rt_dst;
> +
> +   err = ipmr_cache_unresolved(vif, iskb);

I'm afraid this is still broken in a different way.

If ipmr_cache_unresolved queues the skb onto the unresolved list things
it's going to try to use the skb as a netlink skb instead :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-25 Thread Evgeniy Polyakov
On Tue, Jul 25, 2006 at 12:33:44AM -0700, David Miller ([EMAIL PROTECTED]) 
wrote:
> From: Evgeniy Polyakov <[EMAIL PROTECTED]>
> Date: Tue, 25 Jul 2006 10:59:21 +0400
> 
> > As a side completely unrelated to either my or others work note :) - 
> > I think it is a nanooptimisation - we get a bit of performance here, 
> > and lose those bit in other place.
> > When bag is filled, there is no much sence of rearranging some stuff
> > inside to be able to place another one - it is better to buy new bag.
> 
> It is a matter of what the viewpoint is, I suppose.

Definitely.

> I think in this specific case it might turn out to be
> better for the scheduler to respond to what the device
> throws at it, rather than the other way around.  And
> in that case we need no feedback from scheduler to
> cpu demux engine.

That's exactly one bit lose/gain - if CPU is loafing - we get a gain,
and lose otherwise - so instead of generally predictible steady behaviour 
we can end up with bursty shapes.

Actually without real tests all it is just a handwaving, so let's see when
modern NICs get that capability, so network softirq scheduling would be 
changed accordingly.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH Round 4 2/3] Core network changes to support network event notification.

2006-07-25 Thread Herbert Xu
Steve Wise <[EMAIL PROTECTED]> wrote:
> 
> Routing redirect events are broadcast as a pair of rtmsgs, RTM_DELROUTE
> and RTM_NEWROUTE.

This may confuse existing rtnetlink users since you're generating an
RTM_DELROUTE message that's identical to one triggered by something
like 'ip route del'.

As you're introducing a completely new RTM_ROUTEUPD type, it might
be better to attach any information from the existing route that you
need to the ROUTEUPD message.

Actually, what was the reason you need the existing route here?

> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 5f87533..33d8a83 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -44,6 +44,7 @@ #include 
> #include 
> #include 
> #include 
> +#include 
> 
> #include "fib_lookup.h"
> 
> @@ -279,6 +280,14 @@ void rtmsg_fib(int event, u32 key, struc
>struct sk_buff *skb;
>u32 pid = req ? req->pid : n->nlmsg_pid;
>int size = NLMSG_SPACE(sizeof(struct rtmsg)+256);
> +   struct netevent_route_info nri;
> +   int netevent;
> +
> +   nri.family = AF_INET;
> +   nri.data = &fa->fa_info;
> +   netevent = event == RTM_NEWROUTE ? NETEVENT_ROUTE_ADD 
> +: NETEVENT_ROUTE_DEL;
> +   call_netevent_notifiers(netevent, &nri);

Hmm, this is broken.  These route events are meaningless without the
corresponding IP rule events.  Are you sure you really want to make
your hardware/driver grok multiple routing tables?

Perhaps you should simply stick to dst entries and flush all your
tables when the routes are changed.  This is what the Linux IP stack
does.

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 2dc6dbb..18879e6 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1117,6 +1120,52 @@ static void rt_del(unsigned hash, struct
>spin_unlock_bh(rt_hash_lock_addr(hash));
> }
> 
> +static void rtm_redirect(struct rtable *old, struct rtable *new)
> +{
> +   struct netevent_redirect netevent;
> +   struct sk_buff *skb;
> +   int err;
> +
> +   netevent.old = &old->u.dst;
> +   netevent.new = &new->u.dst;
> +
> +   /* notify netevent subscribers */
> +   call_netevent_notifiers(NETEVENT_REDIRECT, &netevent);
> +
> +   /* Post NETLINK messages:  RTM_DELROUTE for old route, 
> +  RTM_NEWROUTE for new route */
> +   skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);

Please use a better size estimate rather than NLMSG_GOODSIZE here since
you're doing GFP_ATOMIC.

> @@ -1442,6 +1493,32 @@ unsigned short ip_rt_frag_needed(struct 
>return est_mtu ? : new_mtu;
> }
> 
> +static void rtm_pmtu_update(struct rtable *rt)
> +{
> +   struct sk_buff *skb;
> +   int err;
> +
> +   call_netevent_notifiers(NETEVENT_PMTU_UPDATE, &rt->u.dst);
> +
> +   skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);

Ditto.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-25 Thread David Miller
From: Evgeniy Polyakov <[EMAIL PROTECTED]>
Date: Tue, 25 Jul 2006 10:59:21 +0400

> As a side completely unrelated to either my or others work note :) - 
> I think it is a nanooptimisation - we get a bit of performance here, 
> and lose those bit in other place.
> When bag is filled, there is no much sence of rearranging some stuff
> inside to be able to place another one - it is better to buy new bag.

It is a matter of what the viewpoint is, I suppose.

I think in this specific case it might turn out to be
better for the scheduler to respond to what the device
throws at it, rather than the other way around.  And
in that case we need no feedback from scheduler to
cpu demux engine.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


MTU probing bug?

2006-07-25 Thread David Miller

John, have a look at this code in tcp_write_timeout():

mss = min(sysctl_tcp_base_mss,
  tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)/2);
mss = max(mss, 68 - tp->tcp_header_len);

That first line looks like it should be a max() instead
of a min().

tcp_base_mss is the smallest MSS we should use, therefore
we should make sure the "mss" is at least that large.

It's also possible that I misread the intention of this code :) From
what I read, it is trying to half the MSS in use and adjust the MTU
search low point to be based upon this new value.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-25 Thread Evgeniy Polyakov
On Mon, Jul 24, 2006 at 11:48:53PM -0700, David Miller ([EMAIL PROTECTED]) 
wrote:
> > And if that CPU is very busy?
> > Linux should somehow tell NIC that some CPUs are valid and some are not
> > right now, but not in a second, so scheduler must be tightly bound with
> > network internals.
> 
> Yes, it is research problem.
> 
> Most of the time, even stateless version will improve things.
> From another viewpoint, even in worst case, it can be no
> worse than current situation. :)
> 
> BTW, such dynamic remapping is provided for in the NDIS interfaces.
> There is an indexing table that is gone through using computed hash to
> get "cpu number".

I think we should force Linux scheduler to export some easily accessed CPU
statistic, so that info might be used by irq layer/protocol processing.

As a side completely unrelated to either my or others work note :) - 
I think it is a nanooptimisation - we get a bit of performance here, 
and lose those bit in other place.
When bag is filled, there is no much sence of rearranging some stuff
inside to be able to place another one - it is better to buy new bag.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html