Re: [BUG] stacktrace from skb_checksum_help() and skb_gso_segment()
Am Mittwoch, 26. Juli 2006 08:44 schrieb Evgeniy Polyakov: > On Wed, Jul 26, 2006 at 08:41:48AM +0200, Rolf Eike Beer ([EMAIL PROTECTED]) wrote: > > linux-2.6 git tree from yesterday. > > > > Before this the sky2 network driver was working. After a pseudo hotplug > > of the device it was working again (at least if you receive this mail > > *g*). > > > > What next? > > It is not a bug, but remind to update NAT helper function. > It has nothing with device drivers and is only printed once. But why was my sky2 down afterwards? There's nothing in the log but this. Eike pgpWGCOuw3dEj.pgp Description: PGP signature
Re: [BUG] stacktrace from skb_checksum_help() and skb_gso_segment()
On Wed, Jul 26, 2006 at 08:41:48AM +0200, Rolf Eike Beer ([EMAIL PROTECTED]) wrote: > linux-2.6 git tree from yesterday. > > Before this the sky2 network driver was working. After a pseudo hotplug of > the > device it was working again (at least if you receive this mail *g*). > > What next? It is not a bug, but remind to update NAT helper function. It has nothing with device drivers and is only printed once. > Eike -- Evgeniy Polyakov - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[BUG] stacktrace from skb_checksum_help() and skb_gso_segment()
linux-2.6 git tree from yesterday. Before this the sky2 network driver was working. After a pseudo hotplug of the device it was working again (at least if you receive this mail *g*). What next? Eike Jul 26 08:22:51 siso-eb-i34d kernel: BUG: warning at /home/beer/repos/linux-2.6/net/core/dev.c:1171/skb_checksum_help() Jul 26 08:22:51 siso-eb-i34d kernel: [] show_trace_log_lvl+0x54/0xfd Jul 26 08:22:51 siso-eb-i34d kernel: [] show_trace+0xd/0x10 Jul 26 08:22:51 siso-eb-i34d kernel: [] dump_stack+0x17/0x1c Jul 26 08:22:51 siso-eb-i34d kernel: [] skb_checksum_help+0x55/0x108 Jul 26 08:22:51 siso-eb-i34d kernel: [] ip_nat_fn+0x43/0x183 [iptable_nat] Jul 26 08:22:51 siso-eb-i34d kernel: [] ip_nat_local_fn+0x36/0xb5 [iptable_nat] Jul 26 08:22:51 siso-eb-i34d kernel: [] nf_iterate+0x2e/0x61 Jul 26 08:22:51 siso-eb-i34d kernel: [] nf_hook_slow+0x37/0x95 Jul 26 08:22:51 siso-eb-i34d kernel: [] ip_queue_xmit+0x362/0x3b2 Jul 26 08:22:51 siso-eb-i34d kernel: [] tcp_transmit_skb+0x5f3/0x615 Jul 26 08:22:51 siso-eb-i34d kernel: [] tcp_push_one+0xb7/0xda Jul 26 08:22:51 siso-eb-i34d kernel: [] tcp_sendmsg+0x786/0x990 Jul 26 08:22:51 siso-eb-i34d kernel: [] inet_sendmsg+0x39/0x46 Jul 26 08:22:51 siso-eb-i34d kernel: [] do_sock_write+0x93/0x9b Jul 26 08:22:51 siso-eb-i34d kernel: [] sock_aio_write+0x56/0x64 Jul 26 08:22:51 siso-eb-i34d kernel: [] do_sync_write+0xaf/0xe4 Jul 26 08:22:51 siso-eb-i34d kernel: [] vfs_write+0x94/0xe8 Jul 26 08:22:51 siso-eb-i34d kernel: [] sys_write+0x3b/0x60 Jul 26 08:22:51 siso-eb-i34d kernel: [] sysenter_past_esp+0x56/0x8d Jul 26 08:22:51 siso-eb-i34d kernel: BUG: warning at /home/beer/repos/linux-2.6/net/core/dev.c:1225/skb_gso_segment() Jul 26 08:22:51 siso-eb-i34d kernel: [] show_trace_log_lvl+0x54/0xfd Jul 26 08:22:51 siso-eb-i34d kernel: [] show_trace+0xd/0x10 Jul 26 08:22:51 siso-eb-i34d kernel: [] dump_stack+0x17/0x1c Jul 26 08:22:51 siso-eb-i34d kernel: [] skb_gso_segment+0x84/0x171 Jul 26 08:22:51 siso-eb-i34d kernel: [] dev_hard_start_xmit+0x175/0x202 Jul 26 08:22:51 siso-eb-i34d kernel: [] __qdisc_run+0xde/0x1a5 Jul 26 08:22:51 siso-eb-i34d kernel: [] dev_queue_xmit+0x13b/0x24c Jul 26 08:22:51 siso-eb-i34d kernel: [] neigh_resolve_output+0x1cf/0x1fb Jul 26 08:22:51 siso-eb-i34d kernel: [] ip_output+0x1c4/0x1ff Jul 26 08:22:51 siso-eb-i34d kernel: [] ip_queue_xmit+0x373/0x3b2 Jul 26 08:22:51 siso-eb-i34d kernel: [] tcp_transmit_skb+0x5f3/0x615 Jul 26 08:22:51 siso-eb-i34d kernel: [] tcp_push_one+0xb7/0xda Jul 26 08:22:51 siso-eb-i34d kernel: [] tcp_sendmsg+0x786/0x990 Jul 26 08:22:51 siso-eb-i34d kernel: [] inet_sendmsg+0x39/0x46 Jul 26 08:22:51 siso-eb-i34d kernel: [] do_sock_write+0x93/0x9b Jul 26 08:22:51 siso-eb-i34d kernel: [] sock_aio_write+0x56/0x64 Jul 26 08:22:51 siso-eb-i34d kernel: [] do_sync_write+0xaf/0xe4 Jul 26 08:22:51 siso-eb-i34d kernel: [] vfs_write+0x94/0xe8 Jul 26 08:22:51 siso-eb-i34d kernel: [] sys_write+0x3b/0x60 Jul 26 08:22:51 siso-eb-i34d kernel: [] sysenter_past_esp+0x56/0x8d pgpYKUjf2m2Ju.pgp Description: PGP signature
Re: async network I/O, event channels, etc
On Tue, Jul 25, 2006 at 03:01:22PM -0700, David Miller ([EMAIL PROTECTED]) wrote: > From: Ulrich Drepper <[EMAIL PROTECTED]> > Date: Tue, 25 Jul 2006 12:23:53 -0700 > > > I was very much surprised by the reactions I got after my OLS talk. > > Lots of people declared interest and even agreed with the approach and > > asked me to do further ahead with all this. For those who missed it, > > the paper and the slides are available on my home page: > > > > http://people.redhat.com/drepper/ > > > > As for the next steps I see a number of possible ways. The discussions > > can be held on the usual mailing lists (i.e., lkml and netdev) but due > > to the raw nature of the current proposal I would imagine that would be > > mainly perceived as noise. > > Since I gave a big thumbs up for Evgivny's kevent work yesterday > on linux-kernel, you might want to start by comparing your work > to his. Because his has the advantage that 1) we have code now > and 2) he has written many test applications and performed many > benchmarks against his code which has flushed out most of the > major implementation issues. > > I think most of the people who have encouraged your work are unaware > of Evgivny's kevent stuff, which is extremely unfortunate, the two > works are more similar than they are different. > > I do not think discussing all of this on netdev would be perceived > as noise. :) Hello David, Ulrich. Here is brief description of what is kevent and how it works. Kevent subsystem incorporates several AIO/kqueue design notes and ideas. Kevent can be used both for edge and level notifications. It supports socket notifications (accept, send, recv), network AIO (aio_send(), aio_recv() and aio_sendfile()), inode notifications (create/remove), generic poll()/select() notifications and timer notifications. There are several object in the kevent system: storage - each source of events (socket, inode, timer, aio, anything) has structure kevent_storage incorporated into it, which is basically a list of registered interests for this source of events. user - it is abstraction which holds all requested kevents. It is similar to FreeBSD's kqueue. kevent - set of interests for given source of events or storage. When kevent is queued into storage, it will live there until removed by kevent_dequeue(). When some activity is noticed in given storage, it scans it's kevent_storage->list for kevents which match activity event. If kevents are found and they are not already in the kevent_user->ready_list, they will be added there at the end. ioctl(WAIT) (or appropriate syscall) will wait until either requested number of kevents are ready or timeout elapsed or at least one kevent is ready, it's behaviour depends on parameters. It is possible to have one-shot kevents, which are automatically removed when are ready. Any event can be added/removed/modified by ioctl or special controlling syscall. Network AIO is based on kevent and works as usual kevent storage on top of inode. When new socket is created it is associated with that inode and when some activity is detected appropriate notifications are generated and kevent_naio_callback() is called. When new kevent is being registered, network AIO ->enqueue() callback simply marks itself like usual socket event watcher. It also locks physical userspace pages in memory and stores appropriate pointers in private kevent structure. I have not created additional DMA memory allocation methods, like Ulrich described in his article, so I handle it inside NAIO which has some overhead (I posted get_user_pages() sclability graph some time ago). Network AIO callback gets pointers to userspace pages and tries to copy data from receiving skb queue into them using protocol specific callback. This callback is very similar to ->recvmsg(), so they could share a lot in future (as far as I recall it worked only with hardware capable to do checksumming, I'm a bit lazy). Both network and aio implementation work on top of hooks inside appropriate state machines, but not as repeated call design (curect AIO) or special thread (SGI AIO). AIO work was stopped, since I was unable to achieve the same speed as synchronous read (maximum speeds were 2Gb/sec vs. 2.1 GB/sec for aio and sync IO accordingly when reading data from the cache). Network aio_sendfile() works lazily - it asynchronously populates pages into the VFS cache (which can be used for various tricks with adaptive readahead) and then uses usual ->sendfile() callback. I have not created an interface for userspace events (like Solaris), since right now I do not see it's usefullness, but if there is requirements for that it is quite easy with kevents. I'm preparing set of kevent patches resend (with cleanups mentioned in previous e-mails), which will be ready in a couple of moments. 1. kevent homepage. http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent 2. network aio homepage. http://tservice.net.ru/~s0mbre/
Re: [PATCH] kthread: airo.c
On Mon, 24 Jul 2006 11:13:09 -0700 Sukadev Bhattiprolu <[EMAIL PROTECTED]> wrote: > Sukadev Bhattiprolu [EMAIL PROTECTED] wrote: > > | Andrew, > | > | Javier Achirica, one of the major contributors to > drivers/net/wireless/airo.c > | took a look at this patch, and doesn't have any problems with it. It doesn't > | fix any bugs and is just a cleanup, so it certainly isn't a candidate > | for this mainline cycle > > Here is the same patch, merged up to 2.6.18-rc2. Christoph's patch (see > http://lkml.org/lkml/2006/7/13/332) still applies cleanly on top of this. > > - > The airo driver is currently caching a pid for later use, but with the > implementation of containers, pids themselves do not uniquely identify > a task. The driver is also using kernel_thread() which is deprecated in > drivers. > > This patch essentially replaces the kernel_thread() with kthread_create(). > It also stores the task_struct of the airo_thread rather than its pid. > Since this introduces a second task_struct in struct airo_info, the patch > renames airo_info.task to airo_info.list_bss_task. > > As an extension of these changes, the patch further: > > - replaces kill_proc() with kthread_stop() > - replaces signal_pending() with kthread_should_stop() >- removes thread completion synchronisation which is handled by > kthread_stop(). > > .. > > @@ -1736,9 +1736,9 @@ static int readBSSListRid(struct airo_in > issuecommand(ai, &cmd, &rsp); > up(&ai->sem); > /* Let the command take effect */ > - ai->task = current; > + ai->list_bss_task = current; > ssleep(3); > - ai->task = NULL; > + ai->list_bss_task = NULL; This looks a little racy to me. It's relatively benign - a race will cause us to sleep for too long. But it's easy to fix: --- a/drivers/net/wireless/airo.c~kthread-airoc-race-fix +++ a/drivers/net/wireless/airo.c @@ -1733,10 +1733,10 @@ static int readBSSListRid(struct airo_in cmd.cmd=CMD_LISTBSS; if (down_interruptible(&ai->sem)) return -ERESTARTSYS; + ai->list_bss_task = current; issuecommand(ai, &cmd, &rsp); up(&ai->sem); /* Let the command take effect */ - ai->list_bss_task = current; ssleep(3); ai->list_bss_task = NULL; } _ Actually, ssleep() ends up doing while (timeout) timeout = schedule_timeout_uninterruptible(timeout); so if the intent of this code is to terminate the sleep early, when the interrupt has happened then it isn't working right. A fix would be to convert the ssleep(3) into schedule_timeout_uninterruptible(3 * HZ). - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2
On Wed, Jul 26, 2006 at 06:01:40AM +0200, Patrick McHardy wrote: > > Please send it, I'll update my patch based on that. Thanks. Here it is, it sits on top of commit ca6bb5d7ab22ac79f608fe6cbc6b12de6a5a19f0 Author: David Woodhouse <[EMAIL PROTECTED]> Date: Thu Jun 22 16:07:52 2006 -0700 [NET]: Require CAP_NET_ADMIN to create tuntap devices. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- 61a015eb86469404587e910e9b852fc35ce436b8 diff --git a/drivers/atm/he.c b/drivers/atm/he.c index fde9334..601e7ee 100644 --- a/drivers/atm/he.c +++ b/drivers/atm/he.c @@ -1913,7 +1913,7 @@ #endif skb->tail = skb->data + skb->len; #ifdef USE_CHECKSUM_HW if (vcc->vpi == 0 && vcc->vci >= ATM_NOT_RSV_VCI) { - skb->ip_summed = CHECKSUM_HW; + skb->ip_summed = CHECKSUM_COMPLETE; skb->csum = TCP_CKSUM(skb->data, he_vcc->pdu_len); } diff --git a/drivers/net/3c59x.c b/drivers/net/3c59x.c index e277789..15dcd4e 100644 --- a/drivers/net/3c59x.c +++ b/drivers/net/3c59x.c @@ -2243,7 +2243,7 @@ boomerang_start_xmit(struct sk_buff *skb vp->tx_ring[entry].next = 0; #if DO_ZEROCOPY - if (skb->ip_summed != CHECKSUM_HW) + if (skb->ip_summed != CHECKSUM_PARTIAL) vp->tx_ring[entry].status = cpu_to_le32(skb->len | TxIntrUploaded); else vp->tx_ring[entry].status = cpu_to_le32(skb->len | TxIntrUploaded | AddTCPChksum | AddUDPChksum); diff --git a/drivers/net/8139cp.c b/drivers/net/8139cp.c index ad0c8c3..4f566d8 100644 --- a/drivers/net/8139cp.c +++ b/drivers/net/8139cp.c @@ -809,7 +809,7 @@ #endif if (mss) flags |= LargeSend | ((mss & MSSMask) << MSSShift); - else if (skb->ip_summed == CHECKSUM_HW) { + else if (skb->ip_summed == CHECKSUM_PARTIAL) { const struct iphdr *ip = skb->nh.iph; if (ip->protocol == IPPROTO_TCP) flags |= IPCS | TCPCS; @@ -863,7 +863,7 @@ #endif if (mss) ctrl |= LargeSend | ((mss & MSSMask) << MSSShift); - else if (skb->ip_summed == CHECKSUM_HW) { + else if (skb->ip_summed == CHECKSUM_PARTIAL) { if (ip->protocol == IPPROTO_TCP) ctrl |= IPCS | TCPCS; else if (ip->protocol == IPPROTO_UDP) @@ -894,7 +894,7 @@ #endif txd->addr = cpu_to_le64(first_mapping); wmb(); - if (skb->ip_summed == CHECKSUM_HW) { + if (skb->ip_summed == CHECKSUM_PARTIAL) { if (ip->protocol == IPPROTO_TCP) txd->opts1 = cpu_to_le32(first_eor | first_len | FirstFrag | DescOwn | diff --git a/drivers/net/acenic.c b/drivers/net/acenic.c index 23ff22b..3ab0e76 100644 --- a/drivers/net/acenic.c +++ b/drivers/net/acenic.c @@ -2041,7 +2041,7 @@ static void ace_rx_int(struct net_device */ if (bd_flags & BD_FLG_TCP_UDP_SUM) { skb->csum = htons(csum); - skb->ip_summed = CHECKSUM_HW; + skb->ip_summed = CHECKSUM_COMPLETE; } else { skb->ip_summed = CHECKSUM_NONE; } @@ -2512,7 +2512,7 @@ restart: mapping = ace_map_tx_skb(ap, skb, skb, idx); flagsize = (skb->len << 16) | (BD_FLG_END); - if (skb->ip_summed == CHECKSUM_HW) + if (skb->ip_summed == CHECKSUM_PARTIAL) flagsize |= BD_FLG_TCP_UDP_SUM; #if ACENIC_DO_VLAN if (vlan_tx_tag_present(skb)) { @@ -2535,7 +2535,7 @@ #endif mapping = ace_map_tx_skb(ap, skb, NULL, idx); flagsize = (skb_headlen(skb) << 16); - if (skb->ip_summed == CHECKSUM_HW) + if (skb->ip_summed == CHECKSUM_PARTIAL) flagsize |= BD_FLG_TCP_UDP_SUM; #if ACENIC_DO_VLAN if (vlan_tx_tag_present(skb)) { @@ -2561,7 +2561,7 @@ #endif PCI_DMA_TODEVICE); flagsize = (frag->size << 16); - if (skb->ip_summed == CHECKSUM_HW) + if (skb->ip_summed == CHECKSUM_PARTIAL) flagsize |= BD_FLG_TCP_UDP_SUM;
Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2
Herbert Xu wrote: > Hi Patrick: > > On Wed, Jul 26, 2006 at 05:38:07AM +0200, Patrick McHardy wrote: > >>I have a patch which changes netfilter to do incremental checksumming. >>The hook number is passed to all functions doing this so they know >>how to update the checksum. Could you explain how >>CHECKSUM_COMPLETE/CHECKSUM_PARTIAL are going to be used? I assume >>they're meant to avoid passing hook numbers around everywhere? > > > Yes the hook number is another way to solve the same problem. However, > it can only be used within netfilter. CHECKSUM_COMPLETE/CHECKSUM_PARTIAL > on the other hand are valid throughout the stack. With Xen feeding Linux > packets into the stack the netfilter hook is also no longer sufficient to > distinguish between these two cases as partial checksum packets can now > appear on receive. > > The problem is that you need to do different incremental updates depending > on whether the checksum is complete (i.e., CHECKSUM_HW on receive), or > partial (i.e., CHECKSUM_HW on transmit). > > With complete checksums the current update code in netfilter can be used > as is. With partial checksums you need to exclude bits which weren't > used when computing the partial checksums (e.g., TCP port numbers need > to be excluded, but the IP address needs to be included for NAT). That does sound better than the hook number approach. > I have a patch that adds CHECKSUM_COMPLETE/CHECKSUM_PARTIAL if you want > something to work from. Let me know if you want this and I'll bounce it > to you. Please send it, I'll update my patch based on that. Thanks. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2
Hi Patrick: On Wed, Jul 26, 2006 at 05:38:07AM +0200, Patrick McHardy wrote: > > I have a patch which changes netfilter to do incremental checksumming. > The hook number is passed to all functions doing this so they know > how to update the checksum. Could you explain how > CHECKSUM_COMPLETE/CHECKSUM_PARTIAL are going to be used? I assume > they're meant to avoid passing hook numbers around everywhere? Yes the hook number is another way to solve the same problem. However, it can only be used within netfilter. CHECKSUM_COMPLETE/CHECKSUM_PARTIAL on the other hand are valid throughout the stack. With Xen feeding Linux packets into the stack the netfilter hook is also no longer sufficient to distinguish between these two cases as partial checksum packets can now appear on receive. The problem is that you need to do different incremental updates depending on whether the checksum is complete (i.e., CHECKSUM_HW on receive), or partial (i.e., CHECKSUM_HW on transmit). With complete checksums the current update code in netfilter can be used as is. With partial checksums you need to exclude bits which weren't used when computing the partial checksums (e.g., TCP port numbers need to be excluded, but the IP address needs to be included for NAT). I have a patch that adds CHECKSUM_COMPLETE/CHECKSUM_PARTIAL if you want something to work from. Let me know if you want this and I'll bounce it to you. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Is the qla3xxx driver in the mainline?
Hi Ron, The qla3xxx driver is in the -mm tree. For the mainline inclusion, Jeff seems ready to accept the driver: http://marc.theaimsgroup.com/?l=linux-netdev&m=115334172810775&w=2 Just wondering if the qla3xxx driver is in the mainline yet? (I am curious to have some performance comparison of qla3xxx + open iscsi v.s. qla4xxx + on board TOE/iscsi.) -- albert - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH Round 4 2/3] Core network changes to support network event notification.
On Tue, Jul 25, 2006 at 10:05:40AM -0500, Steve Wise wrote: > > But they really are seeing a delete followed by an add. That's what the > kernel is doing. Actually that's the other thing I don't really like. The user-space monitor may perceive that a route was actually deleted and replaced by a new one even though this isn't what's happening at all. In fact the problem here is that you're sending route notifications when it's really the dst_entry that's changing. User-space as it stands only get notifications about fib changes which is quite different from changes to the transient dst_entry objects which only exist in the route cache. Is anyone actually going to use the user-space interface of this? If not perhaps we should wait until someone really needs it before adding the netlink part of the patch. We can change the kernel interface at will so if we make a mistake with netevent it can be easily corrected. For user-space though the rules are totally different. I'd really hate to be stuck with an interface which turns out to not be the one that people actually want to have. > The rdma driver needs to update all established rdma connections that > are using the next-hop information of the existing route and make them > use the next-hop information of the new route. In addition, the rdma > driver might have a reference to the old dst entry. So it can release > that ref and add a ref to the new dst entry. Do you really need the old route for the user-space part of your patch? > I have to admit I'm a little fuzzy on the routing stuff. The main > netevents I've utilized in the the rdma driver I'm writing is the > neighbour update event and the redirect event. Route add/del was added > for completeness of "routing" netevents. So you mean you aren't going to use the route notifications? In that case we should probably just drop them and add them when someone actually needs it. At that point they can tell us what semantics they want from it :) > Can you expand further or point me to code where the IP stack "flushes > its tables" when routes are changed? Grep for rt_cache_flush in net/ipv4/fib_hash.c. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2
Herbert Xu wrote: > Evgeniy Polyakov <[EMAIL PROTECTED]> wrote: > >>It is not a bug, but remind to update nat helper function. > > > Yes, I need to add CHECKSUM_COMPLETE vs. CHECKSUM_PARTIAL first so that > we actually know which is which in NAT. I have a patch which changes netfilter to do incremental checksumming. The hook number is passed to all functions doing this so they know how to update the checksum. Could you explain how CHECKSUM_COMPLETE/CHECKSUM_PARTIAL are going to be used? I assume they're meant to avoid passing hook numbers around everywhere? - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2
Evgeniy Polyakov <[EMAIL PROTECTED]> wrote: > On Tue, Jul 25, 2006 at 05:15:04PM +0200, Tomasz Torcz ([EMAIL PROTECTED]) > wrote: >> BUG: warning at net/core/dev.c:1171/skb_checksum_help() >> [] show_trace_log_lvl+0x51/0xe6 >> [] show_trace+0xa/0xc >> [] dump_stack+0x13/0x15 >> [] skb_checksum_help+0x4d/0xeb >> [] ip_nat_fn+0x47/0x19a > > It is not a bug, but remind to update nat helper function. Yes, I need to add CHECKSUM_COMPLETE vs. CHECKSUM_PARTIAL first so that we actually know which is which in NAT. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ip multicast route bug fix
Alexey Kuznetsov <[EMAIL PROTECTED]> wrote: > > I think you mean this. > > Note, it is real skb_clone(), not alloc_skb(). Equeued skb contains > the whole half-prepared netlink message plus room for the rest. > It could be also skb_copy(), if we want to be puristic about mangling > cloned data, but original copy is really not going to be used. I like this. However, since the cloned skb is either discarded in case of error, or queued in which case the caller discards its reference right away, wouldn't it be simpler to just do this? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index ba33f86..0a2af08 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1593,6 +1593,7 @@ int ipmr_get_route(struct sk_buff *skb, read_unlock(&mrt_lock); return -ENODEV; } + skb_get(skb); skb->nh.raw = skb_push(skb, sizeof(struct iphdr)); skb->nh.iph->ihl = sizeof(struct iphdr)>>2; skb->nh.iph->saddr = rt->rt_src; - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MTU probing bug?
David Miller wrote: I find it interesting that black hole detection is handled different from a normal probe failure. I guess here we are dealing with a more significant failure, so we should start at the thing which is most guarenteed to work. Black hole detection is substantially different than probe failure. It's an indication that your current effective MTU is too large, and must be reduced. A black hole basically means that your current path information is no good, either because your initial values are wrong, or the path changed. So, you really want to reset everything to conservative values. After a failed probe, your effective MTU is still fine, you've just gotten some additional information about the path (that it doesn't support the higher MTU). -John - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Netchannel: default, find, add to socket
From: Kelly Daly <[EMAIL PROTECTED]> Date: Tue, 11 Jul 2006 16:17:56 +1000 > Implement finding of correct netchannel for buffer, default netchannel and > attach a netchannel to a socket > > Signed-off-by: Kelly Daly <[EMAIL PROTECTED]> Kelly, I want to apply this, but your email client has busted up the patch by placing new lines all over. It also has turned tabs into spaces, which also corrupts the patch. Can you configure KMail not to corrupt the patch like this and resubmit? Thanks. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][IPv4/IPv6] Setting 0 for unused port field.
Tetsuo-san can you please use a correct "From:" field in your patch postings? Thank you. This will allow me to form a correct attribution when I apply your patches in the future. This time I had to perform a lengthy web search to find who is behind these strange [EMAIL PROTECTED] email addresses :-(( Using these per-listname email addresses really makes things painful for other developers, even though it might simplify things for you. Thanks again. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][IPv4/IPv6] Setting 0 for unused port field. (fwd)
In article <[EMAIL PROTECTED]> (at Tue, 25 Jul 2006 16:54:36 -0700 (PDT)), David Miller <[EMAIL PROTECTED]> says: > > Well, instead, should it be initalized to protocol number, shouldn't it? > > Initially, this was my reaction too. But, aparently it is defined > to be zero, from TCP/IP Illustrated Volume 2, page 1055, which is : Okay, Acked. Acked-by: YOSHIFUJI Hideaki <[EMAIL PROTECTED]> --yoshfuji - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][IPv4/IPv6] Setting 0 for unused port field. (fwd)
From: YOSHIFUJI Hideaki <[EMAIL PROTECTED]> Date: Tue, 25 Jul 2006 23:53:40 +0900 (JST) > In article <[EMAIL PROTECTED]> (at Tue, 25 Jul 2006 10:45:51 -0400 (EDT)), > James Morris <[EMAIL PROTECTED]> says: > > > The recvmsg() for raw socket seems to return random u16 value > > from the kernel stack memory since port field is not initialized. > > But I'm not sure this patch is correct. > > Does raw socket return any information stored in port field? > > > > -- Start of patch -- > > diff -ur before/net/ipv4/raw.c after/net/ipv4/raw.c > > --- before/net/ipv4/raw.c 2006-06-18 10:49:35.0 +0900 > > +++ after/net/ipv4/raw.c2006-07-25 16:15:26.0 +0900 > > @@ -609,6 +609,7 @@ > > if (sin) { > > sin->sin_family = AF_INET; > > sin->sin_addr.s_addr = skb->nh.iph->saddr; > > + sin->sin_port = 0; > > memset(&sin->sin_zero, 0, sizeof(sin->sin_zero)); > > } > > if (inet->cmsg_flags) > > Well, instead, should it be initalized to protocol number, shouldn't it? Initially, this was my reaction too. But, aparently it is defined to be zero, from TCP/IP Illustrated Volume 2, page 1055, which is discussing rip_input() in the BSD stack: Unlike UDP, there is no concept of a port number in raw IP, so the sin_port field in the sockaddr_in structures is always 0. The BSD code does not explicitly set it to zero. But it doesn't have to because it uses "ripsrc" which is a static variable in the BSS section of the BSD kernel image. So this patch appears correct and I will apply it. Thanks everyone. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ip multicast route bug fix
From: Alexey Kuznetsov <[EMAIL PROTECTED]> Date: Wed, 26 Jul 2006 01:17:25 +0400 > Hello! > > > Wouldn't it be better to have a consistent interface (skb always freed), > > and clone the skb if needed for deferred processing? > > I think you mean this. > > Note, it is real skb_clone(), not alloc_skb(). Equeued skb contains > the whole half-prepared netlink message plus room for the rest. > It could be also skb_copy(), if we want to be puristic about mangling > cloned data, but original copy is really not going to be used. Applied, thanks Alexey. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] tg3: Update version and reldate
From: "Michael Chan" <[EMAIL PROTECTED]> Date: Tue, 25 Jul 2006 15:56:26 -0700 > Update version to 3.63. > > Signed-off-by: Michael Chan <[EMAIL PROTECTED]> Also applied, thanks again. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] tg3: Handle tg3_init_rings() failures
From: "Michael Chan" <[EMAIL PROTECTED]> Date: Tue, 25 Jul 2006 15:56:20 -0700 > Handle dev_alloc_skb() failures when initializing the RX rings. > Without proper handling, the driver will crash when using a partial > ring. > > Thanks to Stephane Doyon <[EMAIL PROTECTED]> for reporting the bug and > providing the initial patch. > > Howie Xu <[EMAIL PROTECTED]> also reported the same issue. > > Signed-off-by: Michael Chan <[EMAIL PROTECTED]> Applied, thanks a lot. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] tg3: Add tg3_restart_hw()
From: "Michael Chan" <[EMAIL PROTECTED]> Date: Tue, 25 Jul 2006 15:56:13 -0700 > Add tg3_restart_hw() to handle failures when re-initializing the > device. > > Signed-off-by: Michael Chan <[EMAIL PROTECTED]> Applied, thanks Michael. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] tg3: Update version and reldate
Update version to 3.63. Signed-off-by: Michael Chan <[EMAIL PROTECTED]> diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c index d66b06f..1b8138f 100644 --- a/drivers/net/tg3.c +++ b/drivers/net/tg3.c @@ -68,8 +68,8 @@ #define DRV_MODULE_NAME"tg3" #define PFX DRV_MODULE_NAME": " -#define DRV_MODULE_VERSION "3.62" -#define DRV_MODULE_RELDATE "June 30, 2006" +#define DRV_MODULE_VERSION "3.63" +#define DRV_MODULE_RELDATE "July 25, 2006" #define TG3_DEF_MAC_MODE 0 #define TG3_DEF_RX_MODE0 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] tg3: Add tg3_restart_hw()
Add tg3_restart_hw() to handle failures when re-initializing the device. Signed-off-by: Michael Chan <[EMAIL PROTECTED]> diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c index ce6f3be..1253cec 100644 --- a/drivers/net/tg3.c +++ b/drivers/net/tg3.c @@ -3590,6 +3590,28 @@ static irqreturn_t tg3_test_isr(int irq, static int tg3_init_hw(struct tg3 *, int); static int tg3_halt(struct tg3 *, int, int); +/* Restart hardware after configuration changes, self-test, etc. + * Invoked with tp->lock held. + */ +static int tg3_restart_hw(struct tg3 *tp, int reset_phy) +{ + int err; + + err = tg3_init_hw(tp, reset_phy); + if (err) { + printk(KERN_ERR PFX "%s: Failed to re-initialize device, " + "aborting.\n", tp->dev->name); + tg3_halt(tp, RESET_KIND_SHUTDOWN, 1); + tg3_full_unlock(tp); + del_timer_sync(&tp->timer); + tp->irq_sync = 0; + netif_poll_enable(tp->dev); + dev_close(tp->dev); + tg3_full_lock(tp, 0); + } + return err; +} + #ifdef CONFIG_NET_POLL_CONTROLLER static void tg3_poll_controller(struct net_device *dev) { @@ -3630,13 +3652,15 @@ static void tg3_reset_task(void *_data) } tg3_halt(tp, RESET_KIND_SHUTDOWN, 0); - tg3_init_hw(tp, 1); + if (tg3_init_hw(tp, 1)) + goto out; tg3_netif_start(tp); if (restart_timer) mod_timer(&tp->timer, jiffies + 1); +out: tp->tg3_flags &= ~TG3_FLAG_IN_RESET_TASK; tg3_full_unlock(tp); @@ -4124,6 +4148,7 @@ static inline void tg3_set_mtu(struct ne static int tg3_change_mtu(struct net_device *dev, int new_mtu) { struct tg3 *tp = netdev_priv(dev); + int err; if (new_mtu < TG3_MIN_MTU || new_mtu > TG3_MAX_MTU(tp)) return -EINVAL; @@ -4144,13 +4169,14 @@ static int tg3_change_mtu(struct net_dev tg3_set_mtu(dev, tp, new_mtu); - tg3_init_hw(tp, 0); + err = tg3_restart_hw(tp, 0); - tg3_netif_start(tp); + if (!err) + tg3_netif_start(tp); tg3_full_unlock(tp); - return 0; + return err; } /* Free up pending packets in all rx/tx rings. @@ -5815,6 +5841,7 @@ static int tg3_set_mac_addr(struct net_d { struct tg3 *tp = netdev_priv(dev); struct sockaddr *addr = p; + int err = 0; if (!is_valid_ether_addr(addr->sa_data)) return -EINVAL; @@ -5832,9 +5859,9 @@ static int tg3_set_mac_addr(struct net_d tg3_full_lock(tp, 1); tg3_halt(tp, RESET_KIND_SHUTDOWN, 1); - tg3_init_hw(tp, 0); - - tg3_netif_start(tp); + err = tg3_restart_hw(tp, 0); + if (!err) + tg3_netif_start(tp); tg3_full_unlock(tp); } else { spin_lock_bh(&tp->lock); @@ -5842,7 +5869,7 @@ static int tg3_set_mac_addr(struct net_d spin_unlock_bh(&tp->lock); } - return 0; + return err; } /* tp->lock is held. */ @@ -7956,7 +7983,7 @@ static void tg3_get_ringparam(struct net static int tg3_set_ringparam(struct net_device *dev, struct ethtool_ringparam *ering) { struct tg3 *tp = netdev_priv(dev); - int irq_sync = 0; + int irq_sync = 0, err = 0; if ((ering->rx_pending > TG3_RX_RING_SIZE - 1) || (ering->rx_jumbo_pending > TG3_RX_JUMBO_RING_SIZE - 1) || @@ -7980,13 +8007,14 @@ static int tg3_set_ringparam(struct net_ if (netif_running(dev)) { tg3_halt(tp, RESET_KIND_SHUTDOWN, 1); - tg3_init_hw(tp, 1); - tg3_netif_start(tp); + err = tg3_restart_hw(tp, 1); + if (!err) + tg3_netif_start(tp); } tg3_full_unlock(tp); - return 0; + return err; } static void tg3_get_pauseparam(struct net_device *dev, struct ethtool_pauseparam *epause) @@ -8001,7 +8029,7 @@ static void tg3_get_pauseparam(struct ne static int tg3_set_pauseparam(struct net_device *dev, struct ethtool_pauseparam *epause) { struct tg3 *tp = netdev_priv(dev); - int irq_sync = 0; + int irq_sync = 0, err = 0; if (netif_running(dev)) { tg3_netif_stop(tp); @@ -8025,13 +8053,14 @@ static int tg3_set_pauseparam(struct net if (netif_running(dev)) { tg3_halt(tp, RESET_KIND_SHUTDOWN, 1); - tg3_init_hw(tp, 1); - tg3_netif_start(tp); + err = tg3_restart_hw(tp, 1); + if (!err) + tg3_netif_start(tp); } tg3_full_unlock(tp); - return 0; + return err; } static u32 tg3_get_rx_csum(struct net_device *dev) @@ -8666,7 +8695,9 @@ static int tg3_test_loopback(struct tg3 if (!netif_running(tp->dev))
Re: async network I/O, event channels, etc
On Tue, 2006-07-25 at 15:01 -0700, David Miller wrote: > From: Ulrich Drepper <[EMAIL PROTECTED]> > Date: Tue, 25 Jul 2006 12:23:53 -0700 > > > I was very much surprised by the reactions I got after my OLS talk. > > Lots of people declared interest and even agreed with the approach and > > asked me to do further ahead with all this. For those who missed it, > > the paper and the slides are available on my home page: > > > > http://people.redhat.com/drepper/ > > > > As for the next steps I see a number of possible ways. The discussions > > can be held on the usual mailing lists (i.e., lkml and netdev) but due > > to the raw nature of the current proposal I would imagine that would be > > mainly perceived as noise. > > Since I gave a big thumbs up for Evgivny's kevent work yesterday > on linux-kernel, you might want to start by comparing your work > to his. Because his has the advantage that 1) we have code now > and 2) he has written many test applications and performed many > benchmarks against his code which has flushed out most of the > major implementation issues. > > I think most of the people who have encouraged your work are unaware > of Evgivny's kevent stuff, which is extremely unfortunate, the two > works are more similar than they are different. > > I do not think discussing all of this on netdev would be perceived > as noise. :) While the comparing is going on, how does this compare to Solaris's ports interface? It's documented at http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrir?a=view Also, since we're on the subject, why a whole new interface for event queuing instead of extending the existing io_getevents(2) and friends? -- Nicholas Miell <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] tg3: Handle tg3_init_rings() failures
Handle dev_alloc_skb() failures when initializing the RX rings. Without proper handling, the driver will crash when using a partial ring. Thanks to Stephane Doyon <[EMAIL PROTECTED]> for reporting the bug and providing the initial patch. Howie Xu <[EMAIL PROTECTED]> also reported the same issue. Signed-off-by: Michael Chan <[EMAIL PROTECTED]> diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c index 1253cec..d66b06f 100644 --- a/drivers/net/tg3.c +++ b/drivers/net/tg3.c @@ -4258,7 +4258,7 @@ static void tg3_free_rings(struct tg3 *t * end up in the driver. tp->{tx,}lock are held and thus * we may not sleep. */ -static void tg3_init_rings(struct tg3 *tp) +static int tg3_init_rings(struct tg3 *tp) { u32 i; @@ -4307,18 +4307,38 @@ static void tg3_init_rings(struct tg3 *t /* Now allocate fresh SKBs for each rx ring. */ for (i = 0; i < tp->rx_pending; i++) { - if (tg3_alloc_rx_skb(tp, RXD_OPAQUE_RING_STD, --1, i) < 0) + if (tg3_alloc_rx_skb(tp, RXD_OPAQUE_RING_STD, -1, i) < 0) { + printk(KERN_WARNING PFX + "%s: Using a smaller RX standard ring, " + "only %d out of %d buffers were allocated " + "successfully.\n", + tp->dev->name, i, tp->rx_pending); + if (i == 0) + return -ENOMEM; + tp->rx_pending = i; break; + } } if (tp->tg3_flags & TG3_FLAG_JUMBO_RING_ENABLE) { for (i = 0; i < tp->rx_jumbo_pending; i++) { if (tg3_alloc_rx_skb(tp, RXD_OPAQUE_RING_JUMBO, --1, i) < 0) +-1, i) < 0) { + printk(KERN_WARNING PFX + "%s: Using a smaller RX jumbo ring, " + "only %d out of %d buffers were " + "allocated successfully.\n", + tp->dev->name, i, tp->rx_jumbo_pending); + if (i == 0) { + tg3_free_rings(tp); + return -ENOMEM; + } + tp->rx_jumbo_pending = i; break; + } } } + return 0; } /* @@ -5969,7 +5989,9 @@ static int tg3_reset_hw(struct tg3 *tp, * can only do this after the hardware has been * successfully reset. */ - tg3_init_rings(tp); + err = tg3_init_rings(tp); + if (err) + return err; /* This value is determined during the probe time DMA * engine test, tg3_test_dma. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: async network I/O, event channels, etc
From: Ulrich Drepper <[EMAIL PROTECTED]> Date: Tue, 25 Jul 2006 12:23:53 -0700 > I was very much surprised by the reactions I got after my OLS talk. > Lots of people declared interest and even agreed with the approach and > asked me to do further ahead with all this. For those who missed it, > the paper and the slides are available on my home page: > > http://people.redhat.com/drepper/ > > As for the next steps I see a number of possible ways. The discussions > can be held on the usual mailing lists (i.e., lkml and netdev) but due > to the raw nature of the current proposal I would imagine that would be > mainly perceived as noise. Since I gave a big thumbs up for Evgivny's kevent work yesterday on linux-kernel, you might want to start by comparing your work to his. Because his has the advantage that 1) we have code now and 2) he has written many test applications and performed many benchmarks against his code which has flushed out most of the major implementation issues. I think most of the people who have encouraged your work are unaware of Evgivny's kevent stuff, which is extremely unfortunate, the two works are more similar than they are different. I do not think discussing all of this on netdev would be perceived as noise. :) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can we have GET_NETDEV_DEV?
On Tue, 2006-07-25 at 14:26 -0700, Stephen Hemminger wrote: > On Tue, 25 Jul 2006 17:08:11 -0400 > Pavel Roskin <[EMAIL PROTECTED]> wrote: > > Considering the drivers that are already in the kernel, you may prefer > > to have a more high-level function that would clone the network device > > by copying most of the net_device structure. I think netdev_get_pdev() > > would be mostly used for such cloning if implemented. > > > > What is the wireless tree using for this? Suppose there is a wireless device that can be used on different buses, e.g. PCI and PCMCIA. The driver consists of the common part that works with the chip and the bus specific parts that allocate the resources. The bus specific parts call SET_NETDEV_DEV to associate the network device with the hardware device. Now, suppose the user requests creating a WDS interface. WDS interfaces use 4-address headers, which allows bridging. As it stands now, the way to support a WDS interface is to make is a separate network device. I wish it wouldn't be necessary, but we would need to change quite a few code inside and outside the kernel to understand WDS addressing (I hope to be wrong about that). There is nothing hardware specific about WDS interfaces. They are just extensions of the main interface with a fixed "receive address" and 4-address headers enabled. So it makes sense to create WDS interface in the common (i.e. bus-agnostic) part of the driver. In the (possibly misguided) attempt to make WDS network devices look like real network interfaces, the "class_dev" field is copied from the master network device, because it's something that can be done in the bus-agnostic way: SET_NETDEV_DEV(dev, mdev->class_dev.dev); This is how it's done in HostAP (in the kernel) and MadWifi (outside the kernel). Sure, this value can be stored in the private part of the driver, so it's a minor issue. (Actually, the main (AP or STA) interface is also made a virtual network device, so that the the master interface appears as the parent of the main and the WDS devices and as the single point for monitoring of all wireless traffic.) -- Regards, Pavel Roskin - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MTU probing bug?
From: John Heffner <[EMAIL PROTECTED]> Date: Tue, 25 Jul 2006 09:56:38 -0400 > David Miller wrote: > > John, have a look at this code in tcp_write_timeout(): > > > > mss = min(sysctl_tcp_base_mss, > > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)/2); > > mss = max(mss, 68 - tp->tcp_header_len); > > > > That first line looks like it should be a max() instead > > of a min(). > > > > tcp_base_mss is the smallest MSS we should use, therefore > > we should make sure the "mss" is at least that large. > > > > It's also possible that I misread the intention of this code :) From > > what I read, it is trying to half the MSS in use and adjust the MTU > > search low point to be based upon this new value. > > No, the min() is what's intended here. The base_mss is where you want > to start searching from. So, on black hole detection, you drop > immediatly down to the base. The base is configurable, because making > it higher can make searching faster. If it's still too high for some > links, you halve it again on successive timeouts. Thanks for the clarification John. I find it interesting that black hole detection is handled different from a normal probe failure. I guess here we are dealing with a more significant failure, so we should start at the thing which is most guarenteed to work. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can we have GET_NETDEV_DEV?
On Tue, Jul 25, 2006 at 10:20:05AM -0700, Stephen Hemminger wrote: > On Tue, 25 Jul 2006 00:52:25 -0400 > Pavel Roskin <[EMAIL PROTECTED]> wrote: > > > Hello! > > > > gregkh-driver-network-class_device-to-device.patch, which briefly > > appeared in Linux 2.6.18-rc1-mm1 broke MadWifi, which is copying the > > physical device information from the master network device to the > > virtual network devices: > > > > SET_NETDEV_DEV(dev, mdev->class_dev.dev); > > > > I would rather see SET_NETDEV_DEV go away. It was done for source > compatibility between 2.4 and 2.6 network device drivers. This is > no longer really important. It did make my change much simpler to do, which I appreciated :) > It would be better to either access the network device directly. > BUT, there are plans to get rid of class_dev so that would mean > source changes to all the drivers again... > > So how about these wrappers. They look good to me. BTW, I'll be getting the patch back into -mm that does the class_device -> device conversion after I fix the SuSE 10.0 release that doesn't like the change and get that pushed out to users. After that happens, there shouldn't be any userspace problems with this patch. thanks, greg k-h - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can we have GET_NETDEV_DEV?
On Tue, 25 Jul 2006 17:08:11 -0400 Pavel Roskin <[EMAIL PROTECTED]> wrote: > Hello, Stephen! > > On Tue, 2006-07-25 at 10:20 -0700, Stephen Hemminger wrote: > > > So how about these wrappers. > > > +static inline void netdev_set_pdev(struct net_device *dev, struct device > > *pdev) > > +{ > > + dev->class_dev.dev = pdev; > > +} > > + > > +static inline struct device *netdev_get_pdev(struct net_device *dev) > > +{ > > + return dev->class_dev.dev; > > +} > > The positive effect of a macro is that if would allow MadWifi to prepare > in advance by testing if GET_NETDEV_DEV is defined. > > The functions may or may not be useful for the code in the kernel, but > I'll not be able to prepare MadWifi unless I know the kernel version in > which they will appear. > > Considering the drivers that are already in the kernel, you may prefer > to have a more high-level function that would clone the network device > by copying most of the net_device structure. I think netdev_get_pdev() > would be mostly used for such cloning if implemented. > What is the wireless tree using for this? - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ip multicast route bug fix
On Wed, 26 Jul 2006 01:17:25 +0400 Alexey Kuznetsov <[EMAIL PROTECTED]> wrote: > Hello! > > > Wouldn't it be better to have a consistent interface (skb always freed), > > and clone the skb if needed for deferred processing? > > I think you mean this. > > Note, it is real skb_clone(), not alloc_skb(). Equeued skb contains > the whole half-prepared netlink message plus room for the rest. > It could be also skb_copy(), if we want to be puristic about mangling > cloned data, but original copy is really not going to be used. > > Alexey > That is what I was thinking of. Doesn't matter copy or clone, I prefer clone. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ip multicast route bug fix
Hello! > Wouldn't it be better to have a consistent interface (skb always freed), > and clone the skb if needed for deferred processing? I think you mean this. Note, it is real skb_clone(), not alloc_skb(). Equeued skb contains the whole half-prepared netlink message plus room for the rest. It could be also skb_copy(), if we want to be puristic about mangling cloned data, but original copy is really not going to be used. Alexey diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 9ccacf5..85893ee 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1578,6 +1578,7 @@ int ipmr_get_route(struct sk_buff *skb, cache = ipmr_cache_find(rt->rt_src, rt->rt_dst); if (cache==NULL) { + struct sk_buff *skb2; struct net_device *dev; int vif; @@ -1591,12 +1592,18 @@ int ipmr_get_route(struct sk_buff *skb, read_unlock(&mrt_lock); return -ENODEV; } - skb->nh.raw = skb_push(skb, sizeof(struct iphdr)); - skb->nh.iph->ihl = sizeof(struct iphdr)>>2; - skb->nh.iph->saddr = rt->rt_src; - skb->nh.iph->daddr = rt->rt_dst; - skb->nh.iph->version = 0; - err = ipmr_cache_unresolved(vif, skb); + skb2 = skb_clone(skb, GFP_ATOMIC); + if (!skb2) { + read_unlock(&mrt_lock); + return -ENOMEM; + } + + skb2->nh.raw = skb_push(skb2, sizeof(struct iphdr)); + skb2->nh.iph->ihl = sizeof(struct iphdr)>>2; + skb2->nh.iph->saddr = rt->rt_src; + skb2->nh.iph->daddr = rt->rt_dst; + skb2->nh.iph->version = 0; + err = ipmr_cache_unresolved(vif, skb2); read_unlock(&mrt_lock); return err; } - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can we have GET_NETDEV_DEV?
Hello, Stephen! On Tue, 2006-07-25 at 10:20 -0700, Stephen Hemminger wrote: > So how about these wrappers. > +static inline void netdev_set_pdev(struct net_device *dev, struct device > *pdev) > +{ > + dev->class_dev.dev = pdev; > +} > + > +static inline struct device *netdev_get_pdev(struct net_device *dev) > +{ > + return dev->class_dev.dev; > +} The positive effect of a macro is that if would allow MadWifi to prepare in advance by testing if GET_NETDEV_DEV is defined. The functions may or may not be useful for the code in the kernel, but I'll not be able to prepare MadWifi unless I know the kernel version in which they will appear. Considering the drivers that are already in the kernel, you may prefer to have a more high-level function that would clone the network device by copying most of the net_device structure. I think netdev_get_pdev() would be mostly used for such cloning if implemented. -- Regards, Pavel Roskin - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Stop calling phy_stop_interrupts() twice
Prevent phylib from freeing PHY IRQ twice on closing an eth device: phy_disconnect() first calls phy_stop_interrupts(), then it calls phy_stop_machine() which in turn calls phy_stop_interrupts() making the kernel complain on each bootup... Signed-off-by: Sergei Shtylyov <[EMAIL PROTECTED]> Index: linux-2.6/drivers/net/phy/phy.c === --- linux-2.6.orig/drivers/net/phy/phy.c +++ linux-2.6/drivers/net/phy/phy.c @@ -419,9 +419,8 @@ void phy_start_machine(struct phy_device /* phy_stop_machine * - * description: Stops the state machine timer, sets the state to - * UP (unless it wasn't up yet), and then frees the interrupt, - * if it is in use. This function must be called BEFORE + * description: Stops the state machine timer, sets the state to UP + * (unless it wasn't up yet). This function must be called BEFORE * phy_detach. */ void phy_stop_machine(struct phy_device *phydev) @@ -433,9 +432,6 @@ void phy_stop_machine(struct phy_device phydev->state = PHY_UP; spin_unlock(&phydev->lock); - if (phydev->irq != PHY_POLL) - phy_stop_interrupts(phydev); - phydev->adjust_state = NULL; } - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: bandwidth limitation help
On 7/26/06, Piotrowski, Ted P. <[EMAIL PROTECTED]> wrote: Hi, I am new to the mailing list so I'm not sure if anybody reads these, but here goes nothing. I recently read: Linux Advanced Routing & Traffic Control HOWTO and have been trying to test my applications using bandwidth limitation. All the examples described in the HOWTO do not simulate the conditions I need to test my software. What I would like is for my bandwidth limitation to empty my UDP buffer at a given rate. I have tried using a simple TBF to do this, but all that happens is that my application floods the TBF buffer at link speed and the TBF buffer quickly overflows and drops packets. I want the packets to actually stay in the UDP buffer and be emptied at a given rate without modifying my application. I don't know if any of you are familiar with netem, but it can be used in conjuction with tc to add delay to a link. Surprisingly, packets delayed by netem appear to remain in the UDP buffer until it is time for them to be sent. I would like this same behavior of keeping the packets in the UDP buffer, but with bandwidth limitation on the rate at which the buffer empties, not just packet delay. Has anybody ever done anything like this or can point me to some resources? Have a look at: http://linux-net.osdl.org/index.php/Netem I have written my own test scenarios using examples from this website but I can also send you my small scripts if you want. Ian -- Ian McDonald Web: http://wand.net.nz/~iam4 Blog: http://imcdnzl.blogspot.com WAND Network Research Group Department of Computer Science University of Waikato New Zealand - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
bandwidth limitation help
Hi, I am new to the mailing list so I'm not sure if anybody reads these, but here goes nothing. I recently read: Linux Advanced Routing & Traffic Control HOWTO and have been trying to test my applications using bandwidth limitation. All the examples described in the HOWTO do not simulate the conditions I need to test my software. What I would like is for my bandwidth limitation to empty my UDP buffer at a given rate. I have tried using a simple TBF to do this, but all that happens is that my application floods the TBF buffer at link speed and the TBF buffer quickly overflows and drops packets. I want the packets to actually stay in the UDP buffer and be emptied at a given rate without modifying my application. I don't know if any of you are familiar with netem, but it can be used in conjuction with tc to add delay to a link. Surprisingly, packets delayed by netem appear to remain in the UDP buffer until it is time for them to be sent. I would like this same behavior of keeping the packets in the UDP buffer, but with bandwidth limitation on the rate at which the buffer empties, not just packet delay. Has anybody ever done anything like this or can point me to some resources? Thank you, Ted - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Who maintains the website ?
On 7/26/06, Christophe Devriese <[EMAIL PROTECTED]> wrote: The http://linux-net.osdl.org/index.php/Main_Page website I mean. It's a Wiki so anybody can alter content on the website. The exception to this is that particular page - the main page. If you want something altered on that particular page send email to one of the sysops. -- Ian McDonald Web: http://wand.net.nz/~iam4 Blog: http://imcdnzl.blogspot.com WAND Network Research Group Department of Computer Science University of Waikato New Zealand - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
bandwidth limitation help
Hi, I am new to the mailing list so I'm not sure if anybody reads these, but here goes nothing. I recently read: Linux Advanced Routing & Traffic Control HOWTO and have been trying to test my applications using bandwidth limitation. All the examples described in the HOWTO do not simulate the conditions I need to test my software. What I would like is for my bandwidth limitation to empty my UDP buffer at a given rate. I have tried using a simple TBF to do this, but all that happens is that my application floods the TBF buffer at link speed and the TBF buffer quickly overflows and drops packets. I want the packets to actually stay in the UDP buffer and be emptied at a given rate without modifying my application. I don't know if any of you are familiar with netem, but it can be used in conjuction with tc to add delay to a link. Surprisingly, packets delayed by netem appear to remain in the UDP buffer until it is time for them to be sent. I would like this same behavior of keeping the packets in the UDP buffer, but with bandwidth limitation on the rate at which the buffer empties, not just packet delay. Has anybody ever done anything like this or can point me to some resources? Thank you, Ted P. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RDMA will be reverted
On Mon, 2006-07-24 at 15:23 -0700, David Miller wrote: > From: Tom Tucker <[EMAIL PROTECTED]> > Date: Wed, 05 Jul 2006 12:09:42 -0500 > > > "A TOE net stack is closed source firmware. Linux engineers have no way > > to fix security issues that arise. As a result, only non-TOE users will > > receive security updates, leaving random windows of vulnerability for > > each TOE NIC's users." > > > > - A Linux security update may or may not be relevant to a vendors > > implementation. > > > > - If a vendor's implementation has a security issue then the customer > > must rely on the vendor to fix it. This is no less true for iWARP than > > for any adapter. > > This isn't how things actually work. > > Users have a computer, and they can rightly expect the community > to help them solve problems that occur in the upstream kernel. > > When a bug is found and the person is using NIC X, we don't > necessarily forward the bug report to the vendor of NIC X. > Instead we try to fix the bug. Many chip drivers are maintained > by people who do not work for the company that makes the chip, > and this works just fine. > > If only the chip vendor can fix a security problem, this makes Linux > less agile to fix. Even aspect of a problem on a Linux system that > cannot be fixed entirely by the community is a net negative for Linux. > All true. What I meant to say was that this is "no less true than for any deep adapter". It is incontrovertible that a deep adapter is less flexible, and more difficult to support than a shallow adapter. > > - iWARP needs to do protocol processing in order to validate and > > evaluate TCP payload in advance of direct data placement. This > > requirement is independent of CPU speed. > > Yet, RDMA itself is just an optimization meant to deal with > limitations of cpu and memory speed. You can rephrase the > situation in whatever way suits your argument, but it does not > make the core issue go away :) Yep. > > > - I suspect that connection rates for RDMA adapters fall well-below the > > rates attainable with a dumb device. That said, all of the RDMA > > applications that I know of are not connection intensive. Even for TOE, > > the later HTTP versions makes connection rates less of an issue. > > This is a very naive evaluation of the situation. Yes, newer > versions of protocols such as HTTP make the per-client connection > burdon lower, but the number of clients will increase in time to > more than makeup for whatever gains are seen due to this. Naive is being kind, my HTTP comment is irrelevant :). > And then you have protocols which by design are connection heavy, > and rightly so, such as bittorrent. > > Being able to handle enormous numbers of connections, with extreme > scalability and low latency, is an absolute requirement of any modern > day serious TCP stack. And this requirement is not going away. > Wishing this requirement away due to HTTP persistent connections is > very unrealistic, at best. > > > - This is the problem we're trying to solve...incrementally and > > responsibly. > > You can't. See my email to Roland about why even VJ net channels > are found to be impractical. To support netfilter properly, you > must traverse the whole netfilter stack, because NAT can rewrite > packets, yet still make them destined for the local system, and > thus they will have a different identity for connection demux > by the time the TCP stack sees the packet. > I'm not claiming that all the problems can be solved, I'm suggesting that integration could be better and that partial integration is better than none. > All of these tranformations occur between normal packet receive > and the TCP stack. You would therefore need to put your card > between netfilter and TCP in the packet input path, and at that > point why bother with the stateful card at all? > > The fact is that stateless approaches will always be better than > stateful things because you cannot replicate the functionality we > have in the Linux stack without replicating 10 years of work into > your chip's firmware. At that point you should just run Linux > on your NIC since that is what you are effectively doing :) > I wish...I'd have a better stack. > In conversations such as these, it helps us a lot if you can be frank > and honest about the true absolute limitations of your technology. I'm trying ... classifying these limitations as "core can't fix" and "fixable with integration" is where we're getting crosswise. > I > can see that your viewpoint is tainted when I hear things such as HTTP > persistent connections being used as a reason why high TCP connection > rates won't matter in the future. Such assertions are understood to > be patently false by anyone who understands TCP and how it is used in > the real world. Partial "Fixable with Integration" Summary - ARP Resolution - ICMP Redirect - Path MTU Change - Route Update - Colliding TCP Port Spaces Partial "Can't Fix" Issues Summary:
Who maintains the website ?
The http://linux-net.osdl.org/index.php/Main_Page website I mean. -- -- Christophe Devriese EURiD Network Adminstrator / Developer [EMAIL PROTECTED] --- http://www.eth1.org -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ip multicast route bug fix
Hello! > Wouldn't it be better to have a consistent interface (skb always freed), > and clone the skb if needed for deferred processing? I am sorry, I misunderstood you. I absolutely agree. It is much better, the variant which I suggested is a good sample of bad programming. :-) Alexey - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can we have GET_NETDEV_DEV?
On Tue, 25 Jul 2006 00:52:25 -0400 Pavel Roskin <[EMAIL PROTECTED]> wrote: > Hello! > > gregkh-driver-network-class_device-to-device.patch, which briefly > appeared in Linux 2.6.18-rc1-mm1 broke MadWifi, which is copying the > physical device information from the master network device to the > virtual network devices: > > SET_NETDEV_DEV(dev, mdev->class_dev.dev); > I would rather see SET_NETDEV_DEV go away. It was done for source compatibility between 2.4 and 2.6 network device drivers. This is no longer really important. It would be better to either access the network device directly. BUT, there are plans to get rid of class_dev so that would mean source changes to all the drivers again... So how about these wrappers. --- a/include/linux/netdevice.h 2006-07-24 10:56:59.0 -0700 +++ b/include/linux/netdevice.h 2006-07-25 10:18:12.0 -0700 @@ -535,10 +535,23 @@ } #define SET_MODULE_OWNER(dev) do { } while (0) + /* Set the sysfs physical device reference for the network logical device * if set prior to registration will cause a symlink during initialization. */ -#define SET_NETDEV_DEV(net, pdev) ((net)->class_dev.dev = (pdev)) +static inline void netdev_set_pdev(struct net_device *dev, struct device *pdev) +{ + dev->class_dev.dev = pdev; +} + +static inline struct device *netdev_get_pdev(struct net_device *dev) +{ + return dev->class_dev.dev; +} + + +/* old style macro for compatiablity */ +#define SET_NETDEV_DEV(net, pdev) netdev_set_pdev(net, pdev) struct packet_type { __be16 type; /* This is really htons(ether_type). */ - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Runtime power management for network interfaces
On Tuesday 25 July 2006 8:59 am, Alan Stern wrote: > During a Power Management session at the Ottawa Linux Symposium, it was > generally agreed that network interface drivers ought to automatically > suspend their devices (if possible) whenever: > > (1) The interface is ifconfig'ed down, or > > (2) No link is available. > > Presumably (1) should be easy enough to implement. I can't imagine that there are many network drivers that couldn't put the hardware in some kind of low power mode at that point. All PCI drivers can certainly set_pci_power_state(pdev, PCI_D3hot) and save a certain amount of power. Of course, that means coping with a possible device reset when going to PCI_D0, and I can imagine some devices might prefer to use PCI_D2 rather than reload firmware. Embedded chips probably have an easier time with this; just turn off all the clocks going to that controller and its PHY. > (2) might or might not > be feasible, depending on how much WOL support is available. A network adapter without PHY support in the WOL hardware wouldn't be able to implement (2). And in today's systems, where we can't rely on ACPI+PCI to morph PME# to pci_driver.resume() during runtime suspend, not all platforms can implement (2). Embedded chips probably have an easier time here ... I know for a fact that at91_ether can do that sort of thing easily, at least on boards where the PHY IRQ is hooked up, so that the driver doesn't need to poll link status. The USB analogy here is that _some_ external transceivers (PHY) can report link state changes via IRQs, supporting a mode where everything else is powered off until the link can be active, at which point the controller can come out of its low power state. Those are of course transceivers intended for use in systems with itty bitty batteries. (I keep thinking wakeup event handling is pretty weak in Linux today. As you know, we're still shaking the USB HCDs into shape there, since some system configs have issues. Network adapters seem to be another major use case for wakeup events in Linux, and I get the impression they're not as widely used -- or functional -- as would be good.) > (It might > not be feasible at all for wireless networking.) Still, there can be no > question that it would be a Good Thing for laptops to power-down their > ethernet controllers when the network cable is unplugged. > > Has any progress been made in this direction? If not, a natural approach > would be to start with a reference implementation in one driver which > could then be copied to other drivers. > > Any suggestions? My initial thought was that network drivers could be refactored so that they have ifsuspend() and ifresume() methods to put the hardware into the low power state. The remove(), open() and resume() methods would call ifresume(); probe(), close() and suspend() would call ifsuspend() ... thatid give (1). And for hardware supporting (2) there'd be some housekeeping for the WOL support during suspend() and resume(), in addition to netif_device_{de,at}tach(). For hardware where a PHY can report link state without requiring an active Ethernet controller (e.g. at91_ether for sure) some events could trigger ifresume()/ifsuspend() when the interface is active. - Dave - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ip multicast route bug fix
Hello! > checking tools because the skb lifetime depends on the return value. > Wouldn't it be better to have a consistent interface (skb always freed), > and clone the skb if needed for deferred processing? But skb is not always freed in any case. Normally it is submitted to netlink_unicast(). It is freed only in error case. So, following this logic should not you clone skb to feed to netlink_unicast()? Alexey - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Runtime power management for network interfaces
Alan Stern wrote: During a Power Management session at the Ottawa Linux Symposium, it was generally agreed that network interface drivers ought to automatically suspend their devices (if possible) whenever: (1) The interface is ifconfig'ed down, or (2) No link is available. Presumably (1) should be easy enough to implement. (2) might or might not be feasible, depending on how much WOL support is available. (It might not be feasible at all for wireless networking.) Still, there can be no question that it would be a Good Thing for laptops to power-down their ethernet controllers when the network cable is unplugged. Has any progress been made in this direction? If not, a natural approach would be to start with a reference implementation in one driver which could then be copied to other drivers. Intel's newer e1000's (ich7 onboard e1000 and newer versions for instance) already support this feature partially - the MAC stays on but the PHY can be powered off when no link is present. In order to enable this feature you will need to turn it on explicitly at load time: modprobe e1000 SmartPowerDownEnable=1 Allthough not the entire NIC is powered off, it still saves a significant part of the power consumed by the NIC. All bits help. The power automatically restores once a cable is plugged in. Cheers, Auke - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ip multicast route bug fix
On Tue, 25 Jul 2006 20:20:01 +0400 Alexey Kuznetsov <[EMAIL PROTECTED]> wrote: > Hello! > > > Code was reusing an skb which could lead to use after free or double free. > > No, this does not help. The bug is not here. > > I was so ashamed of this that could not touch the thing. :-) > It startled me a lot, how is it possible that the thing was in production > for several years and such bad bug never was noticed? > > Now it is clear. skbs leaked sometimes before BK changeset 1.889.26.53, > subj: [IPV4]: Fix skb leak in inet_rtm_getroute. But after this fix, > which introduced new bug (goto out_free in the enclosed patch), > the bug showed on surface. > > Please, review this. > > - ipmr_cache_unresolved() does not free skb. Caller does. > - goto out_free in route.c in the case, when skb is enqueued > is returned to goto out. > - some cosmetic cleanup in rt_fill_info() to make it more readable. This looks correct, but it may lead to false reports from automated checking tools because the skb lifetime depends on the return value. Wouldn't it be better to have a consistent interface (skb always freed), and clone the skb if needed for deferred processing? - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RDMA will be reverted
> It may be a hardware interpretation, but doesn't it have non-trivial system > implications - where one runs threads/processes etc? Only if you do process context RX processing. If you chose not to it doesn't have much influence. -Andi - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RDMA will be reverted
David Miller wrote: From: Rick Jones <[EMAIL PROTECTED]> Date: Mon, 24 Jul 2006 17:55:24 -0700 Even enough bits for 1024 or 2048 CPUs in the single system image? I have seen 1024 touted by SGI, and with things going so multi-core, perhaps 16384 while sounding initially bizzare would be in the realm of theoretically possible before to long. Read the RSS NDIS documents from Microsoft. I'll see about hunting them down. You aren't going to want to demux to more than, say, 256 cpus for single network adapter even on the largest machines. I suppose, it just seems to tweak _small_ alarms in my intuition - maybe because it still sounds like networking telling the scheduler where to run threads of execution, and even though I'm a networking guy I seem to have the notion that it should be the other way 'round. That would cover TCP, are there similarly fungible fields in SCTP or other ULPs? And if we were to want to get HW support for the thing, getting it adopted in a de jure standards body would probably be in order :) Microsoft never does this, neither do we. LRO came out of our own design, the network folks found it reasonable and thus they have started to implement it. The same is true for Microsofts RSS stuff. It's a hardware interpretation, therefore it belongs in a driver API specification, nowhere else. It may be a hardware interpretation, but doesn't it have non-trivial system implications - where one runs threads/processes etc? rick jones - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ip multicast route bug fix (rev2)
IP multicast route code was reusing an skb which causes use after free and double free. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- net/ipv4/ipmr.c | 22 -- 1 files changed, 16 insertions(+), 6 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index ba33f86..f5d3d96 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1580,6 +1580,7 @@ int ipmr_get_route(struct sk_buff *skb, cache = ipmr_cache_find(rt->rt_src, rt->rt_dst); if (cache==NULL) { + struct sk_buff *skb2; struct net_device *dev; int vif; @@ -1593,12 +1594,21 @@ int ipmr_get_route(struct sk_buff *skb, read_unlock(&mrt_lock); return -ENODEV; } - skb->nh.raw = skb_push(skb, sizeof(struct iphdr)); - skb->nh.iph->ihl = sizeof(struct iphdr)>>2; - skb->nh.iph->saddr = rt->rt_src; - skb->nh.iph->daddr = rt->rt_dst; - skb->nh.iph->version = 0; - err = ipmr_cache_unresolved(vif, skb); + + skb2 = alloc_skb(sizeof(struct iphdr) + sizeof(struct nlmsghdr), +GFP_ATOMIC); + if (!skb2) { + read_unlock(&mrt_lock); + return -ENOMEM; + } + + memset(skb2->data, 0, sizeof(struct iphdr)); + skb2->nh.raw = skb2->data; + skb2->nh.iph->ihl = sizeof(struct iphdr)>>2; + skb2->nh.iph->saddr = rt->rt_src; + skb2->nh.iph->daddr = rt->rt_dst; + + err = ipmr_cache_unresolved(vif, skb2); read_unlock(&mrt_lock); return err; } -- 1.4.0 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ip multicast route bug fix
Hello! > Code was reusing an skb which could lead to use after free or double free. No, this does not help. The bug is not here. I was so ashamed of this that could not touch the thing. :-) It startled me a lot, how is it possible that the thing was in production for several years and such bad bug never was noticed? Now it is clear. skbs leaked sometimes before BK changeset 1.889.26.53, subj: [IPV4]: Fix skb leak in inet_rtm_getroute. But after this fix, which introduced new bug (goto out_free in the enclosed patch), the bug showed on surface. Please, review this. - ipmr_cache_unresolved() does not free skb. Caller does. - goto out_free in route.c in the case, when skb is enqueued is returned to goto out. - some cosmetic cleanup in rt_fill_info() to make it more readable. diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 9ccacf5..1788c04 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -640,8 +640,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc if (atomic_read(&cache_resolve_queue_len)>=10 || (c=ipmr_cache_alloc_unres())==NULL) { spin_unlock_bh(&mfc_unres_lock); - - kfree_skb(skb); return -ENOBUFS; } @@ -662,7 +660,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc spin_unlock_bh(&mfc_unres_lock); kmem_cache_free(mrt_cachep, c); - kfree_skb(skb); return err; } @@ -677,7 +674,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc * See if we can append the packet */ if (c->mfc_un.unres.unresolved.qlen>3) { - kfree_skb(skb); err = -ENOBUFS; } else { skb_queue_tail(&c->mfc_un.unres.unresolved,skb); @@ -1375,6 +1371,7 @@ int ip_mr_input(struct sk_buff *skb) */ if (cache==NULL) { int vif; + int err; if (local) { struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC); @@ -1387,15 +1384,13 @@ int ip_mr_input(struct sk_buff *skb) } vif = ipmr_find_vif(skb->dev); - if (vif >= 0) { - int err = ipmr_cache_unresolved(vif, skb); - read_unlock(&mrt_lock); - - return err; - } + err = -ENODEV; + if (vif >= 0) + err = ipmr_cache_unresolved(vif, skb); read_unlock(&mrt_lock); - kfree_skb(skb); - return -ENODEV; + if (err) + kfree_skb(skb); + return err; } ip_mr_forward(skb, cache, local); @@ -1568,6 +1563,17 @@ rtattr_failure: return -EMSGSIZE; } +/** + * ipmr_get_route - return mrouting information or queue for resolution + * @skb - netlink skb with partially complete rtnelink information + * @rtm - pointer to head of rtmsg + * @nowait - if mroute is not in cache, do not resolve, fail with EAGAIN + * + * Return: + * 1 - if mrouting information is succesfully recorded in skb + * 0 - if information is not available, but skb is queued for resolution + * < 0 - error + */ int ipmr_get_route(struct sk_buff *skb, struct rtmsg *rtm, int nowait) { int err; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 2dc6dbb..62f53e9 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2703,16 +2703,12 @@ static int rt_fill_info(struct sk_buff * if (MULTICAST(dst) && !LOCAL_MCAST(dst) && ipv4_devconf.mc_forwarding) { int err = ipmr_get_route(skb, r, nowait); - if (err <= 0) { - if (!nowait) { - if (err == 0) - return 0; + if (err == 0) + return 0; + if (err < 0) { + if (err == -EMSGSIZE) goto nlmsg_failure; - } else { - if (err == -EMSGSIZE) - goto nlmsg_failure; - ((struct rta_cacheinfo*)RTA_DATA(eptr))->rta_error = err; - } + ((struct rta_cacheinfo*)RTA_DATA(eptr))->rta_error = err; } } else #endif @@ -2794,7 +2790,7 @@ int inet_rtm_getroute(struct sk_buff *in err = rt_fill_info(skb, NETLINK_CB(in_skb).pid, nlh->nlmsg_seq, RTM_NEWROUTE, 0, 0); if (!err) - goto out_free; + goto out; if (err < 0) { err = -EMSGSIZE; goto out_fre
Runtime power management for network interfaces
During a Power Management session at the Ottawa Linux Symposium, it was generally agreed that network interface drivers ought to automatically suspend their devices (if possible) whenever: (1) The interface is ifconfig'ed down, or (2) No link is available. Presumably (1) should be easy enough to implement. (2) might or might not be feasible, depending on how much WOL support is available. (It might not be feasible at all for wireless networking.) Still, there can be no question that it would be a Good Thing for laptops to power-down their ethernet controllers when the network cable is unplugged. Has any progress been made in this direction? If not, a natural approach would be to start with a reference implementation in one driver which could then be copied to other drivers. Any suggestions? Alan Stern - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2
On Tue, Jul 25, 2006 at 05:15:04PM +0200, Tomasz Torcz ([EMAIL PROTECTED]) wrote: > BUG: warning at net/core/dev.c:1171/skb_checksum_help() > [] show_trace_log_lvl+0x51/0xe6 > [] show_trace+0xa/0xc > [] dump_stack+0x13/0x15 > [] skb_checksum_help+0x4d/0xeb > [] ip_nat_fn+0x47/0x19a It is not a bug, but remind to update nat helper function. > -- > Tomasz Torcz"Funeral in the morning, IDE hacking > [EMAIL PROTECTED]in the afternoon and evening." - Alan Cox -- Evgeniy Polyakov - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
BUGs in skb_checksum_help() and skb_gso_segment() in 2.6.18-rc2
Hi, I just noticed in my logs following messages: BUG: warning at net/core/dev.c:1171/skb_checksum_help() [] show_trace_log_lvl+0x51/0xe6 [] show_trace+0xa/0xc [] dump_stack+0x13/0x15 [] skb_checksum_help+0x4d/0xeb [] ip_nat_fn+0x47/0x19a [] ip_nat_local_fn+0x3c/0xba [] nf_iterate+0x40/0x60 [] nf_hook_slow+0x42/0xa2 [] ip_queue_xmit+0x396/0x3e2 [] tcp_transmit_skb+0x387/0x3a4 [] tcp_push_one+0xaf/0xd3 [] tcp_sendmsg+0x7ce/0x9b3 [] inet_sendmsg+0x35/0x3f [] sock_sendmsg+0xbf/0xd8 [] sys_sendto+0xe5/0x106 [] sys_send+0x19/0x1d [] sys_socketcall+0xf2/0x19b [] syscall_call+0x7/0xb [<47bb967e>] 0x47bb967e BUG: warning at net/core/dev.c:1225/skb_gso_segment() [] show_trace_log_lvl+0x51/0xe6 [] show_trace+0xa/0xc [] dump_stack+0x13/0x15 [] skb_gso_segment+0x88/0x174 [] dev_gso_segment+0x5c/0x82 [] dev_hard_start_xmit+0x49/0xaf [] __qdisc_run+0x91/0x117 [] dev_queue_xmit+0x121/0x1ce [] ip_output+0x1a8/0x1de System is fairly standard x86 with e1000 card. 2.6.18-rc2. # ethtool -k ep0 Offload parameters for ep0: rx-checksumming: on tx-checksumming: on scatter-gather: on tcp segmentation offload: on -- Tomasz Torcz"Funeral in the morning, IDE hacking [EMAIL PROTECTED]in the afternoon and evening." - Alan Cox pgpA2Yg9c5ukp.pgp Description: PGP signature
Re: [PATCH Round 4 2/3] Core network changes to support network event notification.
On Tue, 2006-07-25 at 17:39 +1000, Herbert Xu wrote: > Steve Wise <[EMAIL PROTECTED]> wrote: > > > > Routing redirect events are broadcast as a pair of rtmsgs, RTM_DELROUTE > > and RTM_NEWROUTE. > > This may confuse existing rtnetlink users since you're generating an > RTM_DELROUTE message that's identical to one triggered by something > like 'ip route del'. > Yea, I didn't really want to create a REDIRECT rtmsg, so I punted. :-) But they really are seeing a delete followed by an add. That's what the kernel is doing. > As you're introducing a completely new RTM_ROUTEUPD type, it might > be better to attach any information from the existing route that you > need to the ROUTEUPD message. Yea, the main change is the next hop ip address or gateway field. > > Actually, what was the reason you need the existing route here? > The rdma driver needs to update all established rdma connections that are using the next-hop information of the existing route and make them use the next-hop information of the new route. In addition, the rdma driver might have a reference to the old dst entry. So it can release that ref and add a ref to the new dst entry. > > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c > > index 5f87533..33d8a83 100644 > > --- a/net/ipv4/fib_semantics.c > > +++ b/net/ipv4/fib_semantics.c > > @@ -44,6 +44,7 @@ #include > > #include > > #include > > #include > > +#include > > > > #include "fib_lookup.h" > > > > @@ -279,6 +280,14 @@ void rtmsg_fib(int event, u32 key, struc > >struct sk_buff *skb; > >u32 pid = req ? req->pid : n->nlmsg_pid; > >int size = NLMSG_SPACE(sizeof(struct rtmsg)+256); > > + struct netevent_route_info nri; > > + int netevent; > > + > > + nri.family = AF_INET; > > + nri.data = &fa->fa_info; > > + netevent = event == RTM_NEWROUTE ? NETEVENT_ROUTE_ADD > > +: NETEVENT_ROUTE_DEL; > > + call_netevent_notifiers(netevent, &nri); > > Hmm, this is broken. These route events are meaningless without the > corresponding IP rule events. Are you sure you really want to make > your hardware/driver grok multiple routing tables? > > Perhaps you should simply stick to dst entries and flush all your > tables when the routes are changed. This is what the Linux IP stack > does. > I have to admit I'm a little fuzzy on the routing stuff. The main netevents I've utilized in the the rdma driver I'm writing is the neighbour update event and the redirect event. Route add/del was added for completeness of "routing" netevents. Can you expand further or point me to code where the IP stack "flushes its tables" when routes are changed? >From my experience, all the rdma driver needs is the dst entry. It using the routing table to determine the dst_entry at connection establish time. And it needs to know if the next-hop or PMTU ever changes. > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > > index 2dc6dbb..18879e6 100644 > > --- a/net/ipv4/route.c > > +++ b/net/ipv4/route.c > > @@ -1117,6 +1120,52 @@ static void rt_del(unsigned hash, struct > >spin_unlock_bh(rt_hash_lock_addr(hash)); > > } > > > > +static void rtm_redirect(struct rtable *old, struct rtable *new) > > +{ > > + struct netevent_redirect netevent; > > + struct sk_buff *skb; > > + int err; > > + > > + netevent.old = &old->u.dst; > > + netevent.new = &new->u.dst; > > + > > + /* notify netevent subscribers */ > > + call_netevent_notifiers(NETEVENT_REDIRECT, &netevent); > > + > > + /* Post NETLINK messages: RTM_DELROUTE for old route, > > + RTM_NEWROUTE for new route */ > > + skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); > > Please use a better size estimate rather than NLMSG_GOODSIZE here since > you're doing GFP_ATOMIC. > ok > > @@ -1442,6 +1493,32 @@ unsigned short ip_rt_frag_needed(struct > >return est_mtu ? : new_mtu; > > } > > > > +static void rtm_pmtu_update(struct rtable *rt) > > +{ > > + struct sk_buff *skb; > > + int err; > > + > > + call_netevent_notifiers(NETEVENT_PMTU_UPDATE, &rt->u.dst); > > + > > + skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); > > Ditto. > ok Thanks, Steve. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][IPv4/IPv6] Setting 0 for unused port field. (fwd)
In article <[EMAIL PROTECTED]> (at Tue, 25 Jul 2006 10:45:51 -0400 (EDT)), James Morris <[EMAIL PROTECTED]> says: > The recvmsg() for raw socket seems to return random u16 value > from the kernel stack memory since port field is not initialized. > But I'm not sure this patch is correct. > Does raw socket return any information stored in port field? > > -- Start of patch -- > diff -ur before/net/ipv4/raw.c after/net/ipv4/raw.c > --- before/net/ipv4/raw.c 2006-06-18 10:49:35.0 +0900 > +++ after/net/ipv4/raw.c2006-07-25 16:15:26.0 +0900 > @@ -609,6 +609,7 @@ > if (sin) { > sin->sin_family = AF_INET; > sin->sin_addr.s_addr = skb->nh.iph->saddr; > + sin->sin_port = 0; > memset(&sin->sin_zero, 0, sizeof(sin->sin_zero)); > } > if (inet->cmsg_flags) Well, instead, should it be initalized to protocol number, shouldn't it? --yoshfuji - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][IPv4/IPv6] Setting 0 for unused port field. (fwd)
-- Forwarded message -- Date: Tue, 25 Jul 2006 16:38:05 +0900 From: [EMAIL PROTECTED] To: linux-kernel@vger.kernel.org Subject: [PATCH][IPv4/IPv6] Setting 0 for unused port field. Hello. The recvmsg() for raw socket seems to return random u16 value from the kernel stack memory since port field is not initialized. But I'm not sure this patch is correct. Does raw socket return any information stored in port field? -- Start of patch -- diff -ur before/net/ipv4/raw.c after/net/ipv4/raw.c --- before/net/ipv4/raw.c 2006-06-18 10:49:35.0 +0900 +++ after/net/ipv4/raw.c2006-07-25 16:15:26.0 +0900 @@ -609,6 +609,7 @@ if (sin) { sin->sin_family = AF_INET; sin->sin_addr.s_addr = skb->nh.iph->saddr; + sin->sin_port = 0; memset(&sin->sin_zero, 0, sizeof(sin->sin_zero)); } if (inet->cmsg_flags) diff -ur before/net/ipv6/raw.c after/net/ipv6/raw.c --- before/net/ipv6/raw.c 2006-06-18 10:49:35.0 +0900 +++ after/net/ipv6/raw.c2006-07-25 16:16:52.0 +0900 @@ -411,6 +411,7 @@ /* Copy the address. */ if (sin6) { sin6->sin6_family = AF_INET6; + sin6->sin6_port = 0; ipv6_addr_copy(&sin6->sin6_addr, &skb->nh.ipv6h->saddr); sin6->sin6_flowinfo = 0; sin6->sin6_scope_id = 0; -- End of patch -- Regards. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MTU probing bug?
David Miller wrote: John, have a look at this code in tcp_write_timeout(): mss = min(sysctl_tcp_base_mss, tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)/2); mss = max(mss, 68 - tp->tcp_header_len); That first line looks like it should be a max() instead of a min(). tcp_base_mss is the smallest MSS we should use, therefore we should make sure the "mss" is at least that large. It's also possible that I misread the intention of this code :) From what I read, it is trying to half the MSS in use and adjust the MTU search low point to be based upon this new value. No, the min() is what's intended here. The base_mss is where you want to start searching from. So, on black hole detection, you drop immediatly down to the base. The base is configurable, because making it higher can make searching faster. If it's still too high for some links, you halve it again on successive timeouts. -John - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Networking lock problem. general protection fault: 0000 [#1]
This happened several seconds after plugging in my wireless network card - zd1211 - with the driver from http://sourceforge.net/projects/zd1211/ However. It was after the module was loaded, when my setup script was being run for the device - the relevant portion: ifconfig $device:1 $alternate route add $router $device:1 route add default gw router I'm not sure which route command triggered it. Hope this is the right place. general protection fault: [#1] PREEMPT Modules linked in: zd1211 toshiba_acpi aes_i586 rd uhci_hcd snd_usb_audio snd_usb_lib snd_hwdep usbhid ehci_hcd ohci_hcd pcmcia firmware_class yenta_socket rsrc_nonstatic pcmcia_core eepro100 twofish tea sha512 sha256 sha1 serpent michael_mic md5 md4 khazad des deflate zlib_deflate zlib_inflate crypto_null cast6 cast5 blowfish arc4 cryptoloop loop sd_mod usb_storage libusual usbcore snd_es1968 snd_ac97_codec snd_ac97_bus snd_mpu401_uart snd_rawmidi CPU:0 EIP:0060:[]Not tainted VLI EFLAGS: 00010286 (2.6.17laptop-3110-firstcut #7) EIP is at fib_find_info+0x4d/0xed eax: 0015 ebx: ecx: c6c75aa0 edx: esi: c72421bc edi: c6c75a9c ebp: c7242160 esp: c21a9dac ds: 007b es: 007b ss: 0068 Process route (pid: 1856, threadinfo=c21a8000 task=c5a02540) Stack: 0001 0001 0003 c21a9ed8 c72421e0 c7242160 c02f1182 c7242160 c7242160 c21a9ed8 bffada60 c7120ca0 c02f219d c21a9e68 c21a9ed8 c21a9e58 Call Trace: fib_create_info+0x232/0x3ec fn_hash_insert+0x94/0x3f7 ip_rt_ioctl+0xb5/0x113 sock_attach_fd+0x72/0xd2 inet_ioctl+0x34/0x63 sock_ioctl+0x96/0x1b6 do_ioctl+0x55/0x68 vfs_ioctl+0x59/0x191 sys_ioctl+0x2b/0x46 syscall_call+0x7/0xb Code: 55 24 31 d0 89 54 24 10 8b 55 28 31 d0 89 54 24 0c 89 c2 c1 ea 07 31 c2 c1 e8 0c 31 c2 a1 c0 1e 42 c0 21 ca 8b 1c 90 85 db 74 20 <8b> 03 89 44 24 08 8d 74 26 00 8b 53 5c 8b 04 24 89 54 24 04 39 EIP: [] fib_find_info+0x4d/0xed SS:ESP 0068:c21a9dac BUG: route/1856, lock held at task exit time! [c03a04a0] {rtnl_mutex} .. held by: route: 1856 [c5a02540, 112] ... acquired at: ip_rt_ioctl+0x5d/0x113 d[2774]: Caught signal 15, un-registering and exiting. Config is at http://www.mauve.plus.com/config.gz - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[no subject]
- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unable to handle kernel paging request, another 2.6.16.25 server reboots
In-Reply-To: <[EMAIL PROTECTED]> On Mon, 24 Jul 2006 12:01:48 +0400, Jim Klimov wrote: > I recently wrote about problems with a fileserver rebooting > frequently. Another similar server got under NFS load today > and rebooted at least twice in the past few hours. > > This server has a similar motherboard (Supermicro X5DP8-G2), > dual [EMAIL PROTECTED], two older 3Ware controllers (7506+8506) and > a reiserfs v3 archive. > > The server reported last week has two 3Ware 9550 controllers, > ext3 archives and primarily a Samba usage. I decoded your oops. It's in netfilter: Unable to handle kernel paging request at virtual address f9445d43 printing eip: c0392bba *pde = 32e59067 Oops: [#1] SMP Modules linked in: w83781d hwmon_vid hwmon i2c_isa i2c_core w83627hf_wdt CPU:0 EIP:0060:[]Not tainted VLI EFLAGS: 00010282 (2.6.16.25 #3) EIP is at ipt_do_table+0xae/0x385 eax: 0003 ebx: ecx: cbf4b8d8 edx: f944a2c8 esi: e2262940 edi: f9445cf0 ebp: 8000 esp: f700fac8 ds: 007b es: 007b ss: 0068 Process nfsd (pid: 10685, threadinfo=f700e000 task=f36a7ab0) Stack: f9446b10 0282 c33e2180 f36cda80 c047deec f944a2c8 f9418000 f7788800 c0530cd4 cbf4b8d8 0003 f700fba0 f700fba0 0003 c052f0d8 8000 c03947e7 f7788800 c047dec0 Call Trace: [] ipt_local_out_hook+0x72/0x77 [] nf_iterate+0x69/0x83 [] dst_output+0x0/0x7 [] dst_output+0x0/0x7 [] nf_hook_slow+0x5d/0xea [] dst_output+0x0/0x7 [] ip_queue_xmit+0x3d4/0x4f5 [] dst_output+0x0/0x7 [] __rcu_process_callbacks+0x7d/0xc5 [] activate_task+0x99/0xa5 [] try_to_wake_up+0x29c/0x33b [] tcp_v4_send_check+0x4a/0xdc [] tcp_transmit_skb+0x2e6/0x45a [] tcp_push_one+0x97/0x104 [] tcp_sendmsg+0x36b/0xb4d [] nf_iterate+0x69/0x83 [] tcp_v4_rcv+0x4e6/0x81f [] inet_sendmsg+0x47/0x5f [] sock_sendmsg+0xc9/0xe3 [] ip_rcv+0x2bc/0x56f [] netif_receive_skb+0x227/0x2d7 [] __kfree_skb+0x3a/0xc3 [] autoremove_wake_function+0x0/0x43 [] update_wall_time_one_tick+0x6/0x7e [] update_wall_time+0x8/0x35 [] timer_interrupt+0x5b/0x86 [] handle_IRQ_event+0x26/0x59 [] kernel_sendmsg+0x2e/0x3c [] sock_no_sendpage+0x80/0x9f [] tcp_sendpage+0x49/0x85 [] svc_sendto+0x134/0x250 [] net_rx_action+0x88/0x15f [] do_IRQ+0x1e/0x24 [] common_interrupt+0x1a/0x20 [] svc_tcp_sendto+0x4d/0x99 [] _atomic_dec_and_lock+0x33/0x4c [] svc_send+0xaa/0xed [] fh_put+0x133/0x17d [] svcauth_unix_release+0x43/0x45 [] nfs3svc_release_fhandle+0x0/0xe [] svc_process+0x1b1/0x619 [] default_wake_function+0x0/0xc [] nfsd+0x178/0x301 [] nfsd+0x0/0x301 [] kernel_thread_helper+0x5/0xb 6: 8b 40 10 mov0x10(%eax),%eax 9: 8b 44 86 34 mov0x34(%esi,%eax,4),%eax d: 89 44 24 1c mov%eax,0x1c(%esp) 11: 89 c7 mov%eax,%edi 13: 8b 44 24 34 mov0x34(%esp),%eax 17: 8b 54 24 1c mov0x1c(%esp),%edx 1b: 03 7c 86 0c add0xc(%esi,%eax,4),%edi 1f: 03 54 86 20 add0x20(%esi,%eax,4),%edx 23: 89 5c 24 10 mov%ebx,0x10(%esp) 27: 89 54 24 18 mov%edx,0x18(%esp) 0: 0f b6 5f 53 movzbl 0x53(%edi),%ebx <= 4: 89 d8 mov%ebx,%eax 6: 24 08 and$0x8,%al 8: 84 c0 test %al,%al a: 0f 84 b4 02 00 00 je 2c4 <_EIP+0x2c4> 10: 8b 47 08 mov0x8(%edi),%eax This is in net/ipv4/netfiler/ip_tables.c::ipt_do_table(): table_base = (void *)private->entries[smp_processor_id()]; e = get_entry(table_base, private->hook_entry[hook]); /* For return from builtin chain */ back = get_entry(table_base, private->underflow[hook]); do { IP_NF_ASSERT(e); IP_NF_ASSERT(back); ===>if (ip_packet_match(ip, indev, outdev, &e->ip, offset)) { 'e' is an invalid pointer. (ip_packet_match() was inlined.) hook == 3 The call trace seems to show that svc_tcp_sendto() was interrupted by an IRQ for an incoming packet, or maybe the timer interrupt? Can you build with CONFIG_FRAME_POINTERS and see if you can get a cleaner trace? -- Chuck - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [IPROUTE]: Add support for multipath route realms
Herbert Xu wrote: > Patrick McHardy <[EMAIL PROTECTED]> wrote: > >>[IPROUTE]: Add support for multipath route realms >> >>Routing realms exist per nexthop, but iproute currently only allows to send >>a single route realm, which is refused by the kernel for multipath routes. >>Add support for specifying per nexthop realms. Old kernels only return the >>first realm back to userspace when dumping, so the others can't be displayed, >>besides that it will also behave correctly on old kernels. >> >>old kernel: >> >>1.2.3.4 realm 1 >> nexthop dev dummy0 weight 1 >> nexthop dev dummy1 weight 1 >> nexthop dev dummy2 weight 1 >> nexthop dev dummy3 weight 1 >> >>new kernel: >> >>1.2.3.4 >> nexthop realm 1 dev dummy0 weight 1 >> nexthop realm 2 dev dummy1 weight 1 >> nexthop realm 3 dev dummy2 weight 1 >> nexthop realm 4 dev dummy3 weight 1 > > > This really looks like papering over fundamental brokenness of > IP_ROUTE_MULTIPATH_CACHED since you wouldn't otherwise get these > entries in the routing cache. > > This reminds me that I better revisit the reasons that people gave > for actually using IP_ROUTE_MULTIPATH_CACHED the last time we tried > to get rid of it. Actually it has nothing to do with MULTIPATH_CACHED, it didn't even use it in these tests unless I seriously mixed something up. The realm can be used by netfilter to find out which nexthop was chosen and bind the connection to this nexthop manually using CONNMARK and MARK. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [IPROUTE]: Add support for multipath route realms
On Tue, Jul 25, 2006 at 06:19:33PM +1000, Herbert Xu wrote: > > > new kernel: > > > > 1.2.3.4 > >nexthop realm 1 dev dummy0 weight 1 > >nexthop realm 2 dev dummy1 weight 1 > >nexthop realm 3 dev dummy2 weight 1 > >nexthop realm 4 dev dummy3 weight 1 > > This really looks like papering over fundamental brokenness of > IP_ROUTE_MULTIPATH_CACHED since you wouldn't otherwise get these > entries in the routing cache. Nevermind, I misread your changelog. Your patch is obviously not related to IP_ROUTE_MULTIPATH_CACHED :) -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [IPROUTE]: Add support for multipath route realms
Patrick McHardy <[EMAIL PROTECTED]> wrote: > > [IPROUTE]: Add support for multipath route realms > > Routing realms exist per nexthop, but iproute currently only allows to send > a single route realm, which is refused by the kernel for multipath routes. > Add support for specifying per nexthop realms. Old kernels only return the > first realm back to userspace when dumping, so the others can't be displayed, > besides that it will also behave correctly on old kernels. > > old kernel: > > 1.2.3.4 realm 1 >nexthop dev dummy0 weight 1 >nexthop dev dummy1 weight 1 >nexthop dev dummy2 weight 1 >nexthop dev dummy3 weight 1 > > new kernel: > > 1.2.3.4 >nexthop realm 1 dev dummy0 weight 1 >nexthop realm 2 dev dummy1 weight 1 >nexthop realm 3 dev dummy2 weight 1 >nexthop realm 4 dev dummy3 weight 1 This really looks like papering over fundamental brokenness of IP_ROUTE_MULTIPATH_CACHED since you wouldn't otherwise get these entries in the routing cache. This reminds me that I better revisit the reasons that people gave for actually using IP_ROUTE_MULTIPATH_CACHED the last time we tried to get rid of it. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ip multicast route bug fix
Stephen Hemminger <[EMAIL PROTECTED]> wrote: > > @@ -1593,12 +1594,19 @@ int ipmr_get_route(struct sk_buff *skb, >read_unlock(&mrt_lock); >return -ENODEV; >} > - skb->nh.raw = skb_push(skb, sizeof(struct iphdr)); > - skb->nh.iph->ihl = sizeof(struct iphdr)>>2; > - skb->nh.iph->saddr = rt->rt_src; > - skb->nh.iph->daddr = rt->rt_dst; > - skb->nh.iph->version = 0; > - err = ipmr_cache_unresolved(vif, skb); > + > + iskb = alloc_skb(sizeof(struct iphdr), GFP_KERNEL); > + if (!iskb) { > + read_unlock(&mrt_lock); > + return -ENOMEM; > + } > + memset(iskb->data, 0, sizeof(struct iphdr)); > + iskb->nh.raw = iskb->data; > + iskb->nh.iph->ihl = sizeof(struct iphdr)>>2; > + iskb->nh.iph->saddr = rt->rt_src; > + iskb->nh.iph->daddr = rt->rt_dst; > + > + err = ipmr_cache_unresolved(vif, iskb); I'm afraid this is still broken in a different way. If ipmr_cache_unresolved queues the skb onto the unresolved list things it's going to try to use the skb as a netlink skb instead :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RDMA will be reverted
On Tue, Jul 25, 2006 at 12:33:44AM -0700, David Miller ([EMAIL PROTECTED]) wrote: > From: Evgeniy Polyakov <[EMAIL PROTECTED]> > Date: Tue, 25 Jul 2006 10:59:21 +0400 > > > As a side completely unrelated to either my or others work note :) - > > I think it is a nanooptimisation - we get a bit of performance here, > > and lose those bit in other place. > > When bag is filled, there is no much sence of rearranging some stuff > > inside to be able to place another one - it is better to buy new bag. > > It is a matter of what the viewpoint is, I suppose. Definitely. > I think in this specific case it might turn out to be > better for the scheduler to respond to what the device > throws at it, rather than the other way around. And > in that case we need no feedback from scheduler to > cpu demux engine. That's exactly one bit lose/gain - if CPU is loafing - we get a gain, and lose otherwise - so instead of generally predictible steady behaviour we can end up with bursty shapes. Actually without real tests all it is just a handwaving, so let's see when modern NICs get that capability, so network softirq scheduling would be changed accordingly. -- Evgeniy Polyakov - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH Round 4 2/3] Core network changes to support network event notification.
Steve Wise <[EMAIL PROTECTED]> wrote: > > Routing redirect events are broadcast as a pair of rtmsgs, RTM_DELROUTE > and RTM_NEWROUTE. This may confuse existing rtnetlink users since you're generating an RTM_DELROUTE message that's identical to one triggered by something like 'ip route del'. As you're introducing a completely new RTM_ROUTEUPD type, it might be better to attach any information from the existing route that you need to the ROUTEUPD message. Actually, what was the reason you need the existing route here? > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c > index 5f87533..33d8a83 100644 > --- a/net/ipv4/fib_semantics.c > +++ b/net/ipv4/fib_semantics.c > @@ -44,6 +44,7 @@ #include > #include > #include > #include > +#include > > #include "fib_lookup.h" > > @@ -279,6 +280,14 @@ void rtmsg_fib(int event, u32 key, struc >struct sk_buff *skb; >u32 pid = req ? req->pid : n->nlmsg_pid; >int size = NLMSG_SPACE(sizeof(struct rtmsg)+256); > + struct netevent_route_info nri; > + int netevent; > + > + nri.family = AF_INET; > + nri.data = &fa->fa_info; > + netevent = event == RTM_NEWROUTE ? NETEVENT_ROUTE_ADD > +: NETEVENT_ROUTE_DEL; > + call_netevent_notifiers(netevent, &nri); Hmm, this is broken. These route events are meaningless without the corresponding IP rule events. Are you sure you really want to make your hardware/driver grok multiple routing tables? Perhaps you should simply stick to dst entries and flush all your tables when the routes are changed. This is what the Linux IP stack does. > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > index 2dc6dbb..18879e6 100644 > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -1117,6 +1120,52 @@ static void rt_del(unsigned hash, struct >spin_unlock_bh(rt_hash_lock_addr(hash)); > } > > +static void rtm_redirect(struct rtable *old, struct rtable *new) > +{ > + struct netevent_redirect netevent; > + struct sk_buff *skb; > + int err; > + > + netevent.old = &old->u.dst; > + netevent.new = &new->u.dst; > + > + /* notify netevent subscribers */ > + call_netevent_notifiers(NETEVENT_REDIRECT, &netevent); > + > + /* Post NETLINK messages: RTM_DELROUTE for old route, > + RTM_NEWROUTE for new route */ > + skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); Please use a better size estimate rather than NLMSG_GOODSIZE here since you're doing GFP_ATOMIC. > @@ -1442,6 +1493,32 @@ unsigned short ip_rt_frag_needed(struct >return est_mtu ? : new_mtu; > } > > +static void rtm_pmtu_update(struct rtable *rt) > +{ > + struct sk_buff *skb; > + int err; > + > + call_netevent_notifiers(NETEVENT_PMTU_UPDATE, &rt->u.dst); > + > + skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); Ditto. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RDMA will be reverted
From: Evgeniy Polyakov <[EMAIL PROTECTED]> Date: Tue, 25 Jul 2006 10:59:21 +0400 > As a side completely unrelated to either my or others work note :) - > I think it is a nanooptimisation - we get a bit of performance here, > and lose those bit in other place. > When bag is filled, there is no much sence of rearranging some stuff > inside to be able to place another one - it is better to buy new bag. It is a matter of what the viewpoint is, I suppose. I think in this specific case it might turn out to be better for the scheduler to respond to what the device throws at it, rather than the other way around. And in that case we need no feedback from scheduler to cpu demux engine. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
MTU probing bug?
John, have a look at this code in tcp_write_timeout(): mss = min(sysctl_tcp_base_mss, tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)/2); mss = max(mss, 68 - tp->tcp_header_len); That first line looks like it should be a max() instead of a min(). tcp_base_mss is the smallest MSS we should use, therefore we should make sure the "mss" is at least that large. It's also possible that I misread the intention of this code :) From what I read, it is trying to half the MSS in use and adjust the MTU search low point to be based upon this new value. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RDMA will be reverted
On Mon, Jul 24, 2006 at 11:48:53PM -0700, David Miller ([EMAIL PROTECTED]) wrote: > > And if that CPU is very busy? > > Linux should somehow tell NIC that some CPUs are valid and some are not > > right now, but not in a second, so scheduler must be tightly bound with > > network internals. > > Yes, it is research problem. > > Most of the time, even stateless version will improve things. > From another viewpoint, even in worst case, it can be no > worse than current situation. :) > > BTW, such dynamic remapping is provided for in the NDIS interfaces. > There is an indexing table that is gone through using computed hash to > get "cpu number". I think we should force Linux scheduler to export some easily accessed CPU statistic, so that info might be used by irq layer/protocol processing. As a side completely unrelated to either my or others work note :) - I think it is a nanooptimisation - we get a bit of performance here, and lose those bit in other place. When bag is filled, there is no much sence of rearranging some stuff inside to be able to place another one - it is better to buy new bag. -- Evgeniy Polyakov - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html