[RFC] Best method to control a transmit-only mode on fiber NICs (specifically sky2)
Hi, The company I'm working for has an unusual fiber NIC configuration that we use for one of our network appliances. We connect only a single fiber from the TX port on one NIC to the RX port on another NIC, providing a physically one-way path for enhanced security. Unfortunately this doesn't work with most NIC drivers, as even with auto-negotiation off they look for link probe pulses before they consider the link up and are willing to send packets. We have been able to use Myricom 10GigE NICs with a custom firmware image. More recently we have patched the sky2 driver to turn on the FIB_FORCE_LNK flag in the PHY control register; this seems to work on the Marvell-chipset boards we have here. What would be the preferred way to control this force-link flag? Right now we are accessing it using ethtool; we have added an additional duplex mode: DUPLEX_TXONLY, with a value of 2. When you specify a speed and turn off autonegotiation (./patched-ethtool -s eth2 speed 1000 autoneg off duplex txonly), it will turn on the specified bit in the PHY control register and the link will automatically come up. We also have one related bug-fix^Wdirty hack for sky2 to reset the PHY a second time during netif-up after enabling interrupts; otherwise the immediate link-up interrupt gets lost. Once I get approval from the company I will post the patch itself for review. I look forward to your comments and suggestions. Cheers, Kyle Moffett -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[NET/IPv6] Race condition with flow_cache_genid?
Whoops, I accidentally sent this to [EMAIL PROTECTED] instead of [EMAIL PROTECTED] Original email below: Hi, I was poking around trying to figure out how to install the Mobile IPv6 daemons this evening and noticed they required a kernel patch, although upon further inspection the kernel patch seemed to already be applied in 2.6.24. Unfortunately the flow cache appears to be horribly racy. Attached below are the only uses of the variable flow_cache_genid in 2.6.24. Now, I am no expert in this particular area of the code, but the atomic_t flow_cache_genid variable is ONLY ever used with atomic_inc() and atomic_read(). There are no memory barriers or other dec_and_test()-style functions, so that variable could just as easily be replaced with a plain old C int. Basically either there is some missing locking here or it does not need to be atomic_t. Judging from the way it *appears* to be used to check if cache entries are up-to-date with the latest changes in policy, I would guess the former. In particular that whole flow_cache_lookup() thing looks racy as hell, since somebody could be in the middle of that looking at if (fle->genid == atomic_read(&flow_cache_genid)). It does the atomic_read(), which BTW is literally implemented as: #define atomic_read(atomicvar) ((atomicvar)->value) on some platforms. Immediately after the atomic read (or even before, since there's no cache-flush or read-modify-write), somebody calls into selinux_xfrm_notify_policyload() and increments the flow_cache_genid because selinux just loaded a security policy. Now we're accepting a cache entry which applies to the PREVIOUS security policy. I can only assume that's really bad. Even worse, there seems to be a race between SELinux loading a new policy and calling selinux_xfrm_notify_policyload(), since we could easily get packets and process them according to the old cache entry on one CPU before SELinux has had a chance to update the generation ID from the other. 
Furthermore, there's no guarantee the CPU caches will get updated in reasonable time. Clearly SELinux needs to have some way of atomically invalidating the flow cache of all CPUs *simultaneously* with loading a new policy, which probably means they both need to be under the same lock, or something. The same problem appears to occur with updating the XFRM policy and incrementing flow_cache_genid. Probably the fastest solution is to put the flow cache under the xfrm_policy_lock (which already disables local bottom-halves), and either take that lock during SELinux policy load or if there are lock ordering problems then add a variable flow_cache_ignore and change the xfrm_notify hooks: void selinux_xfrm_notify_policyload_pre(void) { write_lock_bh(&xfrm_policy_lock); flow_cache_genid++; flow_cache_ignore = 1; write_unlock_bh(&xfrm_policy_lock); } void selinux_xfrm_notify_policyload_post(void) { write_lock_bh(&xfrm_policy_lock); flow_cache_ignore = 0; write_unlock_bh(&xfrm_policy_lock); } Cheers, Kyle Moffett BEGIN QUOTED CODE INVOLVING flow_cache_genid: include/net/flow.h:94: extern atomic_t flow_cache_genid; net/core/flow.c:39: atomic_t flow_cache_genid = ATOMIC_INIT(0); net/core/flow.c:169:flow_cache_lookup(): if (flow_hash_rnd_recalc(cpu)) flow_new_hash_rnd(cpu); hash = flow_hash_code(key, cpu); head = &flow_table(cpu)[hash]; for (fle = *head; fle; fle = fle->next) { if (fle->family == family && fle->dir == dir && flow_key_compare(key, fle->key) == 0) { if (fle->genid == atomic_read(&flow_cache_genid)) { void *ret = fle->object; if (ret) atomic_inc(&fle->object_ref); local_bh_enable(); return ret; } break; } } net/xfrm/xfrm_policy.c:1025: int xfrm_policy_delete(struct xfrm_policy *pol, int dir) { write_lock_bh(&xfrm_policy_lock); pol = __xfrm_policy_unlink(pol, dir); write_unlock_bh(&xfrm_policy_lock); if (pol) { if (dir < XFRM_POLICY_MAX) atomic_inc(&flow_cache_genid); xfrm_policy_kill(pol); return 0; } return -ENOENT; } net/ipv6/inet6_connection_sock.c:142: static inline void 
__inet6_csk_dst_store(struct sock *sk, struct dst_entry *dst, struct in6_addr *daddr, struct in6_addr *saddr) { __ip6_dst_store(sk, dst, daddr, saddr); #ifdef CONFIG_XFRM { struct rt6_info *rt = (struct rt6_info *)dst; rt->rt6i_flow_cache_genid = atomic_read(&flow_cache_genid); } #endif } security/selinux/include/xfrm.h:41: static inline void selinux_xfrm_notify_policyload
Re: [PATCH 1/2] bnx2: factor out gzip unpacker
On Sep 24, 2007, at 13:32:23, Lennart Sorensen wrote: On Fri, Sep 21, 2007 at 11:37:52PM +0100, Denys Vlasenko wrote: But I compile net/* into bzImage. I like netbooting :) Isn't it possible to netboot with an initramfs image? I am pretty sure I have seen some systems do exactly that. Yeah, I've got Debian boxes that have never *not* netbooted (one Dell Op^?^?Craptiplex box whose BIOS and ACPI sucks so bad it can't even load GRUB/LILO, although Windows somehow works fine). So they boot PXELinux using the PXE boot ROM on the NICs and it loads both a kernel and an initramfs into memory. Kernel is stock Debian and hardly has enough built-in to spit at you, let alone find network/disks, but it manages to load everything it needs off the automagically-generated initramfs. Cheers, Kyle Moffett
Re: Distributed storage. Move away from char device ioctls.
-accessed inode objects and creates non-fragmented copies before deleting the old ones. There's a lot of other technical details which would need resolution in an actual implementation, but this is enough of a summary to give you the gist of the concept. Most likely there will be some major flaw which makes it impossible to produce reliably, but the concept contains the things I would be interested in for a real networked filesystem. Cheers, Kyle Moffett
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Sep 10, 2007, at 06:56:29, Denys Vlasenko wrote: On Sunday 09 September 2007 19:18, Arjan van de Ven wrote: On Sun, 9 Sep 2007 19:02:54 +0100 Denys Vlasenko [EMAIL PROTECTED] wrote: Why is all this fixation on volatile? I don't think people want volatile keyword per se, they want atomic_read(x) to _always_ compile into a memory-accessing instruction, not register access. and ... why is that? is there any valid, non-buggy code sequence that makes that a reasonable requirement? Well, if you insist on having it again: Waiting for atomic value to be zero: while (atomic_read(&x)) continue; gcc may happily convert it into: reg = atomic_read(&x); while (reg) continue; Bzzt. Even if you fixed gcc to actually convert it to a busy loop on a memory variable, you STILL HAVE A BUG as it may *NOT* be gcc that does the conversion, it may be that the CPU does the caching of the memory value. GCC has no mechanism to do cache-flushes or memory-barriers except through our custom inline assembly. Also, you probably want a cpu_relax() in there somewhere to avoid overheating the CPU. Thirdly, on a large system it may take some arbitrarily large amount of time for cache-propagation to update the value of the variable in your local CPU cache. Finally, if atomics are based on spinlock+interrupt-disable then you will sit in a tight busy-loop of spin_lock_irqsave()/spin_unlock_irqrestore(). Depending on your system's internal model this may practically lock up your core because the spin_lock() will take the cacheline for exclusive access and doing that in a loop can prevent any other CPU from doing any operation on it! Since your IRQs are disabled you don't even have the very small window in which an IRQ might come along and free it up long enough for the update to take place. 
The earlier code segment of: while (atomic_read(&x) > 0) atomic_dec(&x); is *completely* buggy because you could very easily have 4 CPUs doing this on an atomic variable with a value of 1 and end up with it at negative 3 by the time you are done. Moreover all the alternatives are also buggy, with the sole exception of this rather obvious-seeming one: atomic_set(&x, 0); You simply CANNOT use an atomic_t as your sole synchronizing primitive, it doesn't work! You virtually ALWAYS want to use an atomic_t in the following types of situations: (A) As an object refcount. The value is never read except as part of an atomic_dec_return(). Why aren't you using struct kref? (B) As an atomic value counter (number of processes, for example). Just reading the value is racy anyways; if you want to enforce a limit or something then use atomic_inc_return(), check the result, and use atomic_dec() if it's too big. If you just want to return the statistics then you are going to be instantaneous-point-in-time anyways. (C) As an optimization value (statistics-like, but exact accuracy isn't important). Atomics are NOT A REPLACEMENT for the proper kernel subsystems, like completions, mutexes, semaphores, spinlocks, krefs, etc. It's not useful for synchronization, only for keeping track of simple integer RMW values. Note that atomic_read() and atomic_set() aren't very useful RMW primitives (read-nomodify-nowrite and read-set-zero-write). Code which assumes anything else is probably buggy in other ways too. So while I see no real reason for the volatile on the atomics, I also see no real reason why it's terribly harmful. Regardless of the volatile on the operation the CPU is perfectly happy to cache it anyways so it doesn't buy you any actual always-access-memory guarantees. If you are just interested in it as an optimization you could probably just read the properly-aligned integer counter directly, an atomic read on most CPUs. 
If you really need it to hit main memory *every* *single* *time* (Why? Are you using it instead of the proper kernel subsystem?) then you probably need a custom inline assembly helper anyways. Cheers, Kyle Moffett
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Sep 10, 2007, at 12:46:33, Denys Vlasenko wrote: My point is that people are confused as to what atomic_read() exactly means, and this is bad. Same for cpu_relax(). First one says read, and second one doesn't say barrier. Q&A: Q: When is it OK to use atomic_read()? A: You are asking the question, so never. Q: But I need to check the value of the atomic at this point in time... A: Your code is buggy if it needs to do that on an atomic_t for anything other than debugging or optimization. Use either atomic_*_return() or a lock and some normal integers. Q: So why can't the atomic_read DTRT magically? A: Because the right thing depends on the situation and is usually best done with something other than atomic_t. If somebody can post some non-buggy code which is correctly using atomic_read() *and* depends on the compiler generating extra nonsensical loads due to volatile then the issue *might* be reconsidered. This also includes samples of code which uses atomic_read() and needs memory barriers (so that we can fix the buggy code, not so we can change atomic_read()). So far the only code samples anybody has posted are buggy regardless of whether the value and/or accessors are flagged volatile. And hey, maybe the volatile ops *should* be implemented in inline ASM for future-proof-ness, but that's a separate issue. Cheers, Kyle Moffett
Re: [DRIVER SUBMISSION] DRBD wants to go mainline
On Jul 25, 2007, at 22:03:37, [EMAIL PROTECTED] wrote: On Wed, 25 Jul 2007, Satyam Sharma wrote: On 7/25/07, Lars Ellenberg [EMAIL PROTECTED] wrote: On Wed, Jul 25, 2007 at 04:41:53AM +0530, Satyam Sharma wrote: [...] But where does the send come into the picture over here -- a send won't block forever, so I don't foresee any issues whatsoever w.r.t. kthreads conversion for that. [ BTW I hope you're *not* using any signals-based interface for your kernel thread _at all_. Kthreads disallow (ignore) all signals by default, as they should, and you really shouldn't need to write any logic to handle or do-certain-things-on-seeing a signal in a well designed kernel thread. ] and the sending latency is crucial to performance, while the recv will not timeout for the next few seconds. Again, I don't see what sending latency has to do with a kernel_thread to kthread conversion. Or with signals, for that matter. Anyway, as Kyle Moffett mentioned elsewhere, you could probably look at other examples (say cifs_demultiplexer_thread() in fs/cifs/connect.c). the basic problem, and what we use signals for, is: it is waiting in recv, waiting for the peer to say something. but I want it to stop recv, and go send something right now. That's ... weird. Most (all?) communication between any two parties would follow a protocol where someone recv's stuff, does something with it, and sends it back ... what would you send right now if you didn't receive anything? because even though you didn't receive anything you now have something important to send. remember that both sides can be sitting in receive mode. this puts them both in a position to respond to the other if the other has something to say. Why not just have 2 threads, one for sending and one for receiving. When your receiving thread gets data it takes appropriate locks and processes it, then releases the locks and goes back to waiting for packets. 
Your sending thread would take appropriate locks, generate data to send, release locks, and transmit packets. You don't have to interrupt the receive thread to send packets, so where's the latency problem, exactly? If I were writing that in userspace I would have: (A) The pool of IO-generating threads (IE: What would ordinarily be userspace) (B) One or a small number of data-reception threads. (C) One or a small number of data-transmission threads. When you get packets to process in your network-reception thread(s), you queue appropriate disk IOs and any appropriate responses with your transmission thread(s). You can basically just sit in a loop on tcp_recvmsg => demultiplex => do-stuff. When your IO-generators actually make stuff to send you queue such data for disk IO, then packetize it and hand it off to your data-transmission threads. If you made all your sockets and inter-thread pipes nonblocking then in userspace you would just epoll_wait() on the sockets and pipes and be easily able to react to any IO from anywhere. In kernel space there are similar nonblocking interfaces, although it would probably be easier just to use a couple threads. Cheers, Kyle Moffett
Re: [DRIVER SUBMISSION] DRBD wants to go mainline
For the guys on netdev, would you please look at the tcp_recvmsg-threading and TCP_NAGLE_CORK issues below and give opinions on the best way to proceed? One thing to remember, you don't necessarily have to merge every feature right away. As long as the new code is configured off by default with an (EXPERIMENTAL) warning, you can start getting the core parts and the cleanups upstream before you have to resolve all the issues with low-latency, dynamic-tracing-frameworks, etc. On Jul 23, 2007, at 09:32:02, Lars Ellenberg wrote: On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: +/* I don't remember why XCPU ... + * This is used to wake the asender, + * and to interrupt sending the sending task + * on disconnect. + */ +#define DRBD_SIG SIGXCPU Don't use signals between kernel threads, use proper primitives like notifiers and waitqueues, which means you should also probably switch away from kernel_thread() to the kthread_*() APIs. Also you should fix this FIXME or remove it if it no longer applies :-D. right. but how do I tell a network thread in tcp_recvmsg to stop early, without using signals? I'm not really a kernel-networking guy, so I can't answer this definitively, but I'm pretty sure the problem has been solved in many network filesystems and such, so I've added a netdev CC. The way I'd do it in userspace is with nonblocking IO and epoll(), that way I don't actually have to stop or signal the thread, I can just add a socket to my epoll fd when I want to pay attention to it, and remove it from my epoll fd when I'm done with it. I'd assume there's some equivalent way in kernelspace based around the struct kiocb *iocb and int nonblock parameters to the tcp_recvmsg() kernel function. +/* see kernel/printk.c:printk_ratelimit + * macro, so it is easy to have independent rate limits at different locations + * initializer element not constant ... 
with kernel 2.4 :( + * so I initialize toks to something large + */ +#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \ Any particular reason you can't just use printk_ratelimit for this? I want to be able to do a rate-limit per specific message/code fragment, without affecting other messages or execution paths. Ok, so could you change your patch to modify __printk_ratelimit() to also accept a struct printk_rate datastructure and make printk_ratelimit() call __printk_ratelimit(&global_printk_rate)? Typically if $KERNEL_FEATURE is insufficient for your needs you should fix $KERNEL_FEATURE instead of duplicating a replacement in your driver. This applies to basically all of the things I'm talking about: kernel-threads, workqueues (BTW: I believe you can make your own custom workqueue thread(s) instead of using the default events/* ones), debugging macros, fault-insertion, integer math, lock-checking, dynamic tracing, etc. If you find some reason that some generic code won't work for you, please try to fix it first so we can all benefit from it. Umm, how about fixing this to actually use proper workqueues or something instead of this open-coded mess? unlikely to happen right now. but it is on our todo list... Unfortunately problems like these need to be fixed before a mainline merge. Merging duplicated code is a big no-no, and historically there have been problems with people who merge code and never properly maintain it once it's in tree. As a result the rule is your code has to be easily maintainable before anybody will even *consider* merging it. +/* I want the packet to fit within one page + * THINK maybe use a special bitmap header, + * including offset and compression scheme and whatnot + * Do not use PAGE_SIZE here! Use an architecture-agnostic constant! + */ +#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof (long)) Yuck. 
Definitely use PAGE_SIZE here, so at least if it's broken on an arch with multiple page sizes, somebody can grep for PAGE_SIZE to fix it. It also means that on archs/configs with 8k or 64k pages you won't waste a bunch of memory. No. This is not to allocate anything, but defines the chunk size with which we transmit the bitmap, when we have to. We need to be able to talk from one arch (say i586) to some other (say s390, or sparc, or whatever). The receiving side has a one-page buffer, from which it may or may not to endian-conversion. The hardcoded 4096 is the minimal common denominator here. Ahhh. Please replace the constant 4096 with: /* This is the maximum amount of bitmap we will send per packet */ # define MAX_BITMAP_CHUNK_SIZE 4096 # define BM_PACKET_WORDS \ ((MAX_BITMAP_CHUNK_SIZE - sizeof(struct Drbd_Header))/sizeof(long)) It's more text but dramatically improves the readability by eliminating more magic numbers. This is a much milder case than I've seen in the past, so it's not that big of a deal. +/* Dynamic tracing framework */ guess we
Re: PM policy, hotplug, power saving (was Re: [PATCH] b44: power down PHY when interface down)
On Jun 30, 2007, at 12:42:06, Jeff Garzik wrote: Definitely matters. Switch renegotiation can take a while, and you must take into account the common case of interface bouncing (immediate down, then up). Hordes actively complained the few times we experimented with this, because of e.g. DHCP's habit of bouncing the interface, which resulted in PHY power bouncing, which resulted in negotiation, which resulted in an excruciating wait on various broken or stupid switches. Overall, this may be classed with other problems of a similar sort: we can power down a PHY, but that removes hotplug capability and extends partner/link negotiation time. Like SATA, we actually want to support BOTH -- active hotplug and PHY power-down -- and so this wanders into power management policy. Give me a knob, and we can program plenty of ethernet|SATA|USB|... drivers to power down the PHY and save power. With some buggy switches and other hardware you actually *want* to bounce the link to get them to properly renegotiate. I can also see wanting to power off and on a single-PoE-port NIC to restart whatever device is at the other end, although I don't know if any such devices exist. Currently the tg3 driver turns the PHY off and on during down/up on a few of my systems, which I use to make a buggy no-name switch recognize STP changes properly. Cheers, Kyle Moffett
Re: Scaling Max IP address limitation
On Jun 24, 2007, at 13:20:01, David Jones wrote: Hi, I am trying to add multiple IP addresses (v6) to my FC7 box on eth0. But I am hitting a max limit of 4000 IP addresses. Seems like there is a limiting variable in the linux kernel (which one?) that prevents adding more than 4096 IP addresses. What do I need to change in the Linux kernel (and then recompile) to be able to add more than 4K addresses per system? Do you really need that many IP addresses? When somebody finally gets around to implementing REDIRECT support for ip6tables then you could just redirect them all to the same port on the local system. Then with a happy little getsockopt() you can find out the original IP address for use in whatever application you are running. That's likely to be a thousand times more efficient than binary searching through 5000+ mostly-sequential IP addresses per received packet. Unrelated wishful thinking I keep having hopeful dreams that one day netfilter will grow support for cross-protocol NAT (IE: NAT a TCPv4 connection over TCPv6 to the IPv6-only local web server, or vice versa). It would seem that would require a merged xtables program. Having routing table operations, IPsec transformations, etc just be another step in the firewall rules would also be useful. It would be handy to be able to -j ROUTE, then -j IPSEC, then -j ROUTE again, to re-route the now-encapsulated IPsec packets to their proper destination. That would also eliminate the sort-of-hacky problems with destination network interface in the bridging code: -j BRIDGE might be another step in the process, and conceivably you could have independent bridge MAC tables too. You'd probably also want -j BRIDGE_TEST and -j ROUTE_TEST to compute the output network interface without actually modifying the addresses. That would also appear to get rid of the need for all tables other than filter and all predefined chains other than INPUT and OUTPUT. 
Default rules would be these: nettables -A INPUT -j CONNTRACK nettables -A INPUT -j LOCALMATCH nettables -A INPUT --for-this-host -j ACCEPT nettables -A INPUT -j OUTPUT nettables -A OUTPUT -j ROUTE nettables -A OUTPUT -j TRANSMIT Forwarded packets would be sent right into the OUTPUT chain from the INPUT chain by appropriate rules. Instead of turning off ip_forwarding in /proc/sys, you could just change the -j OUTPUT in the INPUT chain to -j ACCEPT, and it would be impossible to forward packets. I can't see any functionality that we have today which a mechanism like this wouldn't support, aside from the fact that it hands the admin a loaded nuclear missile aimed at their foot (flushing the INPUT chain would basically be analogous to committing network suicide, although there exist other ways to do that with netfilter today). /Unrelated wishful thinking Cheers, Kyle Moffett
Re: [BUG][debian-2.6.20-1-686] bridging + vlans + vconfig rem == stuck kernel
On May 11, 2007, at 01:49:27, Kyle Moffett wrote: On May 10, 2007, at 00:34:11, Kyle Moffett wrote: On May 10, 2007, at 00:25:54, Ben Greear wrote: Looks like a deadlock in the vlan code. Any chance you can run this test with lockdep enabled? You could also add a printk in vlan_device_event() to check which event it is hanging on, and the netdevice that is passed in. Ok, I'll try building a 2.6.21 kernel with lockdep and some debugging printk()s in the vlan_device_event() function and get back to you tomorrow. Thanks for the quick response! [snip] ifup -a brings up the interfaces in this order (See previous email for configuration details): lo net0 wfi0 world0 lan lan:0 world ifdown -a appears to bring them down in the same order (at least, until it gets stuck). Hmm, turns out that it always hung downing this entry in my interfaces file, independent of ordering: iface world0 inet manual mac-address 8b:8d:cb:91:e2:4c minimally-up yes vlan-dev net0 vlan-id 4094 By commenting out the MAC address line it worked. Yes, I realize the MAC address specified there is bogus; I managed to {think,type}o that one somehow. I had been intending to specify a locally-allocated virtual MAC address on world0 but instead I managed to somehow assign one with the MAC multicast bit set (01:00:00:00:00:00). If I change the above garbage MAC to 02:00:00:00:00:01 (the first 02 is the locally-administered bit) then it seems to work perfectly fine. My guess is that the bridging code doesn't properly drop all references to world0 when it has that garbage MAC address on it (since the problem only shows up when both the invalid mac-address is present *AND* I start the world bridge). I suppose this isn't really a big problem, but it would be nice if things didn't leak refcounts on invalid input. Cheers, Kyle Moffett
Re: [BUG][debian-2.6.20-1-686] bridging + vlans + vconfig rem == stuck kernel
On May 10, 2007, at 00:34:11, Kyle Moffett wrote: On May 10, 2007, at 00:25:54, Ben Greear wrote: Looks like a deadlock in the vlan code. Any chance you can run this test with lockdep enabled? You could also add a printk in vlan_device_event() to check which event it is hanging on, and the netdevice that is passed in. Ok, I'll try building a 2.6.21 kernel with lockdep and some debugging printk()s in the vlan_device_event() function and get back to you tomorrow. Thanks for the quick response! Progress!!! I built a 2.6.21.1 kernel with a 1MB dmesg buffer, almost all of the locking debugging options on (as well as a few others just for kicks), a VLAN debug #define turned on in the net/8021q/vlan.h file, and lots of extra debugging messages added to the functions in vlan.c. My initial interpretation is that due to the funny order in which ifdown -a takes down interfaces, it tries to delete the VLAN interfaces before the bridges running atop them have been taken down. Ordinarily this seems to work, but when the underlying physical ethernet is down already, the last VLAN to be deleted seems to hang somehow. The full results are as follows: The lock dependency validator at startup passes all 218 testcases, indicating that all the locking crap is probably working correctly (those debug options chew up another meg of RAM). ifup -a brings up the interfaces in this order (See previous email for configuration details): lo net0 wfi0 world0 lan lan:0 world ifdown -a appears to bring them down in the same order (at least, until it gets stuck). Attached below is filtered debugging information. I cut out 90% of the crap in the syslog, but there's still a lot left over to sift through; sorry. If you want my .config or the full text of the log then email me privately and I'll send it to you, as it's kinda big. 
I appreciate any advice, thanks for all your help Cheers, Kyle Moffett This first bit is the ifup -a -v -i interfaces: ADDRCONF(NETDEV_UP): net0: link is not ready vlan_ioctl_handler: args.cmd: 6 vlan_ioctl_handler: args.cmd: 0 register_vlan_device: if_name -:net0:-^Ivid: 2 About to allocate name, vlan_name_type: 3 Allocated new name -:net0.2:- About to go find the group for idx: 2 vlan_transfer_operstate: net0 state transition applies to net0.2 too: vlan_proc_add, device -:net0.2:- being added. Allocated new device successfully, returning. wfi0: add 33:33:00:00:00:01 mcast address to master interface wfi0: add 01:00:5e:00:00:01 mcast address to master interface ADDRCONF(NETDEV_UP): wfi0: link is not ready vlan_ioctl_handler: args.cmd: 6 vlan_ioctl_handler: args.cmd: 0 register_vlan_device: if_name -:net0:-^Ivid: 4094 About to allocate name, vlan_name_type: 3 Allocated new name -:net0.4094:- About to go find the group for idx: 2 vlan_transfer_operstate: net0 state transition applies to net0.4094 too: vlan_proc_add, device -:net0.4094:- being added. Allocated new device successfully, returning. world0: add 33:33:00:00:00:01 mcast address to master interface world0: add 01:00:5e:00:00:01 mcast address to master interface ADDRCONF(NETDEV_UP): world0: link is not ready tg3: net0: Link is up at 1000 Mbps, full duplex. tg3: net0: Flow control is on for TX and on for RX. ADDRCONF(NETDEV_CHANGE): net0: link becomes ready Propagating NETDEV_CHANGE for device net0... ... to wfi0 vlan_transfer_operstate: net0 state transition applies to wfi0 too: ...found a carrier, applying to VLAN device ... 
to world0
vlan_transfer_operstate: net0 state transition applies to world0 too:
...found a carrier, applying to VLAN device
lan: port 1(net0) entering listening state
ADDRCONF(NETDEV_CHANGE): wfi0: link becomes ready
wfi0: dev_set_promiscuity(master, 1)
wfi0: add 33:33:ff:5f:60:92 mcast address to master interface
lan: port 2(wfi0) entering listening state
ADDRCONF(NETDEV_CHANGE): world0: link becomes ready
world0: add 33:33:ff:91:e2:4c mcast address to master interface
lan: no IPv6 routers present
world: no IPv6 routers present
net0: no IPv6 routers present
world0: no IPv6 routers present
wfi0: no IPv6 routers present
lan: port 1(net0) entering learning state
lan: port 2(wfi0) entering learning state
lan: topology change detected, propagating
lan: port 1(net0) entering forwarding state
lan: topology change detected, propagating
lan: port 2(wfi0) entering forwarding state

This bit is for "ifdown -a -v -i interfaces":

Propagating NETDEV_DOWN for device net0...
... to wfi0
wfi0: del 33:33:ff:5f:60:92 mcast address from vlan interface
wfi0: del 33:33:ff:5f:60:92 mcast address from master interface
wfi0: del 01:00:5e:00:00:01 mcast address from vlan interface
wfi0: del 01:00:5e:00:00:01 mcast address from master interface
wfi0: del 33:33:00:00:00:01 mcast address from vlan interface
wfi0: del 33:33:00:00:00:01 mcast address from master interface
lan: port 2(wfi0) entering disabled state
... to world0
world0
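The ordering problem I described above can be illustrated with a small sketch (Python, purely illustrative and not kernel code; the interface names and the "runs on top of" relation are my reading of the configuration in this thread): a safe "ifdown -a" would take interfaces down in reverse dependency order, with the bridge first and the physical NIC last, rather than in bring-up order.

```python
# Illustrative sketch of safe interface teardown ordering.
# The dependency map below is an assumption based on the config
# described in this thread (bridge "lan" atop a VLAN and a NIC).
DEPENDS_ON = {
    "lan":    ["net0", "wfi0"],   # bridge atop physical port + VLAN
    "wfi0":   ["net0"],           # VLAN net0.2, renamed wfi0
    "world0": ["net0"],           # VLAN net0.4094, renamed world0
    "net0":   [],                 # physical tg3 NIC
}

def safe_teardown_order(deps):
    """Return an ordering where every interface comes down before
    anything it depends on (bridge first, physical NIC last)."""
    order, seen = [], set()
    def visit(iface):
        if iface in seen:
            return
        seen.add(iface)
        # First take down everything stacked *on top of* this one.
        for other, below in deps.items():
            if iface in below:
                visit(other)
        order.append(iface)
    for iface in deps:
        visit(iface)
    return order

print(safe_teardown_order(DEPENDS_ON))
# e.g. ['lan', 'wfi0', 'world0', 'net0']
```

The hang above happens in exactly the opposite situation: the VLAN is deleted while the bridge still holds it as a port.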
[BUG][debian-2.6.20-1-686] bridging + vlans + vconfig rem == stuck kernel
script in charge of disassembling VLAN interfaces. There is an equivalent zz-km-bridge script for bridge interfaces, as well as if-pre-up.d scripts called 00-km-vlan and 00-km-bridge to create the interfaces.

If anyone has any suggestions, patches, or debugging tips I'm very interested to hear from you. Thanks! Cheers, Kyle Moffett

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG][debian-2.6.20-1-686] bridging + vlans + vconfig rem == stuck kernel
On May 10, 2007, at 00:25:54, Ben Greear wrote:
Kyle Moffett wrote:

vconfig D 83CCD8CE 0 16564 16562 (NOTLB)
 efdd7e7c 0086 ee120afb 83ccd8ce 98f00788 b7083ffa 5384b49a c76c0b05
 9ebaf791 0004 efdd7e4e 0007 f1468a90 2ab74174 0362 0326
 f1468b9c c180e420 0001 0286 c012933c efdd7e8c df98a000 c180e468
Call Trace:
 [c012933c] lock_timer_base+0x15/0x2f
 [c0129445] __mod_timer+0x91/0x9b
 [c02988f5] schedule_timeout+0x70/0x8d
 [f8b75209] vlan_device_event+0x13/0xf8 [8021q]

Looks like a deadlock in the vlan code. Any chance you can run this test with lockdep enabled? You could also add a printk in vlan_device_event() to check which event it is hanging on, and the netdevice that is passed in.

Ok, I'll try building a 2.6.21 kernel with lockdep and some debugging printk()s in the vlan_device_event() function and get back to you tomorrow. Thanks for the quick response!

Since the vlan code holds RTNL at this point, most other network tasks will eventually hang as well.

Well, it's less of an "eventually" and more of an "almost immediately". When that happens, pretty much everything more complicated than socket(), bind(), and connect() with straightforward UNIX or INET sockets tends to stick completely. Thanks again! Cheers, Kyle Moffett
Re: [PATCH 0/5] [RFC] AF_RXRPC socket family implementation [try #2]
On Mar 16, 2007, at 10:11:41, Alan Cox wrote:

I know what they are; and I don't think that what's available covers it.

"and use a proper standard socket type." Assuming that that list is exhaustive...

SOCK_RDM seems to match perfectly well here. The point isn't to enumerate everything in the universe; the point is to find "works like" parallels good enough to avoid more special casing.

IMHO the problem with classifying RxRPC as a reliable datagram socket is that even an atomic unidirectional communication isn't a single datagram, it's at least 3; there is shared connection state and security context on both sides which pertains to a collection of independent and possibly simultaneous RxRPC calls. From the digging around that I did in the kernel socket code a while ago, I don't see a cleaner way of implementing it than a new SOCK_RXRPC. Cheers, Kyle Moffett
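To make the "at least 3 datagrams" point concrete, here is a toy model (Python; this is not the kernel AF_RXRPC API, and all class and field names are invented for illustration): even the smallest RxRPC call is a request packet, a reply packet, and a final ack, all sharing per-connection security state that outlives any individual call — which is exactly what a single SOCK_RDM datagram cannot express.

```python
# Toy model of why one "atomic" RxRPC operation is several datagrams
# sharing connection state, not one SOCK_RDM-style datagram.
# All names here are invented for this sketch.
from dataclasses import dataclass, field

@dataclass
class Connection:
    """Shared security context spanning many simultaneous calls."""
    peer: str
    security: str = "rxkad"      # negotiated per-connection, not per-packet
    calls: dict = field(default_factory=dict)

    def call(self, call_id, request):
        """Run one minimal call; return the packets it puts on the wire."""
        wire = [
            ("DATA", call_id, request),   # 1: the request
            ("DATA", call_id, "reply"),   # 2: the reply
            ("ACK",  call_id, "final"),   # 3: the final ack
        ]
        self.calls[call_id] = "complete"
        return wire

conn = Connection(peer="afs.example.org")
packets = conn.call(1, "fetch-status")
print(len(packets))   # a minimal call is already 3 packets
```

Multiple calls can then run concurrently over the same `conn`, all covered by the one security context — state that has no natural home in a plain datagram socket type.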
Re: [2.6 patch] the scheduled removal of the frame diverter
On Nov 13, 2006, at 16:04:25, Adrian Bunk wrote:

This patch contains the scheduled removal of the frame diverter. [snip]

-config NET_DIVERT
-	bool "Frame Diverter (EXPERIMENTAL)"
-	depends on EXPERIMENTAL && BROKEN
-	---help---
-	  The Frame Diverter allows you to divert packets from the
-	  network, that are not aimed at the interface receiving it (in
-	  promisc. mode). Typically, a Linux box setup as an Ethernet bridge
-	  with the Frames Diverter on, can do some *really* transparent www
-	  caching using a Squid proxy for example.

From my understanding of iptables/ebtables, identical functionality is already available within that framework; as such, this patch is just removing broken, experimental, and redundant code. The iptables code also properly handles IPv6 and all the other old warts of the frame diverter as well. I agree that this should go. Cheers, Kyle Moffett
Re: wireless: recap of current issues (configuration)
On Jan 17, 2006, at 13:41, Stuffed Crust wrote:
On Mon, Jan 16, 2006 at 10:24:41PM +, Alan Cox wrote:

If I have told my equipment to obey UK law I expect it to do so. If I hop on the train to France and forget to revise my configuration I'd prefer it also believed the APs.

It's not that you might forget to revise your configuration, but that the vast majority of users will not revise anything, and still expect things to just work. Kind of like multi-band cell phones.

Alan's point is still very valid. From a power-user point of view, if I specifically tell my wireless client "you must obey US laws", and then I wander past a broken imported AP, I don't want my client to _expand_ its allowable range.

IMHO, userspace should be able to forcibly restrict wireless frequencies to a certain regdomain (or leave it unrestricted and passive-scan-only), and specify how AP/configured regdomains interact. Given the range of possibilities, I think that a userspace daemon monitoring events and dynamically configuring the usable frequencies would be best. That way the userspace daemon could be configured to ignore APs, union/intersect the APs with the configured regdomain, ignore the configured regdomain in the presence of APs, etc. Cheers, Kyle Moffett

-- I lost interest in blade servers when I found they didn't throw knives at people who weren't supposed to be in your machine room. -- Anthony de Boer
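The union/intersect policies I'm describing are simple set operations. A minimal sketch of what such a userspace daemon might compute (Python; the channel numbers and policy names are invented for this example, not any real regulatory data):

```python
# Illustrative sketch of the regdomain policy logic described above.
# Channel sets and policy names are invented; a real daemon would
# take these from its configuration and from scan results.
CONFIGURED    = set(range(1, 12))   # e.g. a US-style 2.4GHz set, ch 1-11
AP_ADVERTISED = set(range(1, 14))   # broken/imported AP advertising 1-13

def usable_channels(policy, configured, ap):
    """Combine the user-configured regdomain with AP-advertised channels."""
    if policy == "ignore-aps":
        return set(configured)
    if policy == "intersect":       # never *expand* past what the user allowed
        return configured & ap
    if policy == "union":           # trust APs to add channels
        return configured | ap
    if policy == "prefer-aps":      # configured regdomain ignored near an AP
        return set(ap) if ap else set(configured)
    raise ValueError(policy)

# A power user who said "obey US laws" must not gain channels 12-13
# just because a broken imported AP advertises them:
print(sorted(usable_channels("intersect", CONFIGURED, AP_ADVERTISED)))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

The key property is that "intersect" can only ever shrink the usable set, which is exactly the don't-expand behavior wanted above.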
[RFC] Fine-grained memory priorities and PI
On Dec 15, 2005, at 03:21, David S. Miller wrote:

Not when we run out, but rather when we reach some low water mark, the critical sockets would still use GFP_ATOMIC memory but only critical sockets would be allowed to do so. But even this has faults; consider the IPSEC scenario I mentioned, and this applies to any kind of encapsulation actually; even simple tunneling examples can be concocted which make the critical socket idea fail. The knee-jerk reaction is mark IPSEC's sockets critical, and mark the tunneling allocations critical, and... and... well, you have GFP_ATOMIC then, my friend. In short, these separate page pool and critical socket ideas do not work and we need a different solution. I'm sorry folks spent so much time on them, but they are heavily flawed.

What we really need in the kernel is a more fine-grained memory priority system with PI, similar in concept to what's being done to the scheduler in some of the RT patchsets. Currently we have a very black-and-white memory subsystem; when we go OOM, we just start killing processes until we are no longer OOM. Perhaps we should have some way to pass memory allocation priorities throughout the kernel, including a "this request has X priority", a "this request will help free up X pages of RAM", and a "drop while dirty under certain OOM conditions to free X memory using this method".

The initial benefit would be that OOM handling would become more reliable and less of a special case. When we start to run low on free pages, it might be OK to kill the [EMAIL PROTECTED] process long before we OOM, if such action might prevent the OOM. Likewise, you might be able to flag certain file pages as being less critical, such that the kernel can kill a process and drop its dirty pages for files in /tmp. Or the kernel might do a variety of other things just by failing new allocations with low priority and forcing existing allocations with low priority to go away using preregistered handlers.
When processes request memory through any subsystem, their memory priority would be passed through the kernel layers to the allocator, along with any associated information about how to free the memory in a low-memory condition. As a result, I could configure my database to have a much higher priority than [EMAIL PROTECTED] (or boinc or whatever), so that when the database server wants to fill memory with clean DB cache pages, the kernel will kill [EMAIL PROTECTED] for its memory, even if we could just leave some DB cache pages unfaulted.

Questions? Comments? "This is a terrible idea that should never have seen the light of day"? Both constructive and destructive criticism welcomed! (Just please keep the language clean! :-D) Cheers, Kyle Moffett

-- Q: Why do programmers confuse Halloween and Christmas? A: Because OCT 31 == DEC 25.
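Here's a minimal sketch of the reclaim policy I have in mind (Python, purely a model — the priority values, owner names, page counts, and handler names are all invented for illustration, not kernel interfaces): when an allocation at priority P cannot be satisfied, reclaim from registered allocations below P, preferring lossless free functions (drop clean pages) over lossy ones (drop dirty data, kill the owner).

```python
# Illustrative model of priority-ordered reclaim; all numbers and
# names below are invented for this sketch.
from dataclasses import dataclass

@dataclass
class Allocation:
    owner: str
    priority: int    # higher = more important
    pages: int
    handler: str     # preregistered free method: "drop-clean" is
                     # lossless; "drop-dirty" and "kill" lose data
    lossless: bool

def reclaim(allocations, needed_pages, requester_priority):
    """Free pages from allocations below the requester's priority:
    lossless victims first, then lowest priority first."""
    victims = sorted(
        (a for a in allocations if a.priority < requester_priority),
        key=lambda a: (not a.lossless, a.priority))
    freed, chosen = 0, []
    for a in victims:
        if freed >= needed_pages:
            break
        freed += a.pages
        chosen.append(a.owner)
    return freed, chosen

allocs = [
    Allocation("seti",      priority=1, pages=300, handler="kill",       lossless=False),
    Allocation("tmpfiles",  priority=2, pages=200, handler="drop-dirty", lossless=False),
    Allocation("pagecache", priority=3, pages=400, handler="drop-clean", lossless=True),
    Allocation("database",  priority=9, pages=800, handler="keep",       lossless=False),
]
# The high-priority database wants 350 pages of DB cache:
print(reclaim(allocs, 350, requester_priority=9))
# (400, ['pagecache'])
```

With a larger request the model falls through to the lossy handlers (killing the low-priority compute job before touching anything the database owns), which is the generalized-OOM behavior described above.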
Re: [RFC] Fine-grained memory priorities and PI
On Dec 15, 2005, at 07:45, Con Kolivas wrote:

I have some basic process-that-called-the-memory-allocator link in the -ck tree already which alters how aggressively memory is reclaimed according to priority. It does not affect out-of-memory management, but that could be added to said algorithm; however, I don't see much point at the moment, since OOM is still an uncommon condition but regular memory allocation is routine.

My thought would be to generalize the two special cases (writeback of dirty pages or dropping of clean pages under memory pressure, and OOM) into the same general case. When you are trying to free up pages, it may be permissible to drop dirty mbox pages and kill the postfix process writing them in order to satisfy allocations for the mission-critical database server. (Or maybe it's the other way around.) If a large chunk of the allocated pages have priorities and lossless/lossy free functions, then the kernel can be much more flexible and configurable about what to do when running low on RAM. Cheers, Kyle Moffett