Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
On Mon, 19 Mar 2007, Linus Torvalds wrote: So *please* don't believe that you can make it as cheap to have some automatic fixup of two sequences, one inlined and one as a call. It may look so when you look at the single instruction generated, but you're ignoring all the instructions *around* the site. Side note, you can certainly fix things like this at least in theory, but it requires: - the call instruction that is used instead of the inlining should basically have no callee-clobbers. Any paravirt_ops called this way should have a special calling sequence where they simple save and restore all the registers they use. This is usually not that bad. Just create a per-architecture wrapper function that saves/restores anything that the C calling convention on that architecture says is clobbered by calls. - if the function has arguments, and the inlined sequence can take the arguments in arbitrary registers, you are going to penalize the inlined sequence anyway (by forcing some fixed arbitrary register allocation policy).This thing is largely unfixable without some really extreme compiler games (like post-processing the assembler output and having different entry-points depending on where the arguments are..) .. it will obviously depend on how thngs are done whether these things are useful or not. But it does mean that it's always a good idea to just have a config option of turn off all the paravirt crap, because it *does* add overhead, and replacing instructions on the fly doesn't make that overhead go away. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
Eric W. Biederman wrote: Rusty Russell [EMAIL PROTECTED] writes: On Sun, 2007-03-18 at 13:08 +0100, Andi Kleen wrote: The idea is _NOT_ that you go look for references to the paravirt_ops members structure, that would be stupid and you wouldn't be able to use the most efficient addressing mode on a given cpu, you'd be patching up indirect calls and crap like that. Just say no... That wouldn't handle inlines though. At least some of the current paravirtops like cli/sti are critical enough to require inlining. Well, we'd patch the inline over the call if we have room. Magic patching would be neat, but the downsides are that (1) we can't expand the patching room and (2) there's no way of attaching clobber info to the call site (doing register liveness analysis is not appealing). True. You can use all of the call clobbered registers. The trouble is that there are places where we want to intercept things like sti/cli in entry.S in a context where all the registers are in use, so we can't generally make the assumption that caller-saved regs are clobberable. For any call-site in compiler-generated code we can make that assumption, of course. Now, this may not be fatal. 5 bytes is enough for all the native ops to be patched inline. For lguest this covers popf and pushf, but not cli and sti (10 bytes): they'd have to be calls. As for clobber info, it turns out that almost all of the calls can clobber %eax, which is probably enough. We just need to mark the handful of asm ones where this isn't true. I guess if the code is larger than a function call I'm failing to see the disadvantage in making it a direct function call. Any modern processor ought to be able to predict it perfectly, and processors like the P4 may even optimize the call out of their L1 instruction cache. For current processors, I think i-cache pollution is really the only downside to a call. But maybe you could put multiple pv-op functions into a cacheline if they're used in close proximity (cli/sti/save_fl/restore_fl would all fit into a single cacheline). If what David is suggesting works, making all of these direct calls looks easy and very maintainable. At which point patching instructions inline is quite possibly overkill. I spent some time looking at what it would take to get this going. It would probably look something like: 1. make CONFIG_PARAVIRT select CONFIG_RELOCATABLE 2. generate vmlinux 3. extract relocs, and process the references to paravirt symbols into a table 4. compile and link that table into the kernel again (like the ksyms dance) 5. at boot time, use that table to redirect calls/references into the paravirt backend to the appropriate one I think if we store the reloc references as section-relative, and the reloc table itself is linked into its own section, then there's no need for multipass linking to combine the reloc table into the kernel image, since adding that table won't affect any existing reloc. All this is doable; I'd probably end up hacking boot/compressed/relocs.c to generate the appropriate reloc table. My main concern is hacking the kernel build process itself; I'm unsure of what it would actually take to implement all this. Is it truly critical to inline any of these instructions? Possibly not, but I'd like to be able to say with confidence that running a PARAVIRT kernel on bare hardware has no performance loss compared to running a !PARAVIRT kernel. There's the case of small instruction sequences which have been replaced with calls (such as sti/cli/push;popf/etc), and also the case of hooks which are noops on native (like make_pte/pte_val, etc). In those cases it would be nice to patch in a replacement, either a real instruction or just nops. J - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] [RFC] AF_RXRPC socket family implementation [try #2]
From: David Howells [EMAIL PROTECTED] Date: Mon, 19 Mar 2007 15:41:38 + Alan Cox [EMAIL PROTECTED] wrote: So use SOCK_DGRAM, its clearly near enough. No, it's not. SOCK_DGRAM is an unreliable, unidirectional datagram passing service. David we're not looking for a precise match, so please stop discussing this in those terms. We're looking for something close enough. The more and more I read Alan's arguments and your resistence to his logic, the more I side with Alan. He's definitely right on all the basic counts here. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
Linus Torvalds wrote: On Mon, 19 Mar 2007, Eric W. Biederman wrote: True. You can use all of the call clobbered registers. Quite often, the biggest single win of inlining is not so much the code size (although if done right, that will be smaller too), but the fact that inlining DOES NOT CLOBBER AS MANY REGISTERS! Yes, that's something that we've been conscious of when designing the pv_ops patching mechanism. As it stands, each time you emit a call to a pv-ops operation, it not only stores the start and end of the patchable area, but also which registers are available for clobbering at that callsite by the patched-in code. For things like sti/cli, where the surrounding code expects nothing to be clobbered, we make sure that the regs are preserved on the pv-ops side, even though there's a call to a normal C function in the middle. That gives in the pvops backend the flexibility to patch over that site with either some inline code or a call out to some function which doesn't necessarily clobber the full caller-save set (or even any registers). In short: people here seem to think that inlining is about avoiding the call/ret sequence. Not so. The real advantages of inlining are elsewhere. Yes. Probably the biggest win of inlining is constant propagation into the inline function, but register-use flexibility is good small-scale win (vs the micro-scale call/ret elimination). Side note, you can certainly fix things like this at least in theory, but it requires: - the call instruction that is used instead of the inlining should basically have no callee-clobbers. Any paravirt_ops called this way should have a special calling sequence where they simple save and restore all the registers they use. This is usually not that bad. Just create a per-architecture wrapper function that saves/restores anything that the C calling convention on that architecture says is clobbered by calls. Most of the places we intercept are normal C calls anyway, so this isn't a big issue. Its mainly the places where we replace single instructions that this will have a big effect. The trouble with this is that we're back to having to have special wrappers around each call to hide the normal C calling convention from the compiler, so that it knows that it has more registers to play with. This was the main complaint about the original version of the patch, because it all looks very ugly. - if the function has arguments, and the inlined sequence can take the arguments in arbitrary registers, you are going to penalize the inlined sequence anyway (by forcing some fixed arbitrary register allocation policy).This thing is largely unfixable without some really extreme compiler games (like post-processing the assembler output and having different entry-points depending on where the arguments are..) Yeah, that doesn't sound like much fun. I think using the normal regparm calling convention will be OK. Aside from some slightly longer instruction encodings, all the registers are more or less interchangeable anyway. .. it will obviously depend on how thngs are done whether these things are useful or not. But it does mean that it's always a good idea to just have a config option of turn off all the paravirt crap, because it *does* add overhead, and replacing instructions on the fly doesn't make that overhead go away. Yes, there's a big switch to turn all this off. It would be nice if we could get things to the point that it isn't necessary to have (ie, running on bare hardware really is indistinguishable either way), but we're a fair way from being able to prove that. In the meantime, the goal is to try to keep the source-level changes a local as possible so that maintaining the CONFIG_PARAVIRT vs non-PARAVIRT distinction is straightforward. J - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
From: Jeremy Fitzhardinge [EMAIL PROTECTED] Date: Mon, 19 Mar 2007 12:10:08 -0700 All this is doable; I'd probably end up hacking boot/compressed/relocs.c to generate the appropriate reloc table. My main concern is hacking the kernel build process itself; I'm unsure of what it would actually take to implement all this. 32-bit Sparc's btfixup might be usable as a guide. Another point worth making is that for function calls you can fix things up lazily if you want. So you link, build the reloc tables, then link in a *.o file that does provide the functions in the form of stubs. The stubs intercept the call, and patch the callsite, then revector to the real handler. I don't like this idea actually because it essentially means you either: 1) Only allow one setting of the operations OR 2) Need to have code which walks the whole reloc table anyways to handle settings after the first so you can revector everyone back to the stubs and lazy reloc properly again In fact forget I mentioned this idea :) As another note, I do agree with Linus about the register usage arguments. It is important. I think it's been mentioned but what you could do is save nothing (so that sti and cli are just that and cost nothing), but the more complicated versions save and restore enough registers to operate. It all depends upon what you're trying to do. For example, it's easy to use patching to make different PTE layouts be supportable in the same kernel image. We do this on sparc64 since sun4v has a different PTE layout than sun4u, you can see the code in asm-sparc64/pgtable.h for details (search for sun4v_*_patch) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
David Miller wrote: Another point worth making is that for function calls you can fix things up lazily if you want. [...] In fact forget I mentioned this idea :) OK :) I think we'll only ever want to bind to a hypervisor once, since the underlying hypervisor can't change on the fly (well, in principle it might if you migrate, but you'll have more problems than just dealing with the hook points). And lazy binding doesn't buy much; I think its much better to get it all out of the way at once, and then throw the reloc info away. As another note, I do agree with Linus about the register usage arguments. It is important. I think it's been mentioned but what you could do is save nothing (so that sti and cli are just that and cost nothing), but the more complicated versions save and restore enough registers to operate. Right, that's pretty much what we do now. It all depends upon what you're trying to do. For example, it's easy to use patching to make different PTE layouts be supportable in the same kernel image. We do this on sparc64 since sun4v has a different PTE layout than sun4u, you can see the code in asm-sparc64/pgtable.h for details (search for sun4v_*_patch) I see. We want to do something conceptually like this, but we need to handle more than just adjusting which of two constants to use as a mask. For example, Xen needs to run pfns through a pfn-mfn (machine frame number) conversion and back when making/unpacking pagetable entries. J - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 24/31] IrDA: irttp_dup spin_lock initialisation
-stable review patch. If anyone has any objections, please let us know. -- From: Samuel Ortiz [EMAIL PROTECTED] Without this initialization one gets kernel BUG at kernel/rtmutex_common.h:80! This patch should also be included in the -stable kernel. Signed-off-by: G. Liakhovetski [EMAIL PROTECTED] Signed-off-by: Samuel Ortiz [EMAIL PROTECTED] Cc: David Miller [EMAIL PROTECTED] Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED] --- net/irda/irttp.c |1 + 1 file changed, 1 insertion(+) --- a/net/irda/irttp.c +++ b/net/irda/irttp.c @@ -1455,6 +1455,7 @@ struct tsap_cb *irttp_dup(struct tsap_cb /* Not everything should be copied */ new-notify.instance = instance; + spin_lock_init(new-lock); init_timer(new-todo_timer); skb_queue_head_init(new-rx_queue); -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] NET: Add packet sock option to return orig_dev to userspace when bonded
I got it the first time, it's just very low priority and very deep in my backlog and I'm also about to be away for 3 days. I have it so you don't need to keep resending it. Thanks. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
Possibly not, but I'd like to be able to say with confidence that running a PARAVIRT kernel on bare hardware has no performance loss compared to running a !PARAVIRT kernel. There's the case of small instruction sequences which have been replaced with calls (such as sti/cli/push;popf/etc), My guess is that most critical pushf/popf are in spin_lock_irqsave(). It would be possible to special case that one -- inline it -- and use out of line versions for all the others. -Andi - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
On Mon, 2007-03-19 at 11:38 -0700, Linus Torvalds wrote: On Mon, 19 Mar 2007, Eric W. Biederman wrote: True. You can use all of the call clobbered registers. Quite often, the biggest single win of inlining is not so much the code size (although if done right, that will be smaller too), but the fact that inlining DOES NOT CLOBBER AS MANY REGISTERS! Thanks Linus. *This* was the reason that the current hand-coded calls only clobber % eax. It was a compromise between native (no clobbers) and others (might need a reg). Now, since we decided to allow paravirt_ops operations to be normal C (ie. the patching is optional and done late), we actually push and pop % ecx and %edx. This makes the call site 10 bytes long, which is a nice size for patching anyway (enough for a movl $0, addr, a-la lguest's cli, or movw $0, %gs:addr if we supported SMP). The current 6 paravirt ops which are patched cover the vast majority of calls (until the Xen patches, then we need ~4 more?). Jeremy chose to expand patching to cover *all* paravirt ops, rather than just the new hot ones, and that's where we tipped over the ugliness threshold. Cheers, Rusty. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] x25_forward_call(): fix NULL dereferences
From: Adrian Bunk [EMAIL PROTECTED] Date: Mon, 19 Mar 2007 10:24:03 +0100 This patch fixes two NULL dereferences spotted by the Coverity checker. For a better understanding, the diff -uwp output (that ignores the indentation changes) is: I'll apply this, thanks Adrian. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
Jeremy Fitzhardinge wrote: For example, say we wanted to put a general call for sti into entry.S, where its expected it won't touch any registers. In that case, we'd have a sequence like: push %eax push %ecx push %edx call paravirt_cli pop %edx pop %ecx pop %eax If we parse the relocs, then we'd find the reference to paravirt_cli. If we look at the byte before and see 0xe8, then we can see if its a call. If we then work out in each direction and see matched push/pops, then we know what registers can be trashed in the call. This also allows us to determine the callsite size, and therefore how much space we need for inlining. No, that is a very dangerous suggestion. You absolutely *cannot* do this safely without explicitly marking the start EIP of this code. You *must* use metadata to do that. It is never safe to disassemble backwards or rewind EIP for x86 code. Zach - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
Zachary Amsden wrote: Jeremy Fitzhardinge wrote: If we then work out in each direction and see matched push/pops, then we know what registers can be trashed in the call. This also allows us to determine the callsite size, and therefore how much space we need for inlining. No, that is a very dangerous suggestion. You absolutely *cannot* do this safely without explicitly marking the start EIP of this code. You *must* use metadata to do that. It is never safe to disassemble backwards or rewind EIP for x86 code. What do you mean the instruction before is mov $0x52515000,%eax? Yeah, you're right. Oh well. J - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
Rusty Russell wrote: On Mon, 2007-03-19 at 11:38 -0700, Linus Torvalds wrote: On Mon, 19 Mar 2007, Eric W. Biederman wrote: True. You can use all of the call clobbered registers. Quite often, the biggest single win of inlining is not so much the code size (although if done right, that will be smaller too), but the fact that inlining DOES NOT CLOBBER AS MANY REGISTERS! For VMI, the default clobber was cc, and you need a way to allow at least that, because saving and restoring flags is too expensive on x86. Thanks Linus. *This* was the reason that the current hand-coded calls only clobber % eax. It was a compromise between native (no clobbers) and others (might need a reg). I still don't think this was a good trade. The primary motivation for clobbering %eax was that Xen wanted a free register to use for computing the offset into the shared data in the case of SMP preemptible kernels. Xen no longer needs such a register, they can use the PDA offset instead. And it does hurt native performance by unconditionally stealing a register in the four most commonly invoked paravirt-ops code sequences. Now, since we decided to allow paravirt_ops operations to be normal C (ie. the patching is optional and done late), we actually push and pop % ecx and %edx. This makes the call site 10 bytes long, which is a nice size for patching anyway (enough for a movl $0, addr, a-la lguest's cli, or movw $0, %gs:addr if we supported SMP). You can do it in 11 bytes with no clobbers and normal C semantics by linking to a direct address instead of calling to an indirect, but then you need some gross fixup technology in paravirt_patch: if (call_addr == (void*)native_sti) { ... } I think we should probably try to do it in 12 bytes. Freeing eax to the inline caller is likely to make up the 2 bytes of space more we have to nop. One thing I always tried to get in VMI was to encapsulate the actual code which went through the business of computing arguments that were not even used in the native case. Unfortunately, that seems impossible in the current design, but I don't think it is an issue because I don't think there is actually a way to express: SWITCHABLE_CODE_BLOCK_BEGIN { /* arbitrary C code for native */ } SWITCHABLE_CODE_BLOCK_ALTERNATIVE { /* arbitrary C code for something else */ } Dave's linker suggestion is probably the best for things like that. Zach - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [6/6] 2.6.21-rc4: known regressions
From: Adrian Bunk [EMAIL PROTECTED] Date: Sun, 18 Mar 2007 19:49:38 +0100 Subject: ipv6 crash References : http://lkml.org/lkml/2007/3/10/2 Submitter : Len Brown [EMAIL PROTECTED] Status : unknown This is caused by some problem in the router round-robin code in net/ipv6/route.c:rt6_select() Somehow it NULLs out fn-leaf, and then fib6_add_1() crashes dererencing that NULL pointer as is seen in the report. Deleting the router round-robin list mangling code in rt6_select() makes the crash go away, but such a change causes regressions in the ipv6 conformance tests. Thomas Graf discovered this bug some time ago, but we still haven't come up with a fix suitable for upstream :-/ This bug has been there for a very long time and is not a regression of 2.6.21 I'll see if I can come up with something to fix this properly. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.22 3/3] Add LED trigger to libata core
Tony Vroon wrote: The first user of ata_ac_issue_prot_with_ledtrigger, the ServerWorks Frodo/ Apple K2 driver. Used by the IDE LED trigger on G5 towers. Respin of an earlier patch, based on comments by Tejun Heo Alan Cox. Just two comments. 1. IMHO, ata_qc_issue_prot_ledtrigger() without 'with' is good enough. This is just my personal preference. Feel free to ignore it. 2. Patch #1 and #2 should be merged. They're one logical change of adding ata_qc_issue_prot_with_ledtrigger(). Patch #3 is a logically separate change of using it, but unless it's a wide conversion, implementing something and using something can be merged. So, please merge #1 and #2 and possibly #3. Thanks. -- tejun - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[GIT PATCH] USB fixes for 2.6.21-rc4
Here are some USB fixes and new device ids against 2.6.21-rc4. These patches contain a fix for a regression that is on Adrian's list with regards to usb-serial drivers oopsing, and they also add a number of different device ids and other minor updates. All of these have been in the -mm releases. Please pull from: master.kernel.org:/pub/scm/linux/kernel/git/gregkh/usb-2.6.git/ The full patches will be sent to the linux-usb-devel mailing list, if anyone wants to see them. thanks, greg k-h drivers/usb/class/usblp.c |6 +- drivers/usb/misc/berry_charge.c|2 +- drivers/usb/net/dm9601.c |8 drivers/usb/serial/airprime.c |4 drivers/usb/serial/mos7720.c |1 + drivers/usb/serial/option.c| 15 +-- drivers/usb/serial/usb-serial.c| 11 +-- drivers/usb/storage/unusual_devs.h | 10 ++ 8 files changed, 43 insertions(+), 14 deletions(-) --- Alan Stern (1): usblp: quirk flag and device entry for Seiko Epson M129C printer Greg Kroah-Hartman (1): USB: new Novatel device ids for option driver Jim Radford (1): USB: fix usb-serial regression Jon Dowland (1): USB: two more device ids for dm9601 usbnet driver Ken L Johnson (1): USB: berry_charge: correct dbg string for second magic command Mark Glines (1): airprime: USB ID for Novatel EV620 mini PCI-E card Oliver Neukum (1): USB: necessary update for mos7720 driver Pete Zaitcev (1): USB: RAZR v3i unusual_devs - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 03/31] Fix user copy length in ipv6_sockglue.c
On Mon, Mar 19, 2007 at 03:01:25PM -0700, Chris Wright wrote: * Greg KH ([EMAIL PROTECTED]) wrote: From: Chris Wright [EMAIL PROTECTED] [IPV6] fix ipv6_getsockopt_sticky copy_to_user leak User supplied len 0 can cause leak of kernel memory. Use unsigned compare instead. You can drop this one. It's dependent on a patch that is not in 2.6.20. Ok, thanks for letting me know, it is now dropped. greg k-h - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [stable] [patch 29/31] Input: i8042 - fix AUX IRQ delivery check
On Mon, Mar 19, 2007 at 05:48:55PM -0400, Dmitry Torokhov wrote: On 3/19/07, Greg KH [EMAIL PROTECTED] wrote: -stable review patch. If anyone has any objections, please let us know. -- From: Dmitry Torokhov [EMAIL PROTECTED] Input: i8042 - fix AUX IRQ delivery check On boxes that do not implement AUX LOOP command we can not verify AUX IRQ delivery and must assume that it is wired properly. Greg, There is another piece missing in AUX delivery test, commit 3ca5de6dd4ec5a139b2b8f00dce3e4726ca91af1 Unfortunately I can't send you a patch at the moment but if you could get it from the mainline that would be great. Thanks for letting me know, I've added it to the queue now. greg k-h - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sysctl: vfs_cache_divisor
Randy Dunlap wrote: The we duplicate all the relevant /proc knobs: cat /proc/sys/vm/dirty_ratio 30 cat /proc/sys/vm/hires-dirty_ratio/ 30 Or we do something else ;) Sounds better. I wasn't very keen on the userspace interface that this exposed. Will look at those. Okay... may be I could throw a spanner in the machinery, and suggest another option: perhaps we should add a way to do sysctl which can handle fractional (fixed-point) values... more coherent/detailed message tomorrow. -hpa - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote: Arjan van de Ven [EMAIL PROTECTED] writes: well we can do the handshake to take ownership like we do much later in boot, but that requires PCI to be there and fully discovered, which we don't have this early. That's not true - we do early pci discovery. Doing USB handsoff there would be quite possible. What, we don't do USB handoff early enough in the boot process? It's happening at PCI quirk time now, which I think should be early enough for everyone (and too early for some who rely on USB keyboards and initramfs shells...) thanks, greg k-h - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [QUICKLIST 1/5] Quicklists for page table pages V3
Christoph Lameter writes: +static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *)) +{ ... + p = (void *)__get_free_page(flags | __GFP_ZERO); This will cause problems on 64-bit powerpc, at least with 4k pages, since the pmd and pgd levels only use 1/4 of a page. Paul. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.21-rc4-mm1
Temporarily at http://userweb.kernel.org/~akpm/2.6.21-rc4-mm1/ Will appear later at ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc4/2.6.21-rc4-mm1/ - Restored the RSDL CPU scheduler (a new version thereof) Boilerplate: - See the `hot-fixes' directory for any important updates to this patchset. - To fetch an -mm tree using git, use (for example) git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git tag v2.6.16-rc2-mm1 git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1 - -mm kernel commit activity can be reviewed by subscribing to the mm-commits mailing list. echo subscribe mm-commits | mail [EMAIL PROTECTED] - If you hit a bug in -mm and it is not obvious which patch caused it, it is most valuable if you can perform a bisection search to identify which patch introduced the bug. Instructions for this process are at http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt But beware that this process takes some time (around ten rebuilds and reboots), so consider reporting the bug first and if we cannot immediately identify the faulty patch, then perform the bisection search. - When reporting bugs, please try to Cc: the relevant maintainer and mailing list on any email. - When reporting bugs in this kernel via email, please also rewrite the email Subject: in some manner to reflect the nature of the bug. Some developers filter by Subject: when looking for messages to read. - Occasional snapshots of the -mm lineup are uploaded to ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/ and are announced on the mm-commits list. Changes since 2.6.21-rc3-mm1: origin.patch git-acpi.patch git-alsa.patch git-arm-master.patch git-arm.patch git-avr32.patch git-cifs.patch git-cpufreq.patch git-powerpc.patch git-drm.patch git-dvb.patch git-gfs2-nmw.patch git-hid.patch git-ia64.patch git-ieee1394.patch git-infiniband.patch git-input.patch git-kbuild.patch git-kvm.patch git-leds.patch git-libata-all.patch git-md-accel.patch git-mmc.patch git-mtd.patch git-ubi.patch git-netdev-all.patch git-ioat.patch git-ocfs2.patch git-parisc.patch git-selinux.patch git-pciseg.patch git-s390.patch git-sh.patch git-scsi-misc.patch git-scsi-rc-fixes.patch git-unionfs.patch git-wireless.patch git-ipwireless_cs.patch git-gccbug.patch git trees -uml-hostfs-fix-double-free.patch -uml-hostfs-make-hostfs=-option-work-as-a-jail-as-intended.patch -uml-fix-a-memory-leak-in-the-multicast-driver.patch -uml-remove-dead-code-about-os_usr1_signal-and-os_usr1_process.patch -uml-mark-both-consoles-as-con_anytime.patch -uml-fix-confusion-irq-early-reenabling.patch -uml-activate_fd-return-enomem-only-when-appropriate.patch -uml-fix-errno-usage.patch -x86_64-fix-2618-regression-ptrace_oldsetoptions-should-be-accepted.patch -bluetooth-fix-socket-locking-in-hci_sock_dev_event.patch -add-epoll-compat_-code-to-fs-compatc.patch -check_partition-fix-error-check.patch -uml-arch_prctl-should-set-thread-fs.patch -connector-bugfix-for-cn_call_callback.patch -26-altix-console-fix-for-config_debug_shirq-usage.patch -ecryptfs-nested-locking-annotation.patch -swsusp-disable-nonboot-cpus-before-entering-platform-suspend.patch -paravirt-build-fixes.patch -acpi-disabled-due-to-dmi-failure-or-blacklisted-year-should-be-noted-as-is-done-with-other-acpi-blacklisting.patch -git-alsa-oops-fix.patch -avr32-dma-mappingh.patch -gregkh-driver-device-symlink.patch -gregkh-driver-platform-reorder-platform_device_del.patch -gregkh-driver-remove-devfs-from-maintainers.patch -gregkh-driver-driver-core-export-device_rename.patch -gregkh-driver-uio-irq.patch -scheduled-removal-of-sa_xxx-interrupt-flags-fixups-4.patch -make-drivers-char-drm-drm_vmcdrm_io_prot-static.patch -fix-saa7146_clipping_mem-size.patch -drivers-media-video-cpia_ppc-dont-use-_work_nar.patch -dvb-core-fix-several-locking-related-problems.patch -saa7134-fix-modules=n-compilation.patch -ivtv-warning-fix.patch -jdelvare-i2c-i2c-03-use-i2c_adapterdevparent-for-messages.patch -jdelvare-i2c-i2c-i801-restore-initial-state.patch -jdelvare-i2c-ds1374-check-for-workqueue-creation.patch -crash-on-evdev-disconnect.patch -expose-set_mode-method-so-it-can-be-wrapped.patch -ata_piix-remove-ugly-layering-violation.patch -pata_cmd640-multiple-updates.patch -ide-cmd64x-fix-recovery-time-calculation-take2.patch -mtd-maps-ck804xromc-pci_module_init-to-pci_register_driver.patch -mtd-chips-oops-in-cfi_amdstd_sync.patch -mtd-esb2-check-for-closed-rom-window.patch -dilnetpc-fix-warning.patch -mtd-correct-misspelled-preprocessor-variable.patch -git-netdev-all-ipw2200-fix.patch -mv643xx-ethernet-driver-irq-registration-fix.patch -via-rhine-set-avoid_d3-for-broken-bioses.patch -netxen-fix-warnings.patch -e1000-fix-be-ready-for-incoming-irq-at-pci_request_irq.patch -e1000-fix-firmware-handover-bits.patch -e1000-fix-stop-raw-interrupts-disabled-nag-from-rt.patch
Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded
On Tue, 20 Mar 2007 13:47:53 +1100 Nick Piggin [EMAIL PROTECTED] wrote: Andrew Morton wrote: On Mon, 19 Mar 2007 17:58:52 -0800 Andrew Morton [EMAIL PROTECTED] wrote: The kernel without Nick's patchset but with the assert runs OK too. Under the principle of mm-has-been-too-flakey-lately, I'll drop the patches: mm-debug-check-for-the-fault-vs-invalidate-race.patch mm-simplify-filemap_nopage.patch mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-tidy.patch mm-merge-nopfn-into-fault.patch mm-merge-nopfn-into-fault-fix.patch mm-remove-legacy-cruft.patch ug, too many rejects. I'll leave them in, minus mm-debug-check-for-the-fault-vs-invalidate-race.patch Hang on a sec... I'll try fixing the thing before you next make a release. Too late. hot-fixes/ awaits thee. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
On 3/16/07, Thomas Gleixner [EMAIL PROTECTED] wrote: Yes, this is probably caused by SMM code trying to emulate a PS/2 keyboard from a (maybe connected or not) USB keyboard. Unfortunately we have no way to disable this BIOS misfeature in the early boot process. https://mail.rtai.org/pipermail/rtai/2003-March/002949.html http://www.embeddedrelated.com/usenet/embedded/show/50333-1.php I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging, not to mention people trying to spec out hardware for RT applications... Lee - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 00/31] 2.6.20-stable review
On Monday 19 March 2007, Greg KH wrote: This is the start of the stable review cycle for the 2.6.20.4 release. There are 31 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let us know. If anyone is a maintainer of the proper subsystem, and wants to add a Signed-off-by: line to the patch, please respond with it. These patches are sent out with a number of different people on the Cc: line. If you wish to be a reviewer, please email [EMAIL PROTECTED] to add your name to the list. If you want to be off the reviewer list, also email us. Responses should be made by Thursday March, 22, 15:00:00 UTC. Anything received after that time might be too late. BINGO! One of these 31 patches may be the guilty party that's playing tricks with tar's mind. I'm running 2.6.20.4-rc1 on an older athlon xp2800 with a gig of ram. Amanda has gotten through the estimate phase and is now doing the backup. It will fail, out of tape. Here is an amstatus output as its running right now. coyote:/GenesAmandaHelper-0.5 3 planner: [dumps way too big, 350850 KB, must skip incremental dumps] coyote:/GenesAmandaHelper-0.6 1 planner: [dumps way too big, 184977 KB, must skip incremental dumps] coyote:/bin 1 planner: [dumps way too big, 1110 KB, must skip incremental dumps] coyote:/boot 13m wait for dumping coyote:/dev 1 planner: [dumps way too big, 290 KB, must skip incremental dumps] coyote:/etc 1 planner: [dumps way too big, 18291 KB, must skip incremental dumps] coyote:/home 0 1018m wait for dumping coyote:/lib 3 planner: [dumps way too big, 11705 KB, must skip incremental dumps] coyote:/opt 15m wait for dumping coyote:/root 3 planner: [dumps way too big, 785963 KB, must skip incremental dumps] coyote:/sbin 1 planner: [dumps way too big, 10 KB, must skip incremental dumps] coyote:/tmp 4 32m wait for dumping coyote:/usr/X11R6 12m wait for dumping coyote:/usr/bin 1 planner: [dumps way too big, 339170 KB, must skip incremental dumps] coyote:/usr/dlds 1 planner: [dumps way too big, 2140 KB, must skip incremental dumps] coyote:/usr/dlds-misc 30m wait for dumping coyote:/usr/dlds-rpms 1 planner: [dumps way too big, 3130 KB, must skip incremental dumps] coyote:/usr/dlds-tgzs 1 planner: [dumps way too big, 10 KB, must skip incremental dumps] coyote:/usr/games 00m wait for dumping coyote:/usr/include 1 planner: [dumps way too big, 10557 KB, must skip incremental dumps] coyote:/usr/kerberos 10m wait for dumping coyote:/usr/lib 1 planner: [dumps way too big, 474409 KB, must skip incremental dumps] coyote:/usr/libexec 2 planner: [dumps way too big, 11285 KB, must skip incremental dumps] coyote:/usr/local 2 279m wait for dumping coyote:/usr/man 10m wait for dumping coyote:/usr/movies2 7271m dumping 5485m ( 75.44%) (0:12:47) coyote:/usr/music 1 planner: [dumps way too big, 2448290 KB, must skip incremental dumps] coyote:/usr/pix 2 17m wait for dumping coyote:/usr/sbin 1 planner: [dumps way too big, 3254 KB, must skip incremental dumps] coyote:/usr/share 3 planner: [dumps way too big, 40514 KB, must skip incremental dumps] coyote:/usr/src 3 6822m wait for dumping coyote:/var 1 366m wait for dumping SUMMARY part real estimated size size partition : 32 estimated : 3231973m flush : 0 0m failed : 1816155m ( 50.53%) wait for dumping: 13 8547m ( 26.73%) dumping to tape : 00m ( 0.00%) dumping : 1 5485m 7271m ( 75.44%) ( 17.16%) dumped : 0 0m 0m ( 0.00%) ( 0.00%) wait for writing: 0 0m 0m ( 0.00%) ( 0.00%) wait to flush : 0 0m 0m (100.00%) ( 0.00%) writing to tape : 0 0m 0m ( 0.00%) ( 0.00%) failed to tape : 0 0m 0m ( 0.00%) ( 0.00%) taped : 0 0m 0m ( 0.00%) ( 0.00%) tape 1: 0 0m 0m ( 0.00%) Dailys-19 8 dumpers idle : not-idle taper idle network free kps: 6800 holding space : 71118m (100.00%) dumper0 busy : 0:00:00 ( 0.00%) 0 dumpers busy : 0:00:00 ( 0.00%) 1 dumper busy : 0:00:00 ( 0.00%) The directory shown on line one of this report actually has: [EMAIL PROTECTED] /]# du -h /GenesAmandaHelper-0.5/ 1.6G
Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded
Andrew Morton wrote: On Tue, 20 Mar 2007 13:47:53 +1100 Nick Piggin [EMAIL PROTECTED] wrote: Andrew Morton wrote: Hang on a sec... I'll try fixing the thing before you next make a release. Too late. hot-fixes/ awaits thee. Awww... well thanks very much Michal for reporting the bug, I reproduced it easily and it turns out to be a typo. In my testing I never had a lot of writeout going on, so most of the pages will have been truncated in the first loop... -- SUSE Labs, Novell Inc. Fix typo in do_no_page vs invalidate race fix patch. Index: linux-2.6/mm/truncate.c === --- linux-2.6.orig/mm/truncate.c +++ linux-2.6/mm/truncate.c @@ -235,7 +235,7 @@ void truncate_inode_pages_range(struct a wait_on_page_writeback(page); if (page_mapped(page)) { unmap_mapping_range(mapping, - (loff_t)page_indexPAGE_CACHE_SHIFT, + (loff_t)page-indexPAGE_CACHE_SHIFT, PAGE_CACHE_SIZE, 0); } if (page-index next)
Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded
Nick Piggin wrote: Andrew Morton wrote: On Tue, 20 Mar 2007 13:47:53 +1100 Nick Piggin [EMAIL PROTECTED] wrote: Andrew Morton wrote: Hang on a sec... I'll try fixing the thing before you next make a release. Too late. hot-fixes/ awaits thee. Awww... well thanks very much Michal for reporting the bug, I reproduced it easily and it turns out to be a typo. In my testing I never had a lot of writeout going on, so most of the pages will have been truncated in the first loop... Also, noticed another problem in the same general area. Andrew you were indeed right to question the removal of that unmap_mapping_range call, but I think even it alone it wasn't enough... -- SUSE Labs, Novell Inc. The nopage vs invalidate race fix patch did not take care of truncating private COW pages. Mind you, I'm pretty sure this was previously racy even for regular truncate, not to mention vmtruncate_range. Anyway, fix that omission. Index: linux-2.6/mm/memory.c === --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -1905,7 +1905,18 @@ int vmtruncate(struct inode * inode, lof if (IS_SWAPFILE(inode)) goto out_busy; i_size_write(inode, offset); + + /* +* unmap_mapping_range is called twice, first simply for efficiency +* so that truncate_inode_pages does fewer single-page unmaps. However +* after this first call, and before truncate_inode_pages finishes, +* it is possible for private pages to be COWed, which remain after +* truncate_inode_pages finishes, hence the second unmap_mapping_range +* call must be made for correctness. +*/ + unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1); truncate_inode_pages(mapping, offset); + unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1); goto out_truncate; do_expand: @@ -1943,7 +1954,9 @@ int vmtruncate_range(struct inode *inode mutex_lock(inode-i_mutex); down_write(inode-i_alloc_sem); + unmap_mapping_range(mapping, offset, (end - offset), 1); truncate_inode_pages_range(mapping, offset, end); + unmap_mapping_range(mapping, offset, (end - offset), 1); inode-i_op-truncate_range(inode, offset, end); up_write(inode-i_alloc_sem); mutex_unlock(inode-i_mutex);
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
On Tue, 2007-20-03 at 01:04 -0400, Lee Revell wrote: I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging, not to mention people trying to spec out hardware for RT applications... There is a SMI disabling module in RTAI, check the smi-module.c in this: https://www.rtai.org/RTAI/rtai-3.5.tar.bz2 More infos: http://www.captain.at/rtai-smi-high-latency.php http://www.captain.at/xenomai-smi-high-latency.php It might make sense to merge this code, at least in the -rt tree. - Eric - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
On Mon, Mar 19, 2007 at 09:44:28PM +0100, Blaisorblade wrote: On Sunday 18 March 2007 03:50, Nick Piggin wrote: Yes, I believe that is the case, however I wonder if that is going to be a problem for you to distinguish between write faults for clean writable ptes, and write faults for readonly ptes? I wouldn't be able to distinguish them, but am I going to get write faults for clean ptes when vma_wants_writenotify() is false (as seems to be for tmpfs)? I guess not. For tmpfs pages, clean writable PTEs are mapped as writable so they won't give any problem, since vma_wants_writenotify() is false for tmpfs. Correct? Yes, that should be the case. So would this mean that nonlinear protections don't work on regular files? They still work in most cases (including for UML), but if the initial mmap() specified PROT_WRITE, that is ignored, for pages which are not remapped via remap_file_pages(). UML uses PROT_NONE for the initial mmap, so that's no problem. But how are you going to distinguish a write fault on a readonly pte for dirty page accounting vs a read-only nonlinear protection? You can't store any more data in a present pte AFAIK, so you'd have to have some out of band data. At which point, you may as well just forget about vma_wants_writenotify vmas, considering that everybody is using shmem/ramfs. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH] split file and anonymous page queues #2
Rik van Riel wrote: Split the anonymous and file backed pages out onto their own pageout queues. This we do not unnecessarily churn through lots of anonymous pages when we do not want to swap them out anyway. This should (with additional tuning) be a great step forward in scalability, allowing Linux to run well on very large systems where scanning through the anonymous memory (on our way to the page cache memory we do want to evict) is slowing systems down significantly. This patch has been stress tested and seems to work, but has not been fine tuned or benchmarked yet. For now the swappiness parameter can be used to tweak swap aggressiveness up and down as desired, but in the long run we may want to simply measure IO cost of page cache and anonymous memory and auto-adjust. We apply pressure to each of sets of the pageout queues based on: - the size of each queue - the fraction of recently referenced pages in each queue, not counting used-once file pages - swappiness (file IO is more efficient than swap IO) Please take this patch for a spin and let me know what goes well and what goes wrong. This ignores whether a file page is mapped, doesn't it? Even so, it could be a good approach anyway. There are a couple of little nice improvements you have there, such as treating shmem pages in the same class as anon pages. We found that we needed something similar, so some of those things should go upstream on their own. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: RSDL v0.31
On Mon, Mar 19, 2007 at 08:11:55PM -0700, Linus Torvalds wrote: Quite frankly, I was *planning* on merging RSDL very early after 2.6.21, but there is one thing that has turned me completely off the whole thing: - the people involved seem to be totally unwilling to even admit there might be a problem. This is like alcoholism. If you cannot admit that you might have a problem, you'll never get anywhere. And quite frankly, the RSDL proponents seem to be in denial (we're always better, it's your problem if the old scheduler works better, just one report of old scheduler being better). And the thing is, if people aren't even _willing_ to admit that there may be issues, there's *no*way*in*hell* I will merge it even for testing. Because the whole and only point of merging RSDL was to see if it could replace the old scheduler, and the most important feature in that case is not whether it is perfect, BUT WHETHER ANYBODY IS INTERESTED IN TRYING TO FIX THE INEVITABLE PROBLEMS! Linus, you're unfair with Con. He initially was on this position, and lately worked with Mike by proposing changes to try to improve his X responsiveness. But he's ill right now and cannot touch the keyboard, so only his supporters speak for him, and as you know, speech is not code and does not fix problems. Leave him a week or so to relieve and let's see what he can propose. Hopefully a week away from the keyboard will help him think with a more general approach. Also, Mike has already modified the code a bit to get better experience. Also, while I don't agree with starting to renice X to get something usable, it seems real that there's something funny on Mike's system which makes it behave particularly strangely when combined with RSDL, because other people in comparable tests (including me) have found X perfectly smooth even with loads in the tens or even hundreds. I really suspect that we will find a bug in RSDL which triggers the problem and that this fix will help discover another problem on Mike's hardware which was not triggered by mainline. Regards, Willy - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Mon, 2007-03-19 at 20:05 -0500, Matt Mackall wrote: On Tue, Mar 20, 2007 at 01:42:46AM +0100, Thomas Gleixner wrote: On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote: This is exactly the same problem as booting on a desktop PC. But somehow LILO manages. My first Linux box had a hell of a lot less disk than the platform I bootstrapped (and wrote NAND drivers for) last month had in NAND. No, it is not. You get the absolute sector address of your second stage and this is a complete nobrainer. The translation is done in the DISK device. LILO and friends manage to boot systems that use software RAID and LVM. There are multiple methods. Some use block lists, some use tiny boot partitions, etc. All of them are applicable to controllerless NAND. Yes, by using fixed addresses, which is not what I want. You simply ignore the fact, that inside each disk, USB Stick, CF-CARD, whatever - there is a more or less intellegent controller device, which does the mapping to the physical storage location. There is _NO_ such thing on a bare FLASH chip. How many times do I have to tell you that I wrote a driver for controllerless NAND just last month? Wow. I'm impressed because I'm pulling my opinion out of thin air. How exactly does device mapper: A) across device wear levelling ? The same way UBI does, but encapsulated in a device mapper layer. Does the device mapper do that ? B) dynamic partitioning for FLASH aware file systems ? See above. Does the device mapper do that ? C) across device wear levelling for FLASH aware file systems ? See above. Look at your own drawing. D) background bit-flip corrections (copying affected blocks and recylce the old one) ? See above. Repeating patterns do not impress me. Your drawing tells otherwise E) allow position independent placement of the second stage bootloader ? See way above to my LILO response. Neither LILO nor GRUB have search capabilities for randomly located second stage loaders. You need to implement a clever journalling block device emulator in order to keep the data alive and the FLASH not weared out within no time. You need the wear levelling, otherwise you can throw away your FLASH in no time. And that's why it's in my picture. Yes, it is in your picture, but: 1) it excludes FLASH aware file systems and UBI does not. 2) your picture does still not explain how it does achive the above A), B), C), D) and E) Your extra path for partitioning(4) and JFFS2 is just a weird hack, which makes your proposal completely absurd. No, it's just there to show the flexibility of device mapper. But I have the sneaking suspicion you have no idea how device mapper works. Sigh. Layering violation == flexibility. In brief: device mapper takes one or more devices, applies a mapping to them, and returns a new device. For example, take various spans of /dev/hda1 and /dev/sda3 and present them as new-device1. Take new-device1 and transform it with dm-crypt to get new-device2. The kernel doesn't decide how to do this, any more than it decides where to mount your filesystems. Userspace does. I know how it works. But your blurb does not answer any of my questions. 5. We don't reimplement higher pieces of the stack (dm-crypt, snapshot, etc.). Why should we reimplement that ? So that you can get encryption and snapshot, etc.? 1. On top of a clever block device. 2. UBI can do snapshots by design. Oh, so you HAVE reimplemented it. No, it already works 3. Encryption should be done on the VFS layer and not below the filesystem layer. Doing it inside the block layer or the device mapper is broken by design. That's highly debatable and not a topic for this thread. I see, you define, what has to be discussed. tglx - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
On Mon, 2007-03-19 at 21:27 -0700, Greg KH wrote: On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote: Arjan van de Ven [EMAIL PROTECTED] writes: well we can do the handshake to take ownership like we do much later in boot, but that requires PCI to be there and fully discovered, which we don't have this early. That's not true - we do early pci discovery. Doing USB handsoff there would be quite possible. What, we don't do USB handoff early enough in the boot process? It's happening at PCI quirk time now, which I think should be early enough for everyone (and too early for some who rely on USB keyboards and initramfs shells...) It happens way after the CPUs are brought up. At this point both the delay loop calibration and the local APIC calibration are already done. tglx - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Sanitize filesystem NLS handling
OGAWA Hirofumi wrote: Alexander E. Patrakov [EMAIL PROTECTED] writes: But, anyway, this is a separate issue that my patch doesn't attempt to correct. The conclusion so far is that we disagree, and that there are situations where using utf8 iocharset is the least of all evils, so the warning is not justified enough. Reproducible testcase: Again, I don't care about read at all. And why don't you use utf8 option, instead of iocharset=utf8. iocharset=utf8 is warned until it is fixed. The utf8 also doesn't work correctly in some case though. Would it be OK for you if I add the mount-time check for iocharset=utf8 to the fat filesystem and silently replace this with the utf8 option, instead of overly actively warning the users? This way, the sysfs option and the nls_base.iocharset module parameter will still work as I want. I'm talking about two filesystems on a system here, not two encoding on one filesystem. I am also talking about this. Mounting two filesystems with different iocharsets is insane, because this will result in one of the following outcomes: 1) ls will show wrong characters in filenames on one of the filesystems 2) one of the two filesystems will contain wrong on-disk data for filenames, that, when misinterpreted by mounting with wrong iocharset, results in seemingly-correct output, but is misunderstood by the properly set up reference implementation (that's what is likely to happen with jfs in your example). Because you didn't change the locale. And it is your policy, right? Yes. This is because I have some files with non-ASCII names in my home directory. Changing the locale would make these filenames look wrong until I change it back. -- Alexander E. Patrakov - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vt: fix a potential race in the VT_WAITACTIVE handler
On Thu, 15 Mar 2007 15:10:23 +0100 Michal Januszewski [EMAIL PROTECTED] wrote: On a multiprocessor machine the VT_WAITACTIVE ioctl call may return 0 if fg_console has already been updated in redraw_screen(), but the console switch itself hasn't been completed. Fix this by checking fg_console in vt_waitactive() with the console sem held. Signed-off-by: Michal Januszewski [EMAIL PROTECTED] --- diff --git a/drivers/char/vt_ioctl.c b/drivers/char/vt_ioctl.c index 3a5d301..00b5b34 100644 --- a/drivers/char/vt_ioctl.c +++ b/drivers/char/vt_ioctl.c @@ -1041,8 +1041,12 @@ int vt_waitactive(int vt) for (;;) { set_current_state(TASK_INTERRUPTIBLE); retval = 0; - if (vt == fg_console) + acquire_console_sem(); + if (vt == fg_console) { + release_console_sem(); break; + } + release_console_sem(); retval = -EINTR; if (signal_pending(current)) break; OK. I think. It's hard to tell. I assume that the acquire_console_sem() in here is to synchronise against some other function which also takes acquire_console_sem(), but it is not clear which. So could you please redo this with a comment which tells the reader exactly what's being protected against what, and why? Also, I always feel a bit worried by: set_current_state(TASK_INTERRUPTIBLE); down(...); because if it hits contention, the down() will undo the set_curremt_state(). Now that's normally OK because we loop, and because the semaphore won't normally be 100% contended all the time. Unless someone reimplements down() so it happens to return in state TASK_RUNNING all the time, which they could legitimately do (although this would probably break stuff such as the above). But still, it is nicer to do down(...); set_current_state(TASK_INTERRUPTIBLE); if possible, and I think it is possible here. Thanks. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: XFS internal error xfs_da_do_buf(2) at line 2087 of file fs/xfs/xfs_da_btree.c. Caller 0xc01b00bd
On Mon, Mar 19, 2007 at 11:32:27AM +0100, Marco Berizzi wrote: Marco Berizzi wrote: David Chinner wrote: Ok, so an ipsec change. And I see from the history below it really has nothing to do with this problem. it seems the problem has something to do with changes between 2.6.19.1 and 2.6.19.2. indeed. Yesterday at 13:00 I have switched from 2.6.19.1 to 2.6.19.2 (without the ipsec fix) and at about 17:30 linux has crashed again. I have recompiled 2.6.19.2 with all kernel debugging options enabled and rebooted. Now I'm waiting for the crash... Linux has not been crashed. However here is dmesg output with all debugging option enabled: (search for 'INFO: possible recursive locking detected'). Is that normal? . = [ INFO: possible recursive locking detected ] 2.6.19.2 #1 - rm/470 is trying to acquire lock: ((ip-i_lock)-mr_lock){}, at: [c01cd64a] xfs_ilock+0x5b/0xa1 but task is already holding lock: ((ip-i_lock)-mr_lock){}, at: [c01cd64a] xfs_ilock+0x5b/0xa1 other info that might help us debug this: 3 locks held by rm/470: #0: (inode-i_mutex/1){--..}, at: [c016e5a7] do_unlinkat+0x70/0x115 #1: (inode-i_mutex){--..}, at: [c030be35] mutex_lock+0x1c/0x1f #2: ((ip-i_lock)-mr_lock){}, at: [c01cd64a] xfs_ilock+0x5b/0xa1 stack backtrace: [c0103bc0] dump_trace+0x215/0x21a [c0103c68] show_trace_log_lvl+0x1a/0x30 [c0103c90] show_trace+0x12/0x14 [c0103d8d] dump_stack+0x19/0x1b [c01357e7] print_deadlock_bug+0xc0/0xcf [c0135860] check_deadlock+0x6a/0x79 [c01372e1] __lock_acquire+0x350/0x970 [c0137fd1] lock_acquire+0x75/0x97 [c01331ab] down_write+0x3a/0x54 [c01cd64a] xfs_ilock+0x5b/0xa1 [c01eda0e] xfs_lock_dir_and_entry+0x105/0x11b [c01edcc5] xfs_remove+0x180/0x47f [c01f8a9e] xfs_vn_unlink+0x22/0x4f [c016e533] vfs_unlink+0x9e/0xa2 [c016e5df] do_unlinkat+0xa8/0x115 [c016e68b] sys_unlink+0x10/0x12 [c0102cdb] syscall_call+0x7/0xb [b7efaa7d] 0xb7efaa7d === That's no problem - lockdep just doesn't know that we can nest i_lock (we've got to get the annotations for this sorted out). Here is the relevant results: Phase 2 - found root inode chunk Phase 3 - ... agno = 0 ... agno = 12 LEAFN node level is 1 inode 1610612918 bno = 8388608 Hmmm - single bit error in the bno - that reminds of this: http://oss.sgi.com/projects/xfs/faq.html#dir2 So I'd definitely make sure that is repaired Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
PNPACPI probes serial twice, messes up serial console
Dell SC1425 x86_64 running in i386 mode (the problem also occurs in x86_64 mode). Kernel 2.6.21-rc4, gcc 4.1.0. Config extract at end. Booting with 'console=tty console=ttyS0,9600'. The serial console on ttyS0 (0x3f8, irq 4) is probed twice, once from serial8250_init() and again from serial_pnp_probe(). The serial console output is correct until the second probe (from PNP) gets to these lines in serial8250_config_port() if (flags UART_CONFIG_TYPE) autoconfig(up, probeflags); After the call to autoconfig(), the serial console starts printing NUL characters instead of the console output. The number of NUL characters corresponds closely with the number of characters written to the VT console, IOW it outputs each serial character as NUL instead of the correct character. When the kernel boots /sbin/init, the console resets to printing normal characters. AFAICT, the second probe of the UART is doing something nasty to the hardware. This is not a recent problem, I can reproduce the problem on 2.6.16. Booting with pnpacpi=off removes the problem, but that supresses all the PNPACPI code, not just the second probe of the serial devices. Should pnpacpi probe and setup the serial devices even when thay have already been setup? Or this is something strange about the UART in this particular box? FWIW, the serial console is plugged into a serial to USB converter (pl2303), my laptop has no serial ports. That should not make a difference, but just in case it does ... Config extract: X86_32=y GENERIC_TIME=y CLOCKSOURCE_WATCHDOG=y GENERIC_CLOCKEVENTS=y GENERIC_CLOCKEVENTS_BROADCAST=y LOCKDEP_SUPPORT=y STACKTRACE_SUPPORT=y SEMAPHORE_SLEEPERS=y X86=y MMU=y ZONE_DMA=y GENERIC_ISA_DMA=y GENERIC_IOMAP=y GENERIC_BUG=y GENERIC_HWEIGHT=y ARCH_MAY_HAVE_PC_FDC=y DMI=y DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config EXPERIMENTAL=y LOCK_KERNEL=y INIT_ENV_ARG_LIMIT=32 LOCALVERSION=-i386-kaos LOCALVERSION_AUTO=y SWAP=y SYSVIPC=y SYSVIPC_SYSCTL=y POSIX_MQUEUE=y IKCONFIG=y IKCONFIG_PROC=y SYSFS_DEPRECATED=y CC_OPTIMIZE_FOR_SIZE=y SYSCTL=y EMBEDDED=y SYSCTL_SYSCALL=y KALLSYMS=y KALLSYMS_ALL=y HOTPLUG=y PRINTK=y BUG=y ELF_CORE=y BASE_FULL=y FUTEX=y EPOLL=y SHMEM=y SLAB=y VM_EVENT_COUNTERS=y RT_MUTEXES=y BASE_SMALL=0 MODULES=y MODULE_UNLOAD=y KMOD=y STOP_MACHINE=y BLOCK=y LBD=y LSF=y IOSCHED_NOOP=y IOSCHED_AS=y IOSCHED_DEADLINE=y IOSCHED_CFQ=y DEFAULT_DEADLINE=y DEFAULT_IOSCHED=deadline TICK_ONESHOT=y HIGH_RES_TIMERS=y SMP=y X86_PC=y MPENTIUM4=y X86_CMPXCHG=y X86_L1_CACHE_SHIFT=7 RWSEM_XCHGADD_ALGORITHM=y GENERIC_CALIBRATE_DELAY=y X86_WP_WORKS_OK=y X86_INVLPG=y X86_BSWAP=y X86_POPAD_OK=y X86_CMPXCHG64=y X86_GOOD_APIC=y X86_INTEL_USERCOPY=y X86_USE_PPRO_CHECKSUM=y X86_TSC=y HPET_TIMER=y HPET_EMULATE_RTC=y NR_CPUS=8 SCHED_SMT=y PREEMPT_NONE=y X86_LOCAL_APIC=y X86_IO_APIC=y X86_MCE=y X86_MCE_NONFATAL=y X86_MCE_P4THERMAL=y MICROCODE=m MICROCODE_OLD_INTERFACE=y X86_MSR=m X86_CPUID=m HIGHMEM4G=y VMSPLIT_3G=y PAGE_OFFSET=0xC000 HIGHMEM=y ARCH_FLATMEM_ENABLE=y ARCH_SPARSEMEM_ENABLE=y ARCH_SELECT_MEMORY_MODEL=y ARCH_POPULATES_NODE_MAP=y SELECT_MEMORY_MODEL=y FLATMEM_MANUAL=y FLATMEM=y FLAT_NODE_MEM_MAP=y SPARSEMEM_STATIC=y SPLIT_PTLOCK_CPUS=4 ZONE_DMA_FLAG=1 MTRR=y IRQBALANCE=y HZ_250=y HZ=250 PHYSICAL_START=0x10 PHYSICAL_ALIGN=0x10 COMPAT_VDSO=y ARCH_ENABLE_MEMORY_HOTPLUG=y PM=y ACPI=y ACPI_PROCFS=y ACPI_BUTTON=m ACPI_FAN=m ACPI_PROCESSOR=m ACPI_BLACKLIST_YEAR=0 ACPI_EC=y ACPI_POWER=y ACPI_SYSTEM=y PCI=y PCI_GOANY=y PCI_BIOS=y PCI_DIRECT=y PCI_MMCONFIG=y PCIEPORTBUS=y PCIEAER=y PCI_MSI=y HT_IRQ=y ISA_DMA_API=y BINFMT_ELF=y BINFMT_MISC=m NET=y PACKET=y PACKET_MMAP=y UNIX=y XFRM=y INET=y IP_MULTICAST=y IP_ADVANCED_ROUTER=y ASK_IP_FIB_HASH=y IP_FIB_HASH=y IP_ROUTE_MULTIPATH=y IP_ROUTE_VERBOSE=y SYN_COOKIES=y INET_XFRM_MODE_BEET=y INET_DIAG=y INET_TCP_DIAG=y TCP_CONG_CUBIC=y DEFAULT_TCP_CONG=cubic NETFILTER=y NETFILTER_NETLINK=m NETFILTER_NETLINK_LOG=m NETFILTER_XTABLES=y NETFILTER_XT_TARGET_CLASSIFY=m NETFILTER_XT_TARGET_MARK=m NETFILTER_XT_MATCH_COMMENT=m NETFILTER_XT_MATCH_DCCP=m NETFILTER_XT_MATCH_ESP=m NETFILTER_XT_MATCH_LENGTH=m NETFILTER_XT_MATCH_LIMIT=m NETFILTER_XT_MATCH_MAC=m NETFILTER_XT_MATCH_MARK=m NETFILTER_XT_MATCH_MULTIPORT=m NETFILTER_XT_MATCH_PKTTYPE=m NETFILTER_XT_MATCH_QUOTA=m NETFILTER_XT_MATCH_REALM=m NETFILTER_XT_MATCH_SCTP=m NETFILTER_XT_MATCH_STATISTIC=m NETFILTER_XT_MATCH_TCPMSS=m IP_NF_IPTABLES=y IP_NF_MATCH_IPRANGE=m IP_NF_MATCH_TOS=m IP_NF_MATCH_RECENT=m IP_NF_MATCH_ECN=m IP_NF_MATCH_AH=m IP_NF_MATCH_TTL=m IP_NF_MATCH_OWNER=m IP_NF_MATCH_ADDRTYPE=m IP_NF_FILTER=y IP_NF_TARGET_REJECT=y IP_NF_TARGET_ULOG=y VLAN_8021Q=y NET_CLS_ROUTE=y STANDALONE=y PREVENT_FIRMWARE_BUILD=y FW_LOADER=m CONNECTOR=m PNP=y PNP_DEBUG=y PNPACPI=y BLK_DEV_FD=m BLK_DEV_LOOP=m IDE=m IDE_MAX_HWIFS=4 BLK_DEV_IDE=m BLK_DEV_IDEDISK=m IDEDISK_MULTI_MODE=y BLK_DEV_IDECD=m IDE_TASK_IOCTL=y BLK_DEV_IDEPCI=y IDEPCI_SHARE_IRQ=y BLK_DEV_IDEDMA_PCI=y IDEDMA_PCI_AUTO=y BLK_DEV_PIIX=m BLK_DEV_IDEDMA=y
Re: [PATCH] powerpc minor pagefault optimization with kprobes enabled
I've attached a patch below the optimizes this code path for powerpc, but the scheme applies to all architectures aswell. It just rips out all the callachin madness, and does as good as it gets in the pagefault handler: NAK, patch on the way to get rid of all the debugger() crap by using this very hook. Anton - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
Christoph Hellwig wrote: On Mon, Mar 19, 2007 at 09:11:31PM +1100, Nick Piggin wrote: I've got the patches in -mm now. I hope they will get merged when the the next window opens. I didn't submit the -page_mkwrite conversion yet, because I didn't have any callers to look at. It is is slightly less trivial than for nopage and nopfn, so having David's block_page_mkwrite is helpful. Yes. I was just wondering whether it makes more sense to do this functionality directly ontop of -fault instead of converting i over real soon. I would personally prefer that, but I don't want to block David's patch from being merged if the -fault patches do not get in next cycle. If the fault patches do make it in first, then yes we should do the page_mkwrite conversion before merging David's patch. I'll keep an eye on it, and try to do the right thing. Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
On Mon, 2007-03-19 at 18:00 -0800, Zachary Amsden wrote: Rusty Russell wrote: *This* was the reason that the current hand-coded calls only clobber % eax. It was a compromise between native (no clobbers) and others (might need a reg). I still don't think this was a good trade. ... Xen no longer needs such a register Hmm, well, if VMI is happy, Xen is happy, and lguest is happy, then perhaps we're better off with a cc-only clobber rule? Certainly makes life simpler. Now, since we decided to allow paravirt_ops operations to be normal C (ie. the patching is optional and done late), we actually push and pop % ecx and %edx. This makes the call site 10 bytes long, which is a nice size for patching anyway (enough for a movl $0, addr, a-la lguest's cli, or movw $0, %gs:addr if we supported SMP). You can do it in 11 bytes with no clobbers and normal C semantics by linking to a direct address instead of calling to an indirect, but then you need some gross fixup technology in paravirt_patch: if (call_addr == (void*)native_sti) { ... } Well, I don't think we need such hacks: since we have to use handcoded asm and mark the callsites anyway, marking what they're calling is trivial. The other idea from btfixup is that we can do the patching *much* earlier, so we don't need the initial code to be valid at all if we wanted to: we just need room to patch in a call insn. We could then generate trampolines which do the necessary pushes pops automatically for backends which want to use C calling conventions. Perhaps it's time for code and benchmarks? Rusty. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
Zachary Amsden wrote: For VMI, the default clobber was cc, and you need a way to allow at least that, because saving and restoring flags is too expensive on x86. According to lore (Andi, I think), asm() always clobbers cc. I still don't think this was a good trade. The primary motivation for clobbering %eax was that Xen wanted a free register to use for computing the offset into the shared data in the case of SMP preemptible kernels. Xen no longer needs such a register, they can use the PDA offset instead. And it does hurt native performance by unconditionally stealing a register in the four most commonly invoked paravirt-ops code sequences. Actually, it still does need a temp register. The sequence for cli is: mov %fs:xen_vcpu, %eax movb $1,1(%eax) At some point I hope to move the vcpu structure directly into the pda/percpu variables, at which point it will need no temps. J - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
On Monday 19 March 2007 00:46, Jeremy Fitzhardinge wrote: Andi Kleen wrote: Yes. All inline assembly tells gcc what registers are clobbered and it fills in the tables. Hand clobbering in inline assembly cannot be expressed with the current toolchain, so we moved all those out of line. But again I'm not sure it will work anyways. For once you would need large padding around the calls anyways for inline replacement -- how would you generate that? I expect you would need to put the calls into asm() again and with that a custom annotiation format looks reasonable. Inlining is most important for very small code: sti, cli, pushf;pop eax, etc (in many cases, no-ops). We'd have at least 5 bytes to work in, and maybe more if there are surrounding push/pops to be consumed. For example, say we wanted to put a general call for sti into entry.S, where its expected it won't touch any registers. In that case, we'd have a sequence like: push %eax push %ecx push %edx call paravirt_cli pop %edx pop %ecx pop %eax This cannot right now be expressed as inline assembly in the unwinder at all because there is no way to inject the push/pops into the compiler generated ehframe tables. [BTW I plan to resubmit the unwinder with some changes] If we parse the relocs, then we'd find the reference to paravirt_cli. If we look at the byte before and see 0xe8, then we can see if its a call. If we then work out in each direction and see matched push/pops, then we know what registers can be trashed in the call. This also allows us to determine the callsite size, and therefore how much space we need for inlining. gcc normally doesn't generate push/pops around directly around the call site, but somewhere else due to the way its register allocator works. It can be anywhere in the function or even not there at all if the register didn't contain anything useful. And they're not necessarily push/pops of course. So you would need to write it as inline assembly. I'm not sure it would be significantly cleaner than just having tables then. So in this case, we see that there are 5 bytes for the call and a further 6 bytes of push/pops available for inlining. Of course this is hand-written code anyway, so there's no particular burden to having some extra metadata stashed away in another section. For compiler-generated code, we know that it's already expecting standard C ABI calling conventions. The downside, of course, is that only the 5 byte call space is available for inline patching. It's unlikely you can do much useful in 5 bytes I guess. Regarding cli/sti: i've been actually thinking about changing it in the non paravirt kernel. IIRC most save_flags/restore_flags are inside spin_lock_irqsave/restore() and that is a separate function anyways so a little larger special case code is ok as long as it is not slower. There is some evidence that at least on P4 a software cli/sti flag without pushf/popf would be faster. -Andi - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
Andi Kleen wrote: push %eax push %ecx push %edx call paravirt_cli pop %edx pop %ecx pop %eax This cannot right now be expressed as inline assembly in the unwinder at all because there is no way to inject the push/pops into the compiler generated ehframe tables. I was thinking more of entry.S rather than inline asms. The only ones I can think of which would be relevant are the interrupt operations in spinlocks, and even then we get a bit more leeway. gcc normally doesn't generate push/pops around directly around the call site, but somewhere else due to the way its register allocator works. It can be anywhere in the function or even not there at all if the register didn't contain anything useful. And they're not necessarily push/pops of course. Right, I'm not making any assumptions about what gcc generates for calls, other than assuming it uses a standard call X direct call instruction. So you would need to write it as inline assembly. I'm not sure it would be significantly cleaner than just having tables then. No, my intention is that the vast majority of pv_ops uses, which are calls from C code, would simply be normal C calls, syntatically and semantically. It's unlikely you can do much useful in 5 bytes I guess. I think the main value is retaining non-PARAVIRT native performance levels. In the native case, the inlined instructions amount to 1 or 2 bytes, and having small patch sites is an advantage because there's less space wasted on nops. My understanding is that a direct call has almost zero overhead on modern processors, because both the call and the return can be completely predicted and prefetched, so the threshhold at which you make inline vs call tradeoff is pretty small. Regarding cli/sti: i've been actually thinking about changing it in the non paravirt kernel. IIRC most save_flags/restore_flags are inside spin_lock_irqsave/restore() and that is a separate function anyways so a little larger special case code is ok as long as it is not slower. There is some evidence that at least on P4 a software cli/sti flag without pushf/popf would be faster. There are also the ones in entry.S. I suppose spinlocks do get used more than syscalls and interrupts. J - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
From: Andi Kleen [EMAIL PROTECTED] Date: Mon, 19 Mar 2007 11:57:28 +0100 On Monday 19 March 2007 00:46, Jeremy Fitzhardinge wrote: Andi Kleen wrote: For example, say we wanted to put a general call for sti into entry.S, where its expected it won't touch any registers. In that case, we'd have a sequence like: push %eax push %ecx push %edx call paravirt_cli pop %edx pop %ecx pop %eax This cannot right now be expressed as inline assembly in the unwinder at all because there is no way to inject the push/pops into the compiler generated ehframe tables. [BTW I plan to resubmit the unwinder with some changes] It's inability to handle sequences like the above sounds to me like a very good argument to _not_ merge the unwinder back into the tree. To me, that unwinder is nothing but trouble, it severly limits what cases you can use special calling conventions via inline asm (and we have done that on several occaisions) and even ignoring that the unwinder only works half the time. Please don't subject us to another couple months of hair-pulling only to have Linus yank the thing out again, there are certainly more useful things to spend time on :-) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
It's inability to handle sequences like the above sounds to me like a very good argument to _not_ merge the unwinder back into the tree. The unwinder can handle it fine, it is just that gcc right now cannot be taught to generate correct unwind tables for it. If paravirt ops is widely used i guess it would be possible to find some workaround for that though. There is one case it truly cannot handle (it would be possible by switching to dwarf3), but that was relatively easily eliminated and wasn't a problem imho. To me, that unwinder is nothing but trouble, it severly limits what cases you can use special calling conventions via inline asm (and we have done that on several occaisions) and even ignoring that the unwinder only works half the time. Initially we had some bugs that accounted for near all failures, but they were all fixed in the latest version. Please don't subject us to another couple months of hair-pulling only to have Linus yank the thing out again, there are certainly more useful things to spend time on :-) I think better backtraces are a good use of my time. Sorry if you disagree on that. -Andi - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
From: Linus Torvalds [EMAIL PROTECTED] Date: Mon, 19 Mar 2007 20:18:14 -0700 (PDT) Please don't subject us to another couple months of hair-pulling only to have Linus yank the thing out again, there are certainly more useful things to spend time on :-) Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it out, I simply won't merge it. It was more than just totally buggy code, it was an inability of the people to understand that even bugfree code isn't enough - you have to be able to also handle buggy data. Thank you. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
David Miller [EMAIL PROTECTED] writes: From: Linus Torvalds [EMAIL PROTECTED] Date: Mon, 19 Mar 2007 20:18:14 -0700 (PDT) Please don't subject us to another couple months of hair-pulling only to have Linus yank the thing out again, there are certainly more useful things to spend time on :-) Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it out, I simply won't merge it. It was more than just totally buggy code, it was an inability of the people to understand that even bugfree code isn't enough - you have to be able to also handle buggy data. Thank you. Hmm.. I know the feeling I have had a similar rant about the kexec on panic code path. The code is still no where near as paranoid about normal kernel things not working as it could be, but by ranting about it periodically the people doing the work are gradually making it better. I'm conflicted about the dwarf unwinder. I was off doing other things at the time so I missed the pain, but I do have a distinct recollection of the back traces on x86_64 being distinctly worse the on i386. Lately I haven't seen that so it may be I was misinterpreting what I was seeing, and the compiler optimizations were what gave me such weird back traces. But if the quality of our backtraces has gone down and dwarf unwinder could give us better back traces it is likely worth pursuing. Of course it would need to start with the assumption that it's tables may be borked (the kernel is busted after all) and be much more careful than Andi's last attempt. Eric - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
new warning in ata_sg_clean
In todays 2.6.21-rc4+git the following news warning has appeared on my ppc computer: CC [M] drivers/ata/libata-core.o drivers/ata/libata-core.c: In function 'ata_sg_clean': drivers/ata/libata-core.c:3558: warning: unused variable 'dir' -- Meelis Roos ([EMAIL PROTECTED]) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.21-rc4: known regressions with patches available
Let's not bore people running -rc kernels with regressions that have patches available - let's get this patches into the tree for giving them the pure exciting experience of the many unfixed regressions. This email lists some known regressions in Linus' tree compared to 2.6.20 with patches available. If you find your name in the Cc header, you are either submitter of one of the bugs, maintainer of an affectected subsystem or driver, a patch of you caused a breakage or I'm considering you in any other way possibly involved with one or more of these issues. Due to the huge amount of recipients, please trim the Cc when answering. Subject: acpi_serialize locks system during boot References : http://bugzilla.kernel.org/show_bug.cgi?id=8171 Submitter : Colchao [EMAIL PROTECTED] Handled-By : Len Brown [EMAIL PROTECTED] Patch : http://bugzilla.kernel.org/show_bug.cgi?id=8171 Status : patch available Subject: laptop immediately resumes after suspend References : http://lkml.org/lkml/2007/3/8/469 Submitter : Ray Lee [EMAIL PROTECTED] Caused-By : Alexey Starikovskiy [EMAIL PROTECTED] commit ed41dab90eb40ac4911e60406bc653661f0e4ce1 Handled-By : Len Brown [EMAIL PROTECTED] Patch : http://lkml.org/lkml/2007/3/12/228 Status : patch available Subject: suspend/resume hangs (until keypress) (CONFIG_NO_HZ) References : http://bugzilla.kernel.org/show_bug.cgi?id=8181 http://lkml.org/lkml/2007/3/11/112 Submitter : Tomas Janousek [EMAIL PROTECTED] Thomas Meyer [EMAIL PROTECTED] Milan Broz [EMAIL PROTECTED] Handled-By : Thomas Gleixner [EMAIL PROTECTED] Patch : http://lkml.org/lkml/2007/3/16/406 Status : patch available Subject: snd-intel8x0: no 3d surround sound References : http://lkml.org/lkml/2007/3/5/164 Submitter : Michal Piotrowski [EMAIL PROTECTED] Caused-By : Randy Cushman [EMAIL PROTECTED] commit 831466f4ad2b5fe23dff77edbe6a7c244435e973 Handled-By : Takashi Iwai [EMAIL PROTECTED] Status : patch available Subject: USB: Oops when connecting USB 1.1 docks References : http://lkml.org/lkml/2007/3/4/266 Submitter : Mark Lord [EMAIL PROTECTED] Caused-By : Jim Radford [EMAIL PROTECTED] commit d9a7ecacac5f8274d2afce09aadcf37bdb42b93a Handled-By : Oliver Neukum [EMAIL PROTECTED] Jim Radford [EMAIL PROTECTED] Patch : http://lkml.org/lkml/2007/3/13/217 Status : patch available - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] [2/6] 2.6.21-rc2: known regressions
Jim Gettys wrote: On Sun, 2007-03-18 at 17:07 +0100, Ingo Molnar wrote: * Pavel Machek [EMAIL PROTECTED] wrote: Some day we may have modesetting support in the kernel for some graphics hw, right now it's pretty damn spotty. Yep, that's the way to go. hey, i wildly supported this approach ever since 1996, when GGI came up :-/ So wildly you wrote tons of code ;-). More seriously, at the time, XFree86 would have spat in your face for any such thing. Thankfully, times are changing. Also more seriously, a somewhat hybrid approach is in order for mode setting: simple mode setting isn't much code and is required for sane behavior on crash (it is nice to get oopses onto a screen); but the full blown mode setting/configuration problem is so large that on some hardware, it is likely left best left to a helper process (not the X server). Also key to get sane behavior out of the scheduler is to get the X server to yield (sleep in the kernel) rather than busy waiting when the GPU is busy; a standardized interface for this for both fbdev and dri is in order. Right now, X is a misbehaving compute bound process rather than the properly interactive process it can/should/will be, releasing the CPU whenever the hardware is busy. Needless to say, this wastes cycles and hurts interactivity with just about any scheduler you can devise. It isn't as if this is hard; on UNIX systems we did it in 1984 or thereabouts. What you say sounds good, assuming that the cost of a sleep is less than the cost of the busy wait. But this may be hardware, the waits may be very small and frequent, and if it's hitting a small hardware window like retrace, delays in response will cause the time period to be missed completely. This probably less critical with very smart cards, many of us don't run them. Of course, in 1996, XFree86 would have ignored any such interfaces, in its insane quest for operating system independent user space drivers requiring no standard kernel interfaces (it is the second part of this where the true insanity lay). - Jim -- Bill Davidsen [EMAIL PROTECTED] We have more to fear from the bungling of the incompetent than from the machinations of the wicked. - from Slashdot - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] [2/6] 2.6.21-rc2: known regressions
On Mon, 2007-03-19 at 16:33 -0400, Bill Davidsen wrote: What you say sounds good, assuming that the cost of a sleep is less than the cost of the busy wait. But this may be hardware, the waits may be very small and frequent, and if it's hitting a small hardware window like retrace, delays in response will cause the time period to be missed completely. This probably less critical with very smart cards, many of us don't run them. Actually, various strategies involving short busy waiting, or looking at DMA address registers before sleeping were commonplace. But a syscall/sleep/wakeup is/was pretty fast. If you have an operation blitting the screen (e.g. scrolling), it takes a bit of time for the GPU to execute the command. I see this right now on OLPC, where a wonderful music application needs to scroll (most of) the screen left), periodically, and we're losing samples sometimes at those operation. Remember also, that being nice to everyone else by sleeping, there are more cycles to go around, and the scheduler can nicely boost the X server's priority as it will for interactive processes that are being cooperative. - Jim -- Jim Gettys One Laptop Per Child - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: dst_ifdown breaks infiniband?
Hello! Well I don't think the loopback device is currently but as soon as we get network namespace support we will have multiple loopback devices and they will get unregistered when we remove the network namespace. There is no logical difference. At the moment when namespace is gone there is nobody who can hold unrevokable references to this loopback. Alexey - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: dst_ifdown breaks infiniband?
Hello! Does this look sane (untested)? It does not, unfortunately. Instead of regular crash in infiniband you will get numerous random NULL pointer dereferences both due to dst-neighbour and due to dst-dev. Alexey - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: dst_ifdown breaks infiniband?
Quoting Alexey Kuznetsov [EMAIL PROTECTED]: Subject: Re: [ofa-general] Re: dst_ifdown breaks infiniband? Does this look sane (untested)? It does not, unfortunately. Instead of regular crash in infiniband you will get numerous random NULL pointer dereferences both due to dst-neighbour and due to dst-dev. Right, I saw this clearly in the morning. Thanks. -- MST - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: dst_ifdown breaks infiniband?
Hello! I think the thing to do is to just leave the loopback references in place, try to unregister the per-namespace loopback device, and that will safely wait for all the references to go away. Yes, it is exactly how it works in openvz. All the sockets are killed, queues are cleared, nobody holds references and virtual loopback can be unregistered just like another device. Alexey - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dst_ifdown breaks infiniband?
Any simpler ideas? Well, if inifiniband destructor really needs to take that lock... no. Right now I do not see. OK, this is actually not hard to fix - for infiniband, we can just look at neighbour-dev-type or compare neighbour-dev and neighbour-parms-dev - if they are different, device is being unregistered, so we do not need to do anything in the destructor. I'll send a patch to openfabrics, shortly. However, after implementing this fix, I hit what could be use after free at module unloading. Dave, Alexey, Roland, could you take a look at the following please? Works fine for me (survived a couple of hours of crazy device loading/unloading/up/down/hotplug + link data and state activity) and seems to fix the issue. - If a device driver sets neigh_destructor in neigh_params, this could get called after the device has been unregistered and the driver module removed. This is an old bug, but apparently, started to get triggered more infiniband after recent multicast and connected mode changes. Fix this by delaying dev_put until the neigh_params object is removed. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 3183142..cb34f1a 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1313,8 +1313,6 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms) *p = parms-next; parms-dead = 1; write_unlock_bh(tbl-lock); - if (parms-dev) - dev_put(parms-dev); call_rcu(parms-rcu_head, neigh_rcu_free_parms); return; } @@ -1325,6 +1323,8 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms) void neigh_parms_destroy(struct neigh_parms *parms) { + if (parms-dev) + dev_put(parms-dev); kfree(parms); } -- MST - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dst_ifdown breaks infiniband?
Quoting Michael S. Tsirkin [EMAIL PROTECTED]: Subject: Re: dst_ifdown breaks infiniband? Any simpler ideas? Well, if inifiniband destructor really needs to take that lock... no. Right now I do not see. OK, this is actually not hard to fix - for infiniband, we can just look at neighbour-dev-type or compare neighbour-dev and neighbour-parms-dev - if they are different, device is being unregistered, so we do not need to do anything in the destructor. I'll send a patch to openfabrics, shortly. However, after implementing this fix, I hit what could be use after free at module unloading. Dave, Alexey, Roland, could you take a look at the following please? Works fine for me (survived a couple of hours of crazy device loading/unloading/up/down/hotplug + link data and state activity) and seems to fix the issue. - If a device driver sets neigh_destructor in neigh_params, this could get called after the device has been unregistered and the driver module removed. This is an old bug, but apparently, started to get triggered more infiniband after recent multicast and connected mode changes. Fix this by delaying dev_put until the neigh_params object is removed. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] The problem seems real enough but the fix seems no good - device unregister gets blocked with unregister_netdevice: waiting for ib0 to become free. Usage count = 1 It seems the parms object can survive indefinitely after device is removed. How about creating a new parms object in dst_ifdown, and pointing neighbour to this? Would that work? The advantage of this approach is that neigh-parms is already protected by RCU. -- MST - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dst_ifdown breaks infiniband?
Hello! If a device driver sets neigh_destructor in neigh_params, this could get called after the device has been unregistered and the driver module removed. It is the same problem: if dst-neighbour holds neighbour, it should not hold device. parms-dev is not supposed to be used after neigh_parms_release(). F.e. set parms-dev to NULL to catch bad references. Do you search for a way to find real inifiniband device in ipoib_neigh_destructor()? I guess you will not be able. The problem is logical: if destructor needs device, neighbour entry _somehow_ have to hold reference to the device (via neigh-dev, neigh-parms, whatever). Hence, if we hold neighbour entry, unregister cannot be completed. Therefore, destructor cannot refer to device. Q.E.D. :-) Seems, releasing dst-neighbour is inevitable. Alexey - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dst_ifdown breaks infiniband?
Quoting Alexey Kuznetsov [EMAIL PROTECTED]: Subject: Re: dst_ifdown breaks infiniband? Hello! If a device driver sets neigh_destructor in neigh_params, this could get called after the device has been unregistered and the driver module removed. It is the same problem: if dst-neighbour holds neighbour, it should not hold device. parms-dev is not supposed to be used after neigh_parms_release(). F.e. set parms-dev to NULL to catch bad references. Yes. I fixed that - simply checking that neighbour-dev is a loopback device is sufficient to detect the fact that the device is being unregistered. Do you search for a way to find real inifiniband device in ipoib_neigh_destructor()? No, not anymore. I guess you will not be able. I agree it's not possible. The problem is logical: if destructor needs device, neighbour entry _somehow_ have to hold reference to the device (via neigh-dev, neigh-parms, whatever). Hence, if we hold neighbour entry, unregister cannot be completed. Therefore, destructor cannot refer to device. Q.E.D. :-) Seems, releasing dst-neighbour is inevitable. infiniband sets parm-neigh_destructor, and I search for a way to prevent this destructor from being called after the module has been unloaded. Ideas? -- MST - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dst_ifdown breaks infiniband?
Quoting Alexey Kuznetsov [EMAIL PROTECTED]: Subject: Re: dst_ifdown breaks infiniband? Hello! If a device driver sets neigh_destructor in neigh_params, this could get called after the device has been unregistered and the driver module removed. It is the same problem: if dst-neighbour holds neighbour, it should not hold device. parms-dev is not supposed to be used after neigh_parms_release(). F.e. set parms-dev to NULL to catch bad references. Yes. I fixed that - simply checking that neighbour-dev is a loopback device is sufficient to detect the fact that the device is being unregistered. Do you search for a way to find real inifiniband device in ipoib_neigh_destructor()? No, not anymore. I guess you will not be able. I agree it's not possible. The problem is logical: if destructor needs device, neighbour entry _somehow_ have to hold reference to the device (via neigh-dev, neigh-parms, whatever). Hence, if we hold neighbour entry, unregister cannot be completed. Therefore, destructor cannot refer to device. Q.E.D. :-) Seems, releasing dst-neighbour is inevitable. infiniband sets parm-neigh_destructor, and I search for a way to prevent this destructor from being called after the module has been unloaded. Ideas? -- MST - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dst_ifdown breaks infiniband?
Hello! infiniband sets parm-neigh_destructor, and I search for a way to prevent this destructor from being called after the module has been unloaded. Ideas? It must be called in any case to update/release internal ipoib structures. The idea is to move call of parm-neigh_destructor from neighbour destructor to the moment when it is unhashed, right after n-dead is set. infiniband is the only user (atm clip uses it too, but that use is obviously dummy), so that nobody will be harmed. But ipoib will have to check for validity of skb-dst-neighbour before attempt to reinitialize private data on dead (n-dead != 0) neighbour. Alexey - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: dst_ifdown breaks infiniband?
David Miller [EMAIL PROTECTED] writes: I think the thing to do is to just leave the loopback references in place, try to unregister the per-namespace loopback device, and that will safely wait for all the references to go away. Right. The only thing I have found that needs to be changed so far in this area is specifying which loopback device I want to replace it with. If you do it that way, you should need absolutely no changes to the other code in this area. As per Herbert, I think he works on Xen rather than vserver :-) Perhaps you're thinking of Alexey Kuznetsov or another one of the vserver guys. I think you are thinking of a different Herbert. I was thinking of Herbert Poetzl the vserver maintainer. Alexey works on OpenVZ. Until we get the basic architecture merged they are rival projects. Eric - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dst_ifdown breaks infiniband?
Quoting Alexey Kuznetsov [EMAIL PROTECTED]: Subject: Re: dst_ifdown breaks infiniband? Hello! infiniband sets parm-neigh_destructor, and I search for a way to prevent this destructor from being called after the module has been unloaded. Ideas? It must be called in any case to update/release internal ipoib structures. I don't think there's a problem. All we do in destructor is release the ipoib_neigh resource. And on device unregister we release all resources anyway. The idea is to move call of parm-neigh_destructor from neighbour destructor to the moment when it is unhashed, right after n-dead is set. infiniband is the only user (atm clip uses it too, but that use is obviously dummy), so that nobody will be harmed. This might work. Could you post a patch to better show what you mean to do? But ipoib will have to check for validity of skb-dst-neighbour before attempt to reinitialize private data on dead (n-dead != 0) neighbour. We set a flag before unregister_netdev and test it in start_xmit so that's covered I think. -- MST - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dst_ifdown breaks infiniband?
Hello! This might work. Could you post a patch to better show what you mean to do? Here it is. -neigh_destructor() is killed (not used), replaced with -neigh_cleanup(), which is called when neighbor entry goes to dead state. At this point everything is still valid: neigh-dev, neigh-parms etc. The device should guarantee that dead neighbor entries (neigh-dead != 0) do not get private part initialized, otherwise nobody will cleanup it. I think this is enough for ipoib which is the only user of this thing. Initialization private part of neighbor entries happens in ipib start_xmit routine, which is not reached when device is down. But it would be better to add explicit test for neigh-dead in any case. diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index f9dbc6f..2b5c297 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct queue_work(ipoib_workqueue, priv-restart_task); } -static void ipoib_neigh_destructor(struct neighbour *n) +static void ipoib_neigh_cleanup(struct neighbour *n) { struct ipoib_neigh *neigh; struct ipoib_dev_priv *priv = netdev_priv(n-dev); @@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struc struct ipoib_ah *ah = NULL; ipoib_dbg(priv, - neigh_destructor for %06x IPOIB_GID_FMT \n, + neigh_cleanup for %06x IPOIB_GID_FMT \n, IPOIB_QPN(n-ha), IPOIB_GID_RAW_ARG(n-ha + 4)); @@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { - parms-neigh_destructor = ipoib_neigh_destructor; + parms-neigh_cleanup = ipoib_neigh_cleanup; return 0; } diff --git a/include/net/neighbour.h b/include/net/neighbour.h index 3725b93..ad7fe11 100644 --- a/include/net/neighbour.h +++ b/include/net/neighbour.h @@ -36,7 +36,7 @@ struct neigh_parms struct net_device *dev; struct neigh_parms *next; int (*neigh_setup)(struct neighbour *); - void(*neigh_destructor)(struct neighbour *); + void(*neigh_cleanup)(struct neighbour *); struct neigh_table *tbl; void*sysctl_table; diff --git a/net/atm/clip.c b/net/atm/clip.c index ebb5d0c..8c38258 100644 --- a/net/atm/clip.c +++ b/net/atm/clip.c @@ -261,14 +261,6 @@ static void clip_pop(struct atm_vcc *vcc spin_unlock_irqrestore(PRIV(dev)-xoff_lock, flags); } -static void clip_neigh_destroy(struct neighbour *neigh) -{ - DPRINTK(clip_neigh_destroy (neigh %p)\n, neigh); - if (NEIGH2ENTRY(neigh)-vccs) - printk(KERN_CRIT clip_neigh_destroy: vccs != NULL !!!\n); - NEIGH2ENTRY(neigh)-vccs = (void *) NEIGHBOR_DEAD; -} - static void clip_neigh_solicit(struct neighbour *neigh, struct sk_buff *skb) { DPRINTK(clip_neigh_solicit (neigh %p, skb %p)\n, neigh, skb); @@ -342,7 +334,6 @@ static struct neigh_table clip_tbl = { /* parameters are copied from ARP ... */ .parms = { .tbl= clip_tbl, - .neigh_destructor = clip_neigh_destroy, .base_reachable_time= 30 * HZ, .retrans_time = 1 * HZ, .gc_staletime = 60 * HZ, diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 3183142..cfc6001 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -140,6 +140,8 @@ static int neigh_forced_gc(struct neigh_ n-dead = 1; shrunk = 1; write_unlock(n-lock); + if (n-parms-neigh_cleanup) + n-parms-neigh_cleanup(n); neigh_release(n); continue; } @@ -211,6 +213,8 @@ static void neigh_flush_dev(struct neigh NEIGH_PRINTK2(neigh %p is stray.\n, n); } write_unlock(n-lock); + if (n-parms-neigh_cleanup) + n-parms-neigh_cleanup(n); neigh_release(n); } } @@ -582,9 +586,6 @@ void neigh_destroy(struct neighbour *nei kfree(hh); } - if (neigh-parms-neigh_destructor) - (neigh-parms-neigh_destructor)(neigh); - skb_queue_purge(neigh-arp_queue); dev_put(neigh-dev); @@ -675,6 +676,8 @@ static void neigh_periodic_timer(unsigne *np = n-next; n-dead = 1; write_unlock(n-lock); + if (n-parms-neigh_cleanup) + n-parms-neigh_cleanup(n);