Re: [PATCH] update checkpatch.pl to version 0.10
here is an update wrt. the latest checkpatch.pl-next version
(v11-to-be), about kernel/sched.c warnings:

>      size                        # warnings
>
>     25383 checkpatch.pl.v6          5
>     26038 checkpatch.pl.v7          6
>     29603 checkpatch.pl.v8         65
>     31160 checkpatch.pl.v9         24
>     34950 checkpatch.pl.v10        28
      35948 checkpatch.pl.v11pre     11

so things are heading in the right direction :)

of those 11 warnings, 6 are correct warnings (4 will be solved via
KERN_CONT, 1 will be solved via a proper include file, and 1 is an
overlength line), 4 are borderline warnings (easily fixed) and only one
is a false positive! So v11-to-be gets the "best checkpatch.pl ever"
badge from me :)

The false positive is:

 ERROR: need consistent spacing around '*' (ctx:WxV)
 #5322:
 +static ctl_table *sd_alloc_ctl_cpu_table(int cpu)
                   ^

i think checkpatch.pl mistook this function definition as an arithmetic
expression?

But, there's a cleanliness bug underlying this false positive:
'ctl_table' is a typedef, and it would be cleaner to use 'struct
ctl_table' throughout the kernel. When running checkpatch.pl over
include/linux/sysctl.h, it warns about the typedef:

 WARNING: do not add new typedefs
 #944:
 +typedef struct ctl_table ctl_table;

(but mistaking that function for an arithmetic expression is still a bug
i think.)

nice work Andy!

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
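The typedef point above can be shown in a few lines of plain C. This is only a sketch: `demo_ctl_table` and both helper functions are hypothetical stand-ins, not the kernel's `struct ctl_table` or any real sysctl code. It shows the same function signature written once through a typedef and once with the `struct` keyword, which is the form that leaves no doubt that the line is a declaration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct ctl_table. */
struct demo_ctl_table {
	const char *procname;
	int mode;
};

/* The typedef form that checkpatch.pl warns about. */
typedef struct demo_ctl_table demo_ctl_table;

/*
 * With the typedef, a tool that does not know 'demo_ctl_table' is a type
 * only sees "identifier * identifier(" and may guess at a multiplication.
 */
static demo_ctl_table *alloc_with_typedef(demo_ctl_table *slot)
{
	slot->procname = "cpu0";
	slot->mode = 0555;
	return slot;
}

/* With the 'struct' keyword the line can only be a declaration. */
static struct demo_ctl_table *alloc_with_struct(struct demo_ctl_table *slot)
{
	slot->procname = "cpu0";
	slot->mode = 0555;
	return slot;
}
```

Both functions compile to the same thing; the difference is purely in how easy the source is to parse for humans and for pattern-based tools.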
Re: [PATCH 0/3] Make tasks always have non-zero pids
Pavel Emelianov [EMAIL PROTECTED] wrote:
| Some time ago Sukadev noticed that the vmlinux size has

Cedric pointed it out to me first :-)

| grown 5Kb due to merged pid namespaces. One of the big
| problems with it was fat inline functions. The other thing
| was noticed by Matt - the checks for task's pid to be not
| NULL take place and make the kernel grow due to inlining,
| but these checks are not always needed.
|
| In this series I introduce a static pid (dummy), according
| to Matt's proposal, which is assigned to tasks during the
| detach_pid and transfer_pid instead of NULL. This pid lives
| in the init pid namespace and has the id = 0, so all the
| task_xid_xnr() calls will still return 0 on a dead task.
|
| Places that get the struct pid from task either get it from
| the current (in this case they will never get this dummy),
| or use it to compare with some other value (so they will
| work the same for both NULL and dummy pids).
|
| This saves up to 340 bytes for i386 kernel with minimal
| config and probably more with more users of pids.
|
| Tested on i386 and x86_64 boxes. Tasks still live and die,
| namespaces and proc still work.
|
| Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

Acked-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
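The dummy-pid idea described above is an instance of the sentinel-object pattern. The following user-space sketch uses hypothetical types and function names (`demo_pid`, `demo_task`, `demo_detach_pid`), not the kernel's pid code. It contrasts a reader that must check for NULL with one that can dereference unconditionally because detaching installs a static placeholder whose id is 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
struct demo_pid {
	int nr;
};

struct demo_task {
	struct demo_pid *pid;
};

/* Static placeholder with id 0, assigned in place of NULL on detach. */
static struct demo_pid demo_dummy_pid = { .nr = 0 };

/* NULL-based variant: every reader has to test the pointer first. */
static int pid_nr_checked(const struct demo_task *t)
{
	return t->pid ? t->pid->nr : 0;
}

/* Sentinel-based variant: the pointer is always valid, no branch needed. */
static int pid_nr_unchecked(const struct demo_task *t)
{
	return t->pid->nr;
}

/* Detach points the task at the dummy instead of clearing the pointer. */
static void demo_detach_pid(struct demo_task *t)
{
	t->pid = &demo_dummy_pid;
}
```

When the accessor is inlined at many call sites, dropping the branch from each copy is where a size saving of the kind reported in the patch description would come from.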
Re: [PATCH 1/2] add tunable_notifier function
Vivek Goyal wrote:
> On Thu, Oct 04, 2007 at 08:38:34PM +0900, Takenori Nagano wrote:
>> This patch adds new notifier function tunable_notifier_chain. Its base is
>> atomic_notifier_chain.
>>
>> Thanks,
>>
>> ---
>>
>> Signed-off-by: Takenori Nagano <[EMAIL PROTECTED]>
>>
>> ---
>> diff -uprN linux-2.6.23-rc9.orig/include/linux/notifier.h linux-2.6.23-rc9/include/linux/notifier.h
>> --- linux-2.6.23-rc9.orig/include/linux/notifier.h	2007-10-02 12:24:52.0 +0900
>> +++ linux-2.6.23-rc9/include/linux/notifier.h	2007-10-03 14:48:04.28800 +0900
>> @@ -13,6 +13,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  /*
>>   * Notifier chains are of four types:
>> @@ -53,6 +54,14 @@ struct notifier_block {
>>  	int priority;
>>  };
>>
>> +struct tunable_notifier_block {
>> +	struct notifier_block *nb;
>> +	struct tunable_notifier_head *head;
>> +	struct dentry *dir;
>> +	struct dentry *pri_dentry;
>> +	struct dentry *desc_dentry;
>> +};
>> +
>
> Should this be tunable_atomic_notifier_block? I think there are two kind
> of lists. One where handlers have to be atomic and other one where handlers
> can be blocking one. I think you are making atomic one tunable. If that's
> the case it should be reflected in the naming everywhere.

Hi Vivek,

Yes, it is based on atomic_notifier_list. I think your opinion is reasonable.
I'll change the name tunable_notifier to tunable_atomic_notifier.

Thanks,

Takenori Nagano <[EMAIL PROTECTED]>
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
On Oct 05, 2007, at 00:45:17, Eric W. Biederman wrote:
> Kyle Moffett <[EMAIL PROTECTED]> writes:
>> On Oct 04, 2007, at 21:44:02, Eric W. Biederman wrote:
>>> SElinux is not all encompassing or it is generally incomprehensible
>>> I don't know which. Or someone long ago would have said a better way
>>> to implement containers was with a selinux ruleset, here is a selinux
>>> ruleset that does that. Although it is completely possible to
>>> implement all of the isolation with the existing LSM hooks as Serge
>>> showed.
>>
>> The difference between SELinux and containers is that SELinux (and
>> LSM as a whole) returns -EPERM to operations outside the scope of the
>> subject, whereas containers return -ENOENT (because it's not even in
>> the same namespace).
>
> Yes. However if you look at what the first implementations were.
> Especially something like linux-vserver. All they provided was
> isolation. So perhaps you would not see every process ps but they all
> had unique pid values.
>
> I'm pretty certain Serge at least prototyped a simplified version of
> that using the LSM hooks. Is there something I'm not remember in those
> hooks that allows hiding of information like processes?
>
> Yes. Currently with containers we are taking that one step farther as
> that solves a wider set of problems.

IMHO, containers have a subtly different purpose from LSM even though
both are about information hiding. Basically a container is information
hiding primarily for administrative reasons; either as a convenience to
help prevent errors or as a way of describing administrative boundaries.
For example, even in an environment where all sysadmins are trusted
employees, a few head-honcho sysadmins would get root container access,
and all others would get access to specific containers as a way of
preventing "oops" errors. Basically a container is about "full access
inside this box and no access outside".

By contrast, LSM is more strictly about providing *limited* access to
resources. For an accounting business all client records would be
grouped and associated together, however those which have passed this
year's review are read-only except by specific staff and others may have
information restricted to some subset of the employees.

So containers are exclusive subsets of "the system" while LSM should be
about non-exclusive information restriction.

>>> We also have in the kernel another parallel security mechanism (for
>>> what is generally a different class of operations) that has been
>>> quite successful, and different groups get along quite well, and
>>> ordinary mortals can understand it. The linux firewalling code.
>>
>> Well, I wouldn't go so far as the "ordinary mortals can understand
>> it" part; it's still pretty high on the obtuse-o-meter.
>
> True. Probably a more accurate statement is:`unix command line power
> users can and do handle it after reading the docs. That's not quite
> ordinary mortals but it feels like it some days. It might all be
> perception...

I have seen more *wrong* iptables firewalls than I've seen correct ones.
Securing TCP/IP traffic properly requires either a lot of
training/experience or a good out-of-the-box system like Shorewall which
structures the necessary restrictions for you based on an abstract
description of the desired functionality. For instance what percentage
of admins do you think could correctly set up their netfilter firewalls
to log christmas-tree packets, smurfs, etc without the help of some
external tool? Hell, I don't trust myself to reliably do it without a
lot of reading of docs and testing, and I've been doing netfilter
firewalls for a while.

The bottom line is that with iptables it is *CRITICAL* to have a good
set of interface tools to take the users' "My system is set up like..."
description in some form and turn it into the necessary set of efficient
security rules. The *exact* same issue applies to SELinux, with 2 major
additional problems:

1) Half the tools are still somewhat beta-ish and under heavy
   development. Furthermore the semi-official reference policy is
   nowhere near comprehensive and pretty ugly to read (go back to the
   point about the tools being beta-ish).

2) If you break your system description or translation tools then
   instead of just your network dying your entire *system* dies.

>>> The linux firewalling codes has hooks all throughout the networking
>>> stack, just like the LSM has hooks all throughout the rest of linux
>>> kernel. There is a difference however. The linux firewalling code in
>>> addition to hooks has tables behind those hooks that it consults.
>>> There is generic code to walk those tables and consult with
>>> different kernel modules to decide if we should drop a packet. Each
>>> of those kernel modules provides a different capability that can be
>>> used to generate a firewall.
>>
>> This is almost *EXACTLY* what SELinux provides as an LSM module. The
>> one difference is that with
Re: [PATCH 2/2] implement new notifier function to panic_notifier_list
On Thu, Oct 04, 2007 at 08:38:50PM +0900, Takenori Nagano wrote:
> This patch implements new notifier function to panic_notifier_list. We can
> change the list of order by debugfs.
>
> Thanks,
>
> ---
>
> Signed-off-by: Takenori Nagano <[EMAIL PROTECTED]>
>
> ---
> diff -uprN linux-2.6.23-rc9.orig/arch/alpha/kernel/setup.c linux-2.6.23-rc9/arch/alpha/kernel/setup.c
> --- linux-2.6.23-rc9.orig/arch/alpha/kernel/setup.c	2007-10-02 12:24:52.0 +0900
> +++ linux-2.6.23-rc9/arch/alpha/kernel/setup.c	2007-10-04 09:49:34.44000 +0900
> @@ -45,14 +45,22 @@
>  #include
>  #include
>
> -extern struct atomic_notifier_head panic_notifier_list;
> +extern struct tunable_notifier_head panic_notifier_list;
>  static int alpha_panic_event(struct notifier_block *, unsigned long, void *);
> -static struct notifier_block alpha_panic_block = {
> +static struct notifier_block alpha_panic_block_base = {
>  	alpha_panic_event,
>  	NULL,
>  	INT_MAX /* try to do it first */
>  };
>
> +static struct tunable_notifier_block alpha_panic_block = {
> +	&alpha_panic_block_base,
> +	NULL,
> +	NULL,
> +	NULL,
> +	NULL
> +};
> +
>  #include
>  #include
>  #include
> @@ -522,8 +530,8 @@ setup_arch(char **cmdline_p)
>  	}
>
>  	/* Register a call for panic conditions. */
> -	atomic_notifier_chain_register(&panic_notifier_list,
> -			&alpha_panic_block);
> +	tunable_notifier_chain_register(&panic_notifier_list,
> +			&alpha_panic_block, "alpha_panic", NULL);
>

I think it might be a good idea to somehow make provision for a help
string. This help string would tell the admin what a registered user
does. Ideally this should be visible in the
/sys/kernel/debug//description file.

This kind of description can help the admin decide the priority among
the various registered users without having to look at the source code.

Thanks
Vivek
Re: [PATCH 0/2] add new notifier function
Vivek Goyal wrote:
> On Thu, Oct 04, 2007 at 08:38:05PM +0900, Takenori Nagano wrote:
>
> In summary, right now co-existence of kdb with kdump seems to be your pain
> point. I would prefer that kdb just puts a break point on panic() and we move
> on. If there are more candidates down the line and these can't be easily
> executed in second kernel then we can re-visit this notification list
> mechanism.

Hi Vivek,

Thank you for your comment. :-)

I am not concerned about the kdb and kdump problem for now, because my
patches are not merged into the mainline kernel yet. If they are merged,
I will think about how we can resolve the RAS tools problem.

>> # ls
>> ipmi_msghandler  ipmi_wdog
>> # cat ipmi_msghandler/priority
>> 200
>> # cat ipmi_wdog/priority
>> 150
>> #
>> Kernel panic - not syncing: panic
>> ipmi_msghandler : notifier calls panic_event().
>> ipmi_watchdog : notifier calls wdog_panic_handler().
>>
>> .(reboot)
>>
>
> We also need to implement a file which can give a consolidated view. All
> the registered members and their priority.

I tried to implement it, but its impact is large. And we can already get
all priority values using "ls" and "cat */priority". I'll implement it if
users strongly expect it.

ex)
# cd panic_notifier_list
# ls
ipmi_msghandler  ipmi_wdog
# cat */priority
200
150
#

Thanks,

Takenori Nagano <[EMAIL PROTECTED]>
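The priority values in the listing above decide the order in which handlers run. A minimal user-space sketch of a priority-ordered chain whose entries can be re-prioritised follows. All names here (`demo_notifier`, `demo_register`, `demo_set_priority`) are hypothetical and are not the interface proposed in the patch; the sketch also ignores locking, which a real atomic notifier chain cannot do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; names are illustrative only. */
struct demo_notifier {
	const char *name;
	int priority;
	struct demo_notifier *next;
};

/* Insert so that higher priorities come first in the list. */
static void demo_register(struct demo_notifier **head, struct demo_notifier *n)
{
	while (*head && (*head)->priority >= n->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Unlink an entry if it is on the list. */
static void demo_unregister(struct demo_notifier **head, struct demo_notifier *n)
{
	for (; *head; head = &(*head)->next) {
		if (*head == n) {
			*head = n->next;
			n->next = NULL;
			return;
		}
	}
}

/* Changing a priority means re-inserting the entry at its new position. */
static void demo_set_priority(struct demo_notifier **head,
			      struct demo_notifier *n, int priority)
{
	demo_unregister(head, n);
	n->priority = priority;
	demo_register(head, n);
}
```

With entries at 200 and 150, the 200 entry is called first, matching the panic output quoted above; raising the second entry above 200 would move it to the front.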
Re: [PATCH 1/2] add tunable_notifier function
On Thu, Oct 04, 2007 at 08:38:34PM +0900, Takenori Nagano wrote:
> This patch adds new notifier function tunable_notifier_chain. Its base is
> atomic_notifier_chain.
>
> Thanks,
>
> ---
>
> Signed-off-by: Takenori Nagano <[EMAIL PROTECTED]>
>
> ---
> diff -uprN linux-2.6.23-rc9.orig/include/linux/notifier.h linux-2.6.23-rc9/include/linux/notifier.h
> --- linux-2.6.23-rc9.orig/include/linux/notifier.h	2007-10-02 12:24:52.0 +0900
> +++ linux-2.6.23-rc9/include/linux/notifier.h	2007-10-03 14:48:04.28800 +0900
> @@ -13,6 +13,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  /*
>   * Notifier chains are of four types:
> @@ -53,6 +54,14 @@ struct notifier_block {
>  	int priority;
>  };
>
> +struct tunable_notifier_block {
> +	struct notifier_block *nb;
> +	struct tunable_notifier_head *head;
> +	struct dentry *dir;
> +	struct dentry *pri_dentry;
> +	struct dentry *desc_dentry;
> +};
> +

Should this be tunable_atomic_notifier_block? I think there are two kind
of lists. One where handlers have to be atomic and other one where handlers
can be blocking one. I think you are making atomic one tunable. If that's
the case it should be reflected in the naming everywhere.

Thanks
Vivek
Re: Linux 2.6.23-rc9 and a heads-up for the 2.6.24 series..
* Glauber de Oliveira Costa <[EMAIL PROTECTED]> wrote:

> On 10/2/07, Alistair John Strachan <[EMAIL PROTECTED]> wrote:
> > This is certainly a tool issue, but if I use Debian's kernel-image
> > "make-kpkg" wrapper around the kernel build system, it fails with:
> >
> > cp: cannot stat `arch/x86_64/boot/bzImage': No such file or directory
> >
> > Obviously, this file has moved to arch/x86/boot, but it seems like possibly
> > unnecessary breakage. I've been copying bzImage for years from
> > arch/x86_64/boot, and I'm sure there's a handful of scripts (other than
> > Debian's kernel-image) doing this too.
>
> I believe most sane tools would be using the output of uname -m, so a
> possible way to fix this would be fixing the data passed to userspace
> from uname. However, that might be the case that it creates a new set
> of problems too, with tools relying on the output of uname -m to
> determine wheter the machine is 32 or 64 bit, and so on.

there are two problems with the use of uname -m:

 - the build machine architecture is not necessarily the same as the
   target architecture. (for example i cross-compile all my 32-bit
   kernels on a 64-bit box.)

 - we kept uname -m compatible. multilib depends on it, and other pieces
   of userspace as well. So uname -m still outputs 'i386' on 32-bit and
   'x86_64' on 64-bit - not 'x86'.

a symlink looks like the best solution to me.

	Ingo
Re: Linux 2.6.23-rc9 and a heads-up for the 2.6.24 series..
* Alistair John Strachan <[EMAIL PROTECTED]> wrote:

> On Tuesday 02 October 2007 04:41:49 Linus Torvalds wrote:
> [snip]
> > In other words, people who know they may be affected and would want to
> > prepare can look at (for example)
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86.git x86
> >
> > and generally get ready for the switch-over.
>
> This is certainly a tool issue, but if I use Debian's kernel-image
> "make-kpkg" wrapper around the kernel build system, it fails with:
>
> cp: cannot stat `arch/x86_64/boot/bzImage': No such file or directory
>
> Obviously, this file has moved to arch/x86/boot, but it seems like
> possibly unnecessary breakage. I've been copying bzImage for years
> from arch/x86_64/boot, and I'm sure there's a handful of scripts
> (other than Debian's kernel-image) doing this too.
>
> For now, I hacked the tool[1]. Maybe, if we care, a symlink could be
> set up between arch/x86/boot and arch/$ARCH/boot ? Or would papering
> over this be more trouble than it's worth?

yeah, a symlink is the right solution i think. Our first-step goal is to
make the switchover seamless for all practical purposes, and a
compatibility symlink in arch/i386/boot/ will not hurt. (we shouldnt
worry about the really old zImage target though)

	Ingo
[PATCH RFC 2/2] IRQ: Modularize the setup_irq code (2)
Introduce irq_desc_match_fist_irqaction() to support setup_irq()
code modularity.

Signed-off-by: Ahmed S. Darwish <[EMAIL PROTECTED]>
---

Any ideas for a better method name ?

 manage.c | 89 ---
 1 file changed, 51 insertions(+), 38 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 6a0d778..4e96d56 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -293,6 +293,55 @@ int can_add_irqaction_on_allocated_irq(unsigned int irq, struct irqaction *new)
 }
 
 /*
+ * Configure the passed irq descriptor to satisfy our first newly
+ * added irqaction needs
+ * must be called with the irq_desc[irq]->lock held
+ */
+void irq_desc_match_fist_irqaction(unsigned int irq, struct irqaction *new)
+{
+	struct irq_desc *desc = irq_desc + irq;
+
+	/* We must be the first and the only irqaction */
+	BUG_ON(desc->action != new || new->next);
+
+	irq_chip_set_defaults(desc->chip);
+
+#if defined(CONFIG_IRQ_PER_CPU)
+	if (new->flags & IRQF_PERCPU)
+		desc->status |= IRQ_PER_CPU;
+#endif
+
+	/* Setup the type (level, edge polarity) if configured: */
+	if (new->flags & IRQF_TRIGGER_MASK) {
+		if (desc->chip && desc->chip->set_type)
+			desc->chip->set_type(irq,
+				new->flags & IRQF_TRIGGER_MASK);
+		else
+			/*
+			 * IRQF_TRIGGER_* but the PIC does not support
+			 * multiple flow-types?
+			 */
+			printk(KERN_WARNING "No IRQF_TRIGGER set_type "
+			       "function for IRQ %d (%s)\n", irq,
+			       desc->chip ? desc->chip->name : "unknown");
+	} else
+		compat_irq_chip_set_default_handler(desc);
+
+	desc->status &= ~(IRQ_AUTODETECT | IRQ_WAITING | IRQ_INPROGRESS);
+
+	if (!(desc->status & IRQ_NOAUTOEN)) {
+		desc->depth = 0;
+		desc->status &= ~IRQ_DISABLED;
+		if (desc->chip->startup)
+			desc->chip->startup(irq);
+		else
+			desc->chip->enable(irq);
+	} else
+		/* Undo nested disables: */
+		desc->depth = 1;
+}
+
+/*
  * Internal function to register an irqaction - typically used to
  * allocate special interrupts that are part of the architecture.
  */
@@ -352,45 +401,9 @@ int setup_irq(unsigned int irq, struct irqaction *new)
 	if (new->flags & IRQF_NOBALANCING)
 		desc->status |= IRQ_NO_BALANCING;
 
-	if (!shared) {
-		irq_chip_set_defaults(desc->chip);
+	if (!shared)
+		irq_desc_match_fist_irqaction(irq, new);
 
-#if defined(CONFIG_IRQ_PER_CPU)
-		if (new->flags & IRQF_PERCPU)
-			desc->status |= IRQ_PER_CPU;
-#endif
-
-		/* Setup the type (level, edge polarity) if configured: */
-		if (new->flags & IRQF_TRIGGER_MASK) {
-			if (desc->chip && desc->chip->set_type)
-				desc->chip->set_type(irq,
-						new->flags & IRQF_TRIGGER_MASK);
-			else
-				/*
-				 * IRQF_TRIGGER_* but the PIC does not support
-				 * multiple flow-types?
-				 */
-				printk(KERN_WARNING "No IRQF_TRIGGER set_type "
-				       "function for IRQ %d (%s)\n", irq,
-				       desc->chip ? desc->chip->name :
-				       "unknown");
-		} else
-			compat_irq_chip_set_default_handler(desc);
-
-		desc->status &= ~(IRQ_AUTODETECT | IRQ_WAITING |
-				  IRQ_INPROGRESS);
-
-		if (!(desc->status & IRQ_NOAUTOEN)) {
-			desc->depth = 0;
-			desc->status &= ~IRQ_DISABLED;
-			if (desc->chip->startup)
-				desc->chip->startup(irq);
-			else
-				desc->chip->enable(irq);
-		} else
-			/* Undo nested disables: */
-			desc->depth = 1;
-	}
 	/* Reset broken irq detection when installing new handler */
 	desc->irq_count = 0;
 	desc->irqs_unhandled = 0;

-- 
Ahmed S. Darwish
HomePage: http://darwish.07.googlepages.com
Blog: http://darwish-07.blogspot.com
[PATCH RFC 1/2] IRQ: Modularize the setup_irq code (1)
Hi Thomas/lkml,

setup_irq() code contains a big chunk of 130 code lines that can be
divided into several smaller methods. These 2 patches introduce those
small functions to aid toward setup_irq() code modularity.

No major code logic changes exist. Patches can be applied cleanly over
v2.6.23-rc9.

Thanks,

==> (Description for Logs)

Introduce can_add_irqaction_on_allocated_irq and
warn_about_irqaction_mismatch methods to support setup_irq()
code modularity.

Signed-off-by: Ahmed S. Darwish <[EMAIL PROTECTED]>
---

 manage.c | 92 +--
 1 file changed, 55 insertions(+), 37 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 7230d91..6a0d778 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -248,6 +248,50 @@ void compat_irq_chip_set_default_handler(struct irq_desc *desc)
 		desc->handle_irq = NULL;
 }
 
+static inline void warn_about_irqaction_mismatch(unsigned int irq,
+						 struct irqaction *new)
+{
+#ifdef CONFIG_DEBUG_SHIRQ
+	const char *name = irq_desc[irq].action->name;
+	/* If device doesn't expect the mismatch */
+	if (!(new->flags & IRQF_PROBE_SHARED)) {
+		printk(KERN_ERR "IRQ handler type mismatch for IRQ %d\n", irq);
+		if (name)
+			printk(KERN_ERR "current handler: %s\n", name);
+		dump_stack();
+	}
+#endif
+}
+
+/*
+ * Test if an irqaction can be added to the passed allocated IRQ line
+ * Must be called with the irq_desc[irq]->lock held.
+ */
+int can_add_irqaction_on_allocated_irq(unsigned int irq, struct irqaction *new)
+{
+	struct irqaction *old = irq_desc[irq].action;
+
+	BUG_ON(!old);
+	/*
+	 * Can't share interrupts unless both agree to and are
+	 * the same type (level, edge, polarity). So both flag
+	 * fields must have IRQF_SHARED set and the bits which
+	 * set the trigger type must match.
+	 */
+	if (!((old->flags & new->flags) & IRQF_SHARED) ||
+	    ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK))
+		return 0;
+
+#if defined(CONFIG_IRQ_PER_CPU)
+	/* All handlers must agree on per-cpuness */
+	if ((old->flags & IRQF_PERCPU) !=
+	    (new->flags & IRQF_PERCPU))
+		return 0;
+#endif
+
+	return 1;
+}
+
 /*
  * Internal function to register an irqaction - typically used to
  * allocate special interrupts that are part of the architecture.
@@ -256,7 +300,6 @@ int setup_irq(unsigned int irq, struct irqaction *new)
 {
 	struct irq_desc *desc = irq_desc + irq;
 	struct irqaction *old, **p;
-	const char *old_name = NULL;
 	unsigned long flags;
 	int shared = 0;
 
@@ -289,31 +332,18 @@ int setup_irq(unsigned int irq, struct irqaction *new)
 	p = &desc->action;
 	old = *p;
 	if (old) {
-		/*
-		 * Can't share interrupts unless both agree to and are
-		 * the same type (level, edge, polarity). So both flag
-		 * fields must have IRQF_SHARED set and the bits which
-		 * set the trigger type must match.
-		 */
-		if (!((old->flags & new->flags) & IRQF_SHARED) ||
-		    ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK)) {
-			old_name = old->name;
-			goto mismatch;
-		}
-
-#if defined(CONFIG_IRQ_PER_CPU)
-		/* All handlers must agree on per-cpuness */
-		if ((old->flags & IRQF_PERCPU) !=
-		    (new->flags & IRQF_PERCPU))
-			goto mismatch;
-#endif
-
-		/* add new interrupt at end of irq queue */
-		do {
-			p = &old->next;
-			old = *p;
-		} while (old);
 		shared = 1;
+		if (can_add_irqaction_on_allocated_irq(irq, new)) {
+			/* add new interrupt at end of irq queue */
+			do {
+				p = &old->next;
+				old = *p;
+			} while (old);
+		} else {
+			warn_about_irqaction_mismatch(irq, new);
+			spin_unlock_irqrestore(&desc->lock, flags);
+			return -EBUSY;
+		}
 	}
 
 	*p = new;
@@ -372,18 +402,6 @@ int setup_irq(unsigned int irq, struct irqaction *new)
 	register_handler_proc(irq, new);
 
 	return 0;
-
-mismatch:
-#ifdef CONFIG_DEBUG_SHIRQ
-	if (!(new->flags & IRQF_PROBE_SHARED)) {
-		printk(KERN_ERR "IRQ handler type mismatch for IRQ %d\n", irq);
-		if (old_name)
-			printk(KERN_ERR "current handler: %s\n", old_name);
-		dump_stack();
-	}
-#endif
-	spin_unlock_irqrestore(&desc->lock, flags);
-	return -EBUSY;
 }
 
 /**
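The sharing test factored out by this patch relies on two bit tricks: an AND across both flag words to require that each side asked for sharing, and an XOR to detect any trigger bit that differs. A stand-alone sketch follows. The `DEMO_*` flag values are made up for illustration and should not be taken as the kernel's real `IRQF_*` values, and the per-cpu check is left out.

```c
#include <assert.h>

/* Illustrative flag values only; the kernel's real values may differ. */
#define DEMO_IRQF_TRIGGER_RISING  0x01u
#define DEMO_IRQF_TRIGGER_FALLING 0x02u
#define DEMO_IRQF_TRIGGER_MASK    0x0fu
#define DEMO_IRQF_SHARED          0x80u

/* Returns 1 if a new handler may join an already-claimed line, else 0. */
static int demo_can_share(unsigned int old_flags, unsigned int new_flags)
{
	/* Both sides must have asked for sharing. */
	if (!((old_flags & new_flags) & DEMO_IRQF_SHARED))
		return 0;
	/* XOR exposes any trigger bit set on one side but not the other. */
	if ((old_flags ^ new_flags) & DEMO_IRQF_TRIGGER_MASK)
		return 0;
	return 1;
}
```

Returning a plain yes/no, as the patch does, lets the caller keep the unlock and the -EBUSY return in one place instead of jumping to a shared error label.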
Re: [PATCH 1/2] add tunable_notifier function
Randy Dunlap wrote:
> On Thu, 04 Oct 2007 20:38:34 +0900 Takenori Nagano wrote:
>> diff -uprN linux-2.6.23-rc9.orig/kernel/sys.c linux-2.6.23-rc9/kernel/sys.c
>> --- linux-2.6.23-rc9.orig/kernel/sys.c	2007-10-02 12:24:52.0 +0900
>> +++ linux-2.6.23-rc9/kernel/sys.c	2007-10-03 14:48:15.16000 +0900
>> @@ -38,6 +38,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include
>>  #include
>> @@ -393,6 +394,234 @@ int blocking_notifier_call_chain(struct
>
>> +/**
>> + * tunable_notifier_chain_register - Add notifier to an tunable notifier chain
>> + * @nh: Pointer to head of the tunable notifier chain
>> + * @n: New entry in notifier chain
>> + * @name: Pointer to the name of this notifier chain
>
> Is @name the name of a notifier chain or of the new notifier entry?

Hi Randy,

@name: Pointer to the name of the new notifier entry.
I'll change the explanation.

Thanks,
Re: [PATCH 2/2] implement new notifier function to panic_notifier_list
Randy Dunlap wrote:
> On Thu, 04 Oct 2007 20:38:50 +0900 Takenori Nagano wrote:
>
>> This patch implements new notifier function to panic_notifier_list. We can
>> change the list of order by debugfs.
>>
>> Thanks,
>>
>> ---
>>
>> Signed-off-by: Takenori Nagano <[EMAIL PROTECTED]>
>>
>> ---
>>   * Returns seconds, approximately. We don't need nanosecond
>>   * resolution, and we don't need to waste time with a big divide when
>> @@ -193,5 +201,6 @@ __init void spawn_softlockup_task(void)
>>  	cpu_callback(_nfb, CPU_ONLINE, cpu);
>>  	register_cpu_notifier(_nfb);
>>
>> -	atomic_notifier_chain_register(_notifier_list, _block);
>> +	tunable_notifier_chain_register(_notifier_list, _block,
>> +					"softlookup", NULL);
>>  }
>
> "softlockup"

Hi Randy,

Thank you for reviewing. :)
I'll fix it in the next version.
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
Kyle Moffett <[EMAIL PROTECTED]> writes:

> On Oct 04, 2007, at 21:44:02, Eric W. Biederman wrote:
>> What we want from the LSM is the ability to say -EPERM when we can clearly
>> articulate that we want to disallow something.
>
> This sort of depends on perspective; typically with security infrastructure
> you actually want "... the ability to return success when we can clearly
> articulate that we want to *ALLOW* something". File permissions work this
> way; we don't have a list of forbidden users attached to each file, we have
> an owner, a group, and a mode representing positive permissions. With that
> said in certain high-risk environments you need something even stronger
> that cannot be changed by the "owner" of the file, if we don't entirely
> trust them,

Yes. However last I looked at the LSM hooks we first do the normal unix
permission checks. Then we run the hook. So it can only increase the
number of times we say -EPERM.

>> SElinux is not all encompassing or it is generally incomprehensible I don't
>> know which. Or someone long ago would have said a better way to implement
>> containers was with a selinux ruleset, here is a selinux ruleset that does
>> that. Although it is completely possible to implement all of the isolation
>> with the existing LSM hooks as Serge showed.
>
> The difference between SELinux and containers is that SELinux (and LSM as a
> whole) returns -EPERM to operations outside the scope of the subject, whereas
> containers return -ENOENT (because it's not even in the same namespace).

Yes. However if you look at what the first implementations were.
Especially something like linux-vserver. All they provided was
isolation. So perhaps you would not see every process ps but they all
had unique pid values.

I'm pretty certain Serge at least prototyped a simplified version of
that using the LSM hooks. Is there something I'm not remembering in
those hooks that allows hiding of information like processes?

Yes. Currently with containers we are taking that one step farther as
that solves a wider set of problems.

>> We also have in the kernel another parallel security mechanism (for what is
>> generally a different class of operations) that has been quite successful,
>> and different groups get along quite well, and ordinary mortals can
>> understand it. The linux firewalling code.
>
> Well, I wouldn't go so far as the "ordinary mortals can understand it" part;
> it's still pretty high on the obtuse-o-meter.

True. Probably a more accurate statement is:`unix command line power
users can and do handle it after reading the docs. That's not quite
ordinary mortals but it feels like it some days. It might all be
perception...

>> The linux firewalling codes has hooks all throughout the networking stack,
>> just like the LSM has hooks all throughout the rest of linux kernel. There
>> is a difference however. The linux firewalling code in addition to hooks has
>> tables behind those hooks that it consults. There is generic code to walk
>> those tables and consult with different kernel modules to decide if we should
>> drop a packet. Each of those kernel modules provides a different capability
>> that can be used to generate a firewall.
>
> This is almost *EXACTLY* what SELinux provides as an LSM module. The one
> difference is that with SELinux some compromises and restrictions have been
> made so that (theoretically) the resulting policy can be exhaustively
> analyzed to *prove* what it allows and disallows. It may be that SELinux
> should be split into 2 parts, one that provides the underlying
> table-matching and the other that uses it to provide the provability
> guarantees. Here's a direct comparison:
>
> netfilter:
> (A) Each packet has src, dst, port, etc that can be matched
> (B) Table of rules applied sequentially (MATCH => ACTION)
> (C) Rules may alter the properties of packets as they are routed/
>     bridged/etc
>
> selinux:
> (A) Each object has user, role, and type that can be matched
> (B) Table of rules searched by object parameters (MATCH => allow/
>     auditallow/transition)
> (C) Rules may alter the properties of objects through transition rules.

Ok. There is something here. However in a generic setup, at least role
would be an extended match criteria provided by the selinux module. It
would not be a core attribute. It would need to depend on some extra
functionality being compiled in.

>> I'm not yet annoyed enough to go implement an iptables like interface to the
>> LSM enhancing it with more generic mechanism to make the problem simpler, but
>> I'm getting there. Perhaps next time I'm bored.
>
> I think a fair amount of what we need is already done in SELinux, and efforts
> would be better spent in figuring out what seems too complicated in SELinux
> and making it simpler. Probably a fair amount of that just means better
> tools.

How about thinking of it another way. Perform the split up you
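The "table of rules applied sequentially (MATCH => ACTION)" model from the netfilter side of the comparison can be sketched in a few lines of C. This is a toy first-match walker with hypothetical names (`demo_rule`, `demo_walk`) and a made-up wildcard convention. It is not netfilter's API and it leaves out chains, targets, and packet mangling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a first-match rule table; not netfilter's API. */
enum demo_verdict { DEMO_ACCEPT, DEMO_DROP };

struct demo_packet {
	unsigned int src;
	unsigned int dst;
	unsigned short port;
};

struct demo_rule {
	/* A zero field acts as a wildcard in this sketch. */
	unsigned int src;
	unsigned int dst;
	unsigned short port;
	enum demo_verdict action;
};

static int demo_rule_matches(const struct demo_rule *r,
			     const struct demo_packet *p)
{
	return (!r->src || r->src == p->src) &&
	       (!r->dst || r->dst == p->dst) &&
	       (!r->port || r->port == p->port);
}

/* Walk the table in order; the first matching rule decides. */
static enum demo_verdict demo_walk(const struct demo_rule *rules, size_t n,
				   const struct demo_packet *p,
				   enum demo_verdict policy)
{
	for (size_t i = 0; i < n; i++)
		if (demo_rule_matches(&rules[i], p))
			return rules[i].action;
	return policy;
}
```

Because the walk is sequential, rule order matters: a packet from source 10 to port 22 is accepted if the port rule comes before the source rule, and dropped if the two are swapped.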
Re: [PATCH 0/2] add new notifier function
On Thu, Oct 04, 2007 at 08:38:05PM +0900, Takenori Nagano wrote:
> Hi,
>
> These patches add new notifier function and implement it to
> panic_notifier_list.
> We used the hardcoded notifier chain so far, but it was not flexible. New
> notifier is very flexible, because user can change a list of order by debugfs.
>

Hi Takenori,

There were some more discussions regarding a configurable notifier list. Following is the link. Please go through it.

http://marc.info/?l=linux-kernel&m=118968996202991&w=2

Not everybody is too happy about it. Personally I am not against it. My take is that after panic() there is no guarantee that all the registered notifiers will be executed, just that the kernel will try its best. If a notifier handler is written badly, the kernel can't do much about it. It is left to the administrator to decide what he considers most important and to assign priorities accordingly. So if kdump is of utmost priority, then the administrator should give the highest priority to kdump.

Having said that, what are the RAS tools which require this infrastructure? Currently kdb seems to be the only candidate which needs to run in the crashing kernel. The rest of the actions can be performed in the second kernel. If that is the case, then it is probably better that kdb puts a break point on panic(), as suggested by Eric, and the rest of the post-panic actions are executed in the second kernel.

Executing the rest of the actions in the second kernel has both pros and cons. It makes things more reliable. At the same time it makes things a little complex: first, one needs to pass all the required configuration information to the second kernel; secondly, all the notification handlers need to be ready to run in two contexts. These handlers will run in the context of the first kernel if kdump is not configured, otherwise they will need to run in the second kernel.

In summary, right now co-existence of kdb with kdump seems to be your pain point. I would prefer that kdb just puts a break point on panic() and we move on. If there are more candidates down the line and these can't be easily executed in the second kernel, then we can re-visit this notification list mechanism.

> Please review, and give some comments.
>
> Thanks,
>
> Example)
>
> # cd /sys/kernel/debug/
> # ls
> kprobes  pktcdvd
> # insmod ipmi_msghandler.ko
> # ls
> kprobes  panic_notifier_list  pktcdvd
> # cd panic_notifier_list/
> # ls
> ipmi_msghandler
> # insmod ipmi_watchdog.ko
> # ls
> ipmi_msghandler  ipmi_wdog
> # cat ipmi_msghandler/priority
> 200
> # cat ipmi_wdog/priority
> 150
> #
> Kernel panic - not syncing: panic
> ipmi_msghandler : notifier calls panic_event().
> ipmi_watchdog : notifier calls wdog_panic_handler().
>
> .(reboot)
>

We also need to implement a file which can give a consolidated view of all the registered members and their priorities.

Thanks
Vivek
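The priority ordering discussed in this thread can be illustrated with a small userspace sketch. It is only a simplified model of a list kept sorted by descending priority, not the kernel's notifier chain code; the names `struct notifier`, `notifier_register`, `notifier_call_all` and `run_example` are hypothetical, and the priorities 200 and 150 are borrowed from the example output quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a priority-ordered notifier list. */
struct notifier {
	const char *name;
	int priority;                          /* higher value runs earlier */
	void (*call)(struct notifier *self);
	struct notifier *next;
};

/* Priorities of the handlers in the order they ran (demonstration only). */
static int call_log[8];
static size_t call_count;

static void log_call(struct notifier *self)
{
	if (call_count < sizeof(call_log) / sizeof(call_log[0]))
		call_log[call_count++] = self->priority;
}

/* Insert so that the list stays sorted by descending priority. */
static void notifier_register(struct notifier **head, struct notifier *n)
{
	assert(head != NULL && n != NULL);
	while (*head && (*head)->priority >= n->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Walk the list in order; best effort, each handler is simply invoked. */
static void notifier_call_all(struct notifier *head)
{
	for (; head; head = head->next)
		head->call(head);
}

/* Register the lower-priority handler first, then run the whole list. */
static size_t run_example(void)
{
	struct notifier wdog = { "ipmi_wdog", 150, log_call, NULL };
	struct notifier msgh = { "ipmi_msghandler", 200, log_call, NULL };
	struct notifier *head = NULL;

	call_count = 0;
	notifier_register(&head, &wdog);
	notifier_register(&head, &msgh);
	notifier_call_all(head);
	return call_count;
}
```

Calling `run_example()` runs the priority-200 handler before the priority-150 one regardless of registration order, which is the property an administrator would rely on when giving kdump the highest priority.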
[PATCH] LOCKDEP: fix mismatched lockdep_depth/curr_chain_hash
Hi Ingo,

I am seeing a problem on the latest -rt where lockdep completely overwhelms the system to the point that it grinds to a halt on large (8-way+) systems. The problem seems to be that the class->locks_before and locks_after lists grow unbounded (I have observed over 1M entries in them), so a lock_acquire call can take over 10 seconds to finish resolving.

Related to this seems to be that lockdep appears to see a chain-hash miss over and over for what I would assume should be an established graph (for instance, in double_lock_balance() in an rt_overload condition). Turning off PROVE_LOCKING (statically, or by setting debug_locks=0 dynamically) restores the system to normal behavior.

I took some time tonight to study lockdep (it is quite an impressive body of code!), and came up with the following "fix". It does improve things significantly by addressing what I believe is the issue with the cache-misses (though it would appear there are still a few more issues there that need addressing, as some boots are still very lethargic).

I use the term "fix" loosely since I am not confident that I fully understand the intention of your logic here, so I can't say for sure if it was really broken, or if I have made it worse ;) Could you comment on what I have done here, or offer any advice on what to look for elsewhere? I based the patch on pure linux-2.6.git since I see the same issue (by visual inspection, that is) there as well.

Thanks in advance!
-Greg

--

LOCKDEP: fix mismatched lockdep_depth/curr_chain_hash

It is possible for the current->curr_chain_key to become inconsistent with the current index if the chain fails to validate. The end result is that future lock_acquire() operations may inadvertently fail to find a hit in the cache, resulting in a new node being added to the graph for every acquire.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/lockdep.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 734da57..efb0d7e 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -2450,11 +2450,11 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 		chain_head = 1;
 	}
 	chain_key = iterate_chain_key(chain_key, id);
-	curr->curr_chain_key = chain_key;
 
 	if (!validate_chain(curr, lock, hlock, chain_head))
 		return 0;
 
+	curr->curr_chain_key = chain_key;
 	curr->lockdep_depth++;
 	check_chain_key(curr);
 #ifdef CONFIG_DEBUG_LOCKDEP
[PATCH] LOCKDEP: fix mismatched lockdep_depth/curr_chain_hash
Doh! I guess there should be a rule about sending patches out after midnight ;)

The original patch I worked on was written before the code was moved to validate_chain(), so my previous posting didn't quite translate when I merged with git HEAD. Here is an updated patch. Sorry for the confusion.

Regards,
-Greg

---

 kernel/lockdep.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 734da57..42ae4a5 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -1521,7 +1521,7 @@ cache_hit:
 }
 
 static int validate_chain(struct task_struct *curr, struct lockdep_map *lock,
-			  struct held_lock *hlock, int chain_head)
+			  struct held_lock *hlock, int chain_head, u64 chain_key)
 {
 	/*
 	 * Trylock needs to maintain the stack of held locks, but it
@@ -1534,7 +1534,7 @@ static int validate_chain(struct task_struct *curr, struct lockdep_map *lock,
 	 * graph_lock for us)
 	 */
 	if (!hlock->trylock && (hlock->check == 2) &&
-			lookup_chain_cache(curr->curr_chain_key, hlock->class)) {
+			lookup_chain_cache(chain_key, hlock->class)) {
 		/*
 		 * Check whether last held lock:
 		 *
@@ -1576,7 +1576,7 @@ static int validate_chain(struct task_struct *curr, struct lockdep_map *lock,
 #else
 static inline int validate_chain(struct task_struct *curr,
 		struct lockdep_map *lock, struct held_lock *hlock,
-		int chain_head)
+		int chain_head, u64 chain_key)
 {
 	return 1;
 }
@@ -2450,11 +2450,11 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 		chain_head = 1;
 	}
 	chain_key = iterate_chain_key(chain_key, id);
-	curr->curr_chain_key = chain_key;
 
-	if (!validate_chain(curr, lock, hlock, chain_head))
+	if (!validate_chain(curr, lock, hlock, chain_head, chain_key))
 		return 0;
 
+	curr->curr_chain_key = chain_key;
 	curr->lockdep_depth++;
 	check_chain_key(curr);
 #ifdef CONFIG_DEBUG_LOCKDEP
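The ordering problem described in the patch can be reproduced in a toy userspace model. The sketch below is an assumption-laden simplification, not lockdep itself: `struct task_model`, `mix_key`, and the two `acquire_*` helpers are hypothetical names, `mix_key` is not the kernel's iterate_chain_key(), and the `validate_ok` flag merely stands in for the result of validate_chain().

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DEPTH 8

/* Toy stand-in for the per-task lockdep state. */
struct task_model {
	unsigned int depth;              /* number of held locks */
	unsigned int held[MAX_DEPTH];    /* class ids of the held locks */
	uint64_t chain_key;              /* cached hash over held[0..depth) */
};

/* Toy mixing step; NOT the kernel's iterate_chain_key(). */
static uint64_t mix_key(uint64_t key, unsigned int id)
{
	return (key << 7) ^ (key >> 57) ^ id;
}

/* Recompute the key from the held stack and compare it with the cache. */
static int key_is_consistent(const struct task_model *t)
{
	uint64_t key = 0;
	unsigned int i;

	for (i = 0; i < t->depth; i++)
		key = mix_key(key, t->held[i]);
	return key == t->chain_key;
}

/* Old ordering: the cached key is updated before validation. */
static int acquire_store_first(struct task_model *t, unsigned int id,
			       int validate_ok)
{
	assert(t->depth < MAX_DEPTH);
	t->chain_key = mix_key(t->chain_key, id);
	if (!validate_ok)
		return 0;        /* depth unchanged: key and depth disagree */
	t->held[t->depth++] = id;
	return 1;
}

/* Patched ordering: the cached key is updated only after validation. */
static int acquire_store_after(struct task_model *t, unsigned int id,
			       int validate_ok)
{
	uint64_t key = mix_key(t->chain_key, id);

	assert(t->depth < MAX_DEPTH);
	if (!validate_ok)
		return 0;        /* nothing was modified */
	t->held[t->depth++] = id;
	t->chain_key = key;
	return 1;
}

/* One successful acquire followed by one failed validation. */
static int consistent_after_failure(int patched)
{
	struct task_model t = { 0, { 0 }, 0 };

	if (patched) {
		acquire_store_after(&t, 3, 1);
		acquire_store_after(&t, 5, 0);
	} else {
		acquire_store_first(&t, 3, 1);
		acquire_store_first(&t, 5, 0);
	}
	return key_is_consistent(&t);
}
```

In this model `consistent_after_failure(0)` reports a mismatch between the cached key and the held-lock stack, while `consistent_after_failure(1)` does not. A stale key of this kind is what would make later cache lookups miss.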
RE: SLUB performance regression vs SLAB
> On 10/04/2007 07:39 PM, David Schwartz wrote:

> > But this is just a preposterous position to put him in. If there's no
> > reproducible test case, then why should he care that one program he
> > can't even see works badly? If you care, you fix it.

> People have been trying for years to make reproducible test cases
> for huge and complex workloads. It doesn't work. The tests that do
> work take weeks to run and need to be carefully validated before
> they can be officially released. The open source community can and
> should be working on similar tests, but they will never be simple.

That's true, but irrelevant. Either the test can identify a problem that applies generally, or it's doing nothing but measuring how good the system is at doing the test. If the former, it should be possible to create a simple test case once you know from the complex test where the problem is. If the latter, who cares about a supposed regression?

It should be possible to identify exactly what portion of the test shows the regression the most and exactly what the system is doing during that moment. The test may be great at finding regressions, but once it finds them, they should be forever *found*.

Did you follow the recent incident when iperf found what seemed to be a significant CFS networking regression? The only way to identify that it was a quirk in what iperf was doing was by looking at exactly what iperf was doing. The only efficient way was to look at iperf's source and see that iperf's weird yielding meant it didn't replicate typical use cases like it was supposed to.

DS
Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown
Rik van Riel wrote:
> Either of these two would work. Another alternative could be to
> let test_and_clear_pte_flags have an exception table entry, where
> we jump right to the next instruction if the instruction clearing
> the flag fails.
>
> That is essentially the variant you need for Xen, except the fast
> path is still exactly the same as it is when running on native
> hardware.
>

Hm, that wouldn't end up clearing the bit. You'd need a Xen-specific exception handler to do that, which would turn the whole thing into Xen-specific code, and you're back at adding a pv-op.

J
Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown
Andrew Morton wrote: > y'know, I think I think it's been several years since I saw a report of an > honest to goodness, genuine SMP race in core kernel. We used to be > infested by them, but the term has fallen into disuse. Interesting, but > OT. > I was a bit surprised to find myself typing it too. I guess it could also be a preempt race, which has been a bit more common. Anyway, its a deliberately unlocked access to the pagetable structure, so not terribly surprising. >> It seems to me that there are a few ways to fix this: >> >>1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled. This >> will clearly work, but is pretty blunt. >>2. Make test_and_clear_pte_flags a new paravirt-op, which can be >> implemented in Xen as a hypercall, and as a raw test_and_clear_bit >> for everyone else. The downside is adding yet another pv-op. >>3. Restructure the pagetable setup code so that the mm is not added >> to the prio tree until after arch_dup_mmap has been called (and >> the converse for exit_mmap). This is arguably cleaner, but I >> haven't looked to see how much trouble this would be. >> >> Thoughts anyone? Does making the pagetables visible "early" cause >> problems for anyone else? >> > > I expect that 2) has the maximum niceness*suitable-for-2.6.23 product. > OK, I'll whip a patch together. > That's if you actually care much about kernel.org major releases - do many > people run kernel.org kernels on Xen? Well, given that there hasn't been a Xen-capable kernel.org release yet, no... But we'll see what happens when .23 goes out the door. > If "not many" then we could perhaps > do something more elaborate for 2.6.23.1. But adding ever more pvops as > core kernel evolves was always expected. > I think keep it simple for now; anything significant can wait for the brave new world of unified x86. 
J
Re: Accessing 64-bit BARs
Hello,

Thanks Rolf & Roland.

pci_iomap() does not do anything extra; it is only an abstraction that covers both I/O-mapped and memory-mapped BARs. I know that my BARs are MMIO, so the ioremap() and readl()/writel() combination should be fine.

But for the problem explained in my first mail, any help or suggestions would be welcome.

-Yogeshwar

On 10/4/07, Roland Dreier <[EMAIL PROTECTED]> wrote:
> > You should use pci_iomap() to get an access pointer to the BAR. After this
> > you can access the memory with ioread*() and iowrite*(). See
> > "man pci_iomap(9)" if you build kernel manpages.
>
> That works fine, but ioremap() and readl()/writel() is also perfectly
> fine for regions that you know are always MMIO.
>
>  - R.
Re: [BUG] kernel BUG at arch/i386/mm/highmem.c:15! on 2.6.23-rc8/rc9
gurudas pai wrote: Hugh Dickins wrote: On Thu, 4 Oct 2007, gurudas pai wrote: Nick Piggin wrote: While running Oracle database test on x86/6GB RAM machine panics with following messages. Hmm, seems like something in sys_remap_file_pages might have broken. It's a bit hard to work out from the backtrace, though. Is it possible you can strace to find the arguments for the remap_file_pages that goes wrong? Ahh, I think it's just underflowing the preempt count somewhere, which is leading highmem.c:15 to just *think* it is in an interrupt. But you aren't running a preemptible kernel, which makes it unusual... it would have to be coming from interrupt code (or just random corruption). Still, preempt debugging should catch those cases as well. So, can you disregard my last message, and instead compile a kernel with CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT, and see what messages come up? With CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT set I got following messages on rc9. BUG: using smp_processor_id() in preemptible [0001] code: oracle/3631 caller is kunmap_atomic+0xb/0x82 [] debug_smp_processor_id+0xa1/0xb4 [] kunmap_atomic+0xb/0x82 [] __do_fault+0x55/0x35b [] handle_mm_fault+0x4d0/0x909 [] follow_page+0x1d9/0x228 [] get_user_pages+0x250/0x332 [] make_pages_present+0x7b/0x90 [] sys_remap_file_pages+0x2de/0x330 [] syscall_call+0x7/0xb [] ioctl_standard_call+0x209/0x2ce Very helpful, thanks. Guru, please try the appended patch, I think you'll find it fixes it for you (it did for me, once I'd puzzled out why I was failing to reproduce the problem - tests on ext3 don't work). Thank you so much for reporting this just in time! [PATCH] fix sys_remap_file_pages BUG at highmem.c:15! Gurudas Pai reports kernel BUG at arch/i386/mm/highmem.c:15! below sys_remap_file_pages, while running Oracle database test on x86 in 6GB RAM: kunmap thinks we're in_interrupt because the preempt count has wrapped. 
That's because __do_fault expected to unmap page_table, but one of its two callers do_nonlinear_fault already unmapped it: let do_linear_fault unmap it first too, and then there's no need to pass the page_table arg down. Why have we been so slow to notice this? Probably through forgetting that the mapping_cap_account_dirty test means that sys_remap_file_pages nowadays only goes the full nonlinear vma route on a few memory-backed filesystems like ramfs, tmpfs and hugetlbfs. Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> --- 2.6.23-rc9/mm/memory.c2007-07-26 19:49:58.0 +0100 +++ linux/mm/memory.c2007-10-04 15:42:20.0 +0100 @@ -2307,13 +2307,14 @@ oom: * do not need to flush old virtual caches or the TLB. * * We enter with non-exclusive mmap_sem (to exclude vma changes, - * but allow concurrent faults), and pte mapped but not yet locked. + * but allow concurrent faults), and pte neither mapped nor locked. * We return with mmap_sem still held, but pte unmapped and unlocked. */ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, -unsigned long address, pte_t *page_table, pmd_t *pmd, +unsigned long address, pmd_t *pmd, pgoff_t pgoff, unsigned int flags, pte_t orig_pte) { +pte_t *page_table; spinlock_t *ptl; struct page *page; pte_t entry; @@ -2327,7 +2328,6 @@ static int __do_fault(struct mm_struct * vmf.flags = flags; vmf.page = NULL; -pte_unmap(page_table); BUG_ON(vma->vm_flags & VM_PFNMAP); if (likely(vma->vm_ops->fault)) { @@ -2468,8 +2468,8 @@ static int do_linear_fault(struct mm_str - vma->vm_start) >> PAGE_CACHE_SHIFT) + vma->vm_pgoff; unsigned int flags = (write_access ? 
FAULT_FLAG_WRITE : 0); -return __do_fault(mm, vma, address, page_table, pmd, pgoff, -flags, orig_pte); +pte_unmap(page_table); +return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte); } @@ -2552,9 +2552,7 @@ static int do_nonlinear_fault(struct mm_ } pgoff = pte_to_pgoff(orig_pte); - -return __do_fault(mm, vma, address, page_table, pmd, pgoff, -flags, orig_pte); +return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte); } /* Yes, indeed this patch worked for me , test completed successfully!! (on preempt kernel). Will continue testing with non-preempt kernel and update you if I hit any issue. Completed testing on non-preempt successfully without any issue. Thanks, -Guru - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
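The failure mode described in the patch (an extra unmap wrapping the preempt count so that the kernel believes it is in interrupt context) can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the counter, the helper names and the bit masks below are illustrative and are not taken from the kernel headers.

```c
#include <assert.h>

/* Illustrative layout: low 8 bits count preemption disables,
 * the bits above them stand for softirq/hardirq nesting. */
#define MODEL_PREEMPT_MASK 0x000000ffu
#define MODEL_IRQ_MASK     0x0fffff00u

/* Unsigned so that the wrap-around is well defined in C. */
static unsigned int preempt_count_model;

static void map_atomic_model(void)   { preempt_count_model++; }
static void unmap_atomic_model(void) { preempt_count_model--; }

/* Mimics an in_interrupt()-style test on the irq bits of the counter. */
static int in_interrupt_model(void)
{
	return (preempt_count_model & MODEL_IRQ_MASK) != 0;
}

/* One map followed by `unmaps` unmaps; returns the in-interrupt verdict. */
static int verdict_after_unmaps(int unmaps)
{
	assert(unmaps >= 0);
	preempt_count_model = 0;
	map_atomic_model();
	while (unmaps-- > 0)
		unmap_atomic_model();
	return in_interrupt_model();
}
```

With a balanced map and unmap the model reports "not in interrupt". With the second, unbalanced unmap the counter wraps to all ones, the irq bits appear set, and the check fires even though no interrupt is involved.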
Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
On Thu, Oct 04, 2007 at 03:03:44PM +1000, David Chinner wrote: > On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote: > > On Wed, Oct 03, 2007 at 12:41:19PM +1000, David Chinner wrote: > > > On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote: > > > > On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote: > > > > > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote: > > > > > > wbc.pages_skipped = 0; > > > > > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned > > > > > > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write; > > > > > > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { > > > > > > /* Wrote less than expected */ > > > > > > - congestion_wait(WRITE, HZ/10); > > > > > > - if (!wbc.encountered_congestion) > > > > > > + if (wbc.encountered_congestion || wbc.more_io) > > > > > > + congestion_wait(WRITE, HZ/10); > > > > > > + else > > > > > > break; > > > > > > } > > > > > > > > > > Why do you call congestion_wait() if there is more I/O to issue? If > > > > > we have a fast filesystem, this might cause the device queues to > > > > > fill, then drain on congestion_wait(), then fill again, etc. i.e. we > > > > > will have trouble keeping the queues full, right? > > > > > > > > You mean slow writers and fast RAID? That would be exactly the case > > > > these patches try to improve. > > > > > > I mean any writers and a fast block device (raid or otherwise). > > > > > > > This patchset makes kupdate/background writeback more responsible, > > > > so that if (avg-write-speed < device-capabilities), the dirty data are > > > > synced timely, and we don't have to go for balance_dirty_pages(). > > > > > > Sure, but I'm asking about the effect of the patches on the > > > (avg-write-speed == device-capabilities) case. I agree that > > > they are necessary for timely syncing of data but I'm trying > > > to understand what effect they have on the normal write case > > > > > (i.e. keeping the disk at full write throughput). 
> > > > OK, I guess it is the focus of all your questions: Why should we sleep > > in congestion_wait() and possibly hurt the write throughput? I'll try > > to summary it: > > > > - congestion_wait() is necessary > > Besides device congestions, there may be other blockades we have to > > wait on, e.g. temporary page locks, NFS/journal issues(I guess). > > We skip locked pages in writeback, and if some filesystems have > blocking issues that require non-blocking writeback waits for some > I/O to complete before re-entering writeback, then perhaps they should be > setting wbc->encountered_congestion to tell writeback to back off. We have wbc->pages_skipped for that :-) > The question I'm asking is that if more_io tells us we have more > work to do, why do we have to sleep first if the block dev is > able to take more I/O? See below. > > > > - congestion_wait() is called only when necessary > > congestion_wait() will only be called we saw blockades: > > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { > > congestion_wait(WRITE, HZ/10); > > } > > So in normal case, it may well write 128MB data without any waiting. > > Sure, but wbc.more_io doesn't indicate a blockade - just that there > is more work to do, right? It's not wbc.more_io, but the context(wbc.pages_skipped > 0) indicates a blockade: if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {/* all-written or blockade... */ if (wbc.encountered_congestion || wbc.more_io) /* blockade! */ congestion_wait(WRITE, HZ/10); else /* all-written! */ break; } We can also read the whole background_writeout() logic as while (!done) { /* sync _all_ sync-able data */ congestion_wait(100ms); } And an example run could be: sync 1000MB, skipped 100MB congestion_wait(100ms); sync 100MB, skipped 10MB congestion_wait(100ms); sync 10MB, all done Note that it's far from "wait 100ms for every 4MB" (which is merely the worst possible case). 
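The example run above can be turned into a small counting sketch. It is a simplified model, not background_writeout() itself: `count_waits`, `worst_case_waits` and `example_waits` are hypothetical helpers, the per-pass skipped amounts are the ones from the example (100MB, 10MB, 0MB), and each counted wait stands in for one congestion_wait(WRITE, HZ/10) call.

```c
#include <assert.h>

/* Count the waits needed when each pass writes everything except
 * skipped_mb[i], and a wait happens only if something was skipped. */
static int count_waits(int dirty_mb, const int *skipped_mb, int passes)
{
	int waits = 0;
	int i;

	assert(dirty_mb >= 0 && passes >= 0);
	for (i = 0; i < passes && dirty_mb > 0; i++) {
		int skipped = skipped_mb[i];

		if (skipped > dirty_mb)
			skipped = dirty_mb;
		dirty_mb = skipped;   /* everything not skipped was written */
		if (dirty_mb > 0)
			waits++;      /* stands in for congestion_wait() */
	}
	return waits;
}

/* Worst case mentioned above: one wait for every chunk written. */
static int worst_case_waits(int dirty_mb, int chunk_mb)
{
	assert(chunk_mb > 0);
	return (dirty_mb + chunk_mb - 1) / chunk_mb;
}

/* The 1000MB / 100MB / 10MB run from the example. */
static int example_waits(void)
{
	static const int skipped[] = { 100, 10, 0 };

	return count_waits(1000, skipped, 3);
}
```

Under these assumptions the example run sleeps twice (about 200ms in total), whereas waiting once per 4MB chunk of the same 1000MB would mean 250 waits.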
> > - congestion_wait() won't hurt write throughput > > When not congested, congestion_wait() will be wake up on each write > > completion. > > What happens if the I/O we issued has already completed before we > got back up to the congestion_wait() call? We'll spend 100ms > sleeping when we shouldn't have and throughput goes down by 10% on > every occurrence Ah, that was out of my imagination. Maybe we could do with if (wbc.more_io) congestion_wait(WRITE, 1); It's at least 10 times better. > if we've got more work to do, then we should do it without an > arbitrary, non-deterministic delay being inserted. If the delay is > needed to prevent he system from "going mad" (whatever tht
Re: [PATCH] RCU torture update for preemption
On Wed, Oct 03, 2007 at 04:59:51PM -0400, Steven Rostedt wrote: > Paul, > > I ran your original preemption test of RCU torture, and after several > minutes, my preempt boost patch had one Preemption stall. I then > disabled preemption boosting, and ran the preempt torture again, and it > seemed to never stall. Something seemed strange, so I took a look. > > Looks like you have a single thread that will run at max prio that runs > for 10 secs and then sleeps again. This thread seems to only push rcu > readers around. But it doesn't seem to do much else. That is a good test > to see if RCU readers can handle being pushed around, but it doesn't > test preemption boosting. Looks like I shot myself in the foot by complaining about a bug... :-/ http://lkml.org/lkml/2007/6/10/234 With the bug, the readers weren't migrating, without it, they do. Good catch!!! Thank you!!! > To do that, I modified the test to create CPUS-1 preempt boost hogs (or > 1 if it is UP). But instead of putting it at max prio, I set it to the > lowest RT prio of 1. This way it's still at a higher priority than the > readers. I also switched the writers to run at 1+n where n increases for > every fake writer there is. > > Without preempt boosting, after a couple of minutes I had 83 preemption > stalls. When I turned my boosting back on, after several minutes (still > running as I type this) it has no preemption stalls. > > This seems to be a good test for RCU preemption boosting. I am testing it out against my earlier patchset, with some encouraging results -- I will incorporate into the next round of my mainline patchset. Some questions and comments below. > -- Steve > > PS. 
I got rid of your rcu_preeempt_task for rcu_preempt_tasks ;-) > > (No the above is _not_ a typo) :-/ > Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> > > Index: linux-2.6.23-rc9-rt1/kernel/rcutorture.c > === > --- linux-2.6.23-rc9-rt1.orig/kernel/rcutorture.c > +++ linux-2.6.23-rc9-rt1/kernel/rcutorture.c > @@ -54,6 +54,7 @@ MODULE_AUTHOR("Paul E. McKenney > static int nreaders = -1;/* # reader threads, defaults to 2*ncpus */ > static int nfakewriters = 4; /* # fake writer threads */ > +static int npreempthogs = -1;/* # preempt hogs to run (defaults to > ncpus-1) or 1 */ > static int stat_interval;/* Interval between stats, in seconds. */ > /* Defaults to "only at end of test". */ > static int verbose; /* Print more debug info. */ > @@ -90,9 +91,11 @@ MODULE_PARM_DESC(torture_type, "Type of > static char printk_buf[4096]; > > static int nrealreaders; > +static int nrealpreempthogs; I made the above be a module parameter. This OK? > static struct task_struct *writer_task; > static struct task_struct **fakewriter_tasks; > static struct task_struct **reader_tasks; > +static struct task_struct **rcu_preempt_tasks; > static struct task_struct *stats_task; > static struct task_struct *shuffler_task; > > @@ -264,7 +267,6 @@ static void rcu_torture_deferred_free(st > call_rcu(>rtort_rcu, rcu_torture_cb); > } > > -static struct task_struct *rcu_preeempt_task; > static unsigned long rcu_torture_preempt_errors; > > static int rcu_torture_preempt(void *arg) > @@ -274,7 +276,7 @@ static int rcu_torture_preempt(void *arg > time_t gcstart; > struct sched_param sp; > > - sp.sched_priority = MAX_RT_PRIO - 1; > + sp.sched_priority = 1; > err = sched_setscheduler(current, SCHED_RR, ); > if (err != 0) > printk(KERN_ALERT "rcu_torture_preempt() priority err: %d\n", > @@ -297,24 +299,43 @@ static int rcu_torture_preempt(void *arg > static long rcu_preempt_start(void) > { > long retval = 0; > + int i; > > - rcu_preeempt_task = kthread_run(rcu_torture_preempt, NULL, > - 
"rcu_torture_preempt"); > - if (IS_ERR(rcu_preeempt_task)) { > - VERBOSE_PRINTK_ERRSTRING("Failed to create preempter"); > - retval = PTR_ERR(rcu_preeempt_task); > - rcu_preeempt_task = NULL; > + rcu_preempt_tasks = kzalloc(nrealpreempthogs * > sizeof(rcu_preempt_tasks[0]), > + GFP_KERNEL); > + if (rcu_preempt_tasks == NULL) { > + VERBOSE_PRINTK_ERRSTRING("out of memory"); > + retval = -ENOMEM; > + goto out; > } > + > + for (i=0; i < nrealpreempthogs; i++) { > + rcu_preempt_tasks[i] = kthread_run(rcu_torture_preempt, NULL, > + "rcu_torture_preempt"); > + if (IS_ERR(rcu_preempt_tasks[i])) { > + VERBOSE_PRINTK_ERRSTRING("Failed to create preempter"); > + retval = PTR_ERR(rcu_preempt_tasks[i]); > + rcu_preempt_tasks[i] = NULL; > + break; > + } > + } > + out: >
Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES
On Thu, Oct 04, 2007 at 05:12:11PM -0700, Linus Torvalds wrote: > I also tested that "ulimit -s" seems to do the right thing for me. > > I'm also assuming Mathieu is running x86 (or x86-64): HP-PA has a stack > that grows upwards, and that has traditionally been exciting. Correct, x86 it is but as I said it's this stupid auditd thing that breaks the whole process. I'm gonna file a bug against it. Thanks for the help though. -- Mathieu Chouquet-Stringer [EMAIL PROTECTED] The sun itself sees not till heaven clears. -- William Shakespeare -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch take 2][Intel-IOMMU] Fix for IOMMU early crash
> Subject: [Intel-IOMMU] Fix for IOMMU early crash
>
> pci_dev's->sysdata is highly overloaded and currently
> IOMMU is broken due to IOMMU code depending on this field.
>
> This patch introduces new field in pci_dev's dev.archdata struct to
> hold IOMMU specific per device IOMMU private data.
>
> Signed-off-by: Anil S Keshavamurthy <[EMAIL PROTECTED]>

Looks good. Won't break powerpc.

Acked-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>

> ---
>  drivers/pci/intel-iommu.c   |   22 +++---
>  include/asm-x86_64/device.h |    3 +++
>  2 files changed, 14 insertions(+), 11 deletions(-)
>
> Index: 2.6-mm/drivers/pci/intel-iommu.c
> ===
> --- 2.6-mm.orig/drivers/pci/intel-iommu.c 2007-10-04 11:35:09.0 -0700
> +++ 2.6-mm/drivers/pci/intel-iommu.c 2007-10-04 11:47:47.0 -0700
> @@ -1348,7 +1348,7 @@
>          list_del(&info->link);
>          list_del(&info->global);
>          if (info->dev)
> -            info->dev->sysdata = NULL;
> +            info->dev->dev.archdata.iommu = NULL;
>          spin_unlock_irqrestore(&device_domain_lock, flags);
>
>          detach_domain_for_dev(info->domain, info->bus, info->devfn);
> @@ -1361,7 +1361,7 @@
>
>  /*
>   * find_domain
> - * Note: we use struct pci_dev->sysdata stores the info
> + * Note: we use struct pci_dev->dev.archdata.iommu stores the info
>   */
>  struct dmar_domain *
>  find_domain(struct pci_dev *pdev)
> @@ -1369,7 +1369,7 @@
>      struct device_domain_info *info;
>
>      /* No lock here, assumes no domain exit in normal case */
> -    info = pdev->sysdata;
> +    info = pdev->dev.archdata.iommu;
>      if (info)
>          return info->domain;
>      return NULL;
> @@ -1519,7 +1519,7 @@
>      }
>      list_add(&info->link, &domain->devices);
>      list_add(&info->global, &device_domain_list);
> -    pdev->sysdata = info;
> +    pdev->dev.archdata.iommu = info;
>      spin_unlock_irqrestore(&device_domain_lock, flags);
>      return domain;
>  error:
> @@ -1579,7 +1579,7 @@
>  static inline int iommu_prepare_rmrr_dev(struct dmar_rmrr_unit *rmrr,
>      struct pci_dev *pdev)
>  {
> -    if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
> +    if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
>          return 0;
>      return iommu_prepare_identity_map(pdev, rmrr->base_address,
>          rmrr->end_address + 1);
> @@ -1595,7 +1595,7 @@
>      int ret;
>
>      for_each_pci_dev(pdev) {
> -        if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO ||
> +        if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO ||
>                  !IS_GFX_DEVICE(pdev))
>              continue;
>          printk(KERN_INFO "IOMMU: gfx device %s 1-1 mapping\n",
> @@ -1836,7 +1836,7 @@
>      int prot = 0;
>
>      BUG_ON(dir == DMA_NONE);
> -    if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
> +    if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
>          return virt_to_bus(addr);
>
>      domain = get_valid_domain_for_dev(pdev);
> @@ -1900,7 +1900,7 @@
>      unsigned long start_addr;
>      struct iova *iova;
>
> -    if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
> +    if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
>          return;
>      domain = find_domain(pdev);
>      BUG_ON(!domain);
> @@ -1974,7 +1974,7 @@
>      size_t size = 0;
>      void *addr;
>
> -    if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
> +    if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
>          return;
>
>      domain = find_domain(pdev);
> @@ -2032,7 +2032,7 @@
>      unsigned long start_addr;
>
>      BUG_ON(dir == DMA_NONE);
> -    if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
> +    if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
>          return intel_nontranslate_map_sg(hwdev, sg, nelems, dir);
>
>      domain = get_valid_domain_for_dev(pdev);
> @@ -2234,7 +2234,7 @@
>              for (i = 0; i < drhd->devices_cnt; i++) {
>                  if (!drhd->devices[i])
>                      continue;
> -                drhd->devices[i]->sysdata = DUMMY_DEVICE_DOMAIN_INFO;
> +                drhd->devices[i]->dev.archdata.iommu = DUMMY_DEVICE_DOMAIN_INFO;
>              }
>          }
>      }
> Index: 2.6-mm/include/asm-x86_64/device.h
> ===
> --- 2.6-mm.orig/include/asm-x86_64/device.h 2007-10-04 11:35:09.0 -0700
> +++ 2.6-mm/include/asm-x86_64/device.h 2007-10-04 11:49:44.0 -0700
> @@ -10,6 +10,9 @@
>  #ifdef CONFIG_ACPI
>      void *acpi_handle;
>  #endif
> +#ifdef CONFIG_DMAR
> +    void *iommu; /* hook for IOMMU specific extension */
> +#endif
>  };
>
>  #endif /* _ASM_X86_64_DEVICE_H */
Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)
Hugh Dickins wrote:
> On Thu, 4 Oct 2007, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> Well, swap control is another subject. I guess for that you'll need
>>> to track which cgroup each swap page belongs to (rather more expensive
>>> than the current swap_map of unsigned shorts). And I doubt it'll be
>>> swap control as such that's required, but control of rss+swap.
>> I see what you mean now, other people have recommended a per cgroup
>> swap file/device.
>
> Sounds too inflexible, and too many swap areas to me. Perhaps the
> right answer will fall in between: assign clusters of swap pages to
> different cgroups as needed. But worry about that some other time.
>

Yes, depending on the number of cgroups, we'll need to share swap areas between them. That requires more work and thought.

>>> But here I'm just worrying about how the existence of swap makes
>>> something of a nonsense of your rss control.
>>>
>> Ideally, pages would not reside for too long in swap cache (unless
>
> Thinking particularly of those brought in by swapoff or swap readahead:
> some will get attached to mms once accessed, others will simply get
> freed when tasks exit or munmap, others will hang around until they
> reach the bottom of the LRU and are reclaimed again by memory pressure.
>
> But as your code stands, that'll be total memory pressure: in-cgroup
> memory pressure will tend to miss them, since typically they're
> assigned to the wrong cgroup; until then their presence is liable
> to cause other pages to be reclaimed which ideally should not be.
>

In-cgroup pressure will not affect them, since they are in different cgroups. If there is pressure in the cgroup to which they are wrongly assigned, they would get reclaimed first.

>> I've misunderstood swap cache or there are special cases for tmpfs/
>> ramfs).
>
> ramfs pages are always in RAM, never go out to swap, no need to
> worry about them in this regard. But tmpfs pages can indeed go
> out to swap, so whatever we come up with needs to make sense
> with them too, yes. I don't think its swapoff/readahead issues
> are any harder to handle than the anonymous mapped page case,
> but it will need its own code to handle them.
>
>> Once pages have been swapped back in, they get assigned
>> back to their respective cgroup's in do_swap_page() (where we charge
>> them back to the cgroup).
>>
>
> That's where it should happen, yes; but my point is that it very
> often does not. Because the swap cache page (read in as part of
> the readaround cluster of some other cgroup, or in swapoff by some
> other cgroup) is already assigned to that other cgroup (by the
> mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
> page_cgroup exists and the page has already been accounted" route
> when mem_cgroup_charge is called from do_swap_page. Doesn't it?
>

You are right. At this point I am beginning to wonder whether I should account for the swap cache at all. We account for the pages in RSS and when the page comes back into the page table(s) via do_swap_page. If we believe that the swap cache is transitional and that the current behaviour is either wrong or hard to fix, it might be easy to ignore unuse_pte() and add/remove_from_swap_cache() for accounting and control.

The current behaviour of the memory controller is, as you point out, that several pages get accounted to the cgroup that initiates swapin readahead or swapoff. On pressure in that cgroup (the one that initiated swapin or swapoff), the cgroup would discard these pages first. These pages are discarded from the cgroup, but still live on the global LRU.

When the original cgroup is under pressure, these pages might not be affected, as they belong to a different cgroup, which might not be under any sort of pressure.

> Are we misunderstanding each other, because I'm assuming
> MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
> though I can't see that _MAPPED and _CACHED are actually supported,
> there being no reference to them outside the enum that defines them.
>

I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our discussion. The accounting is split into mem_cgroup_charge() and mem_cgroup_cache_charge(). The control_type is checked only when charging the caches.

> Or are you deceived by that ifdef NUMA code in swapin_readahead,
> which propagates the fantasy that swap allocation follows vma layout?
> That nonsense has been around too long, I'll soon be sending a patch
> to remove it.
>

The swapin readahead code under #ifdef NUMA is very confusing. I also noticed another confusing thing during my test: swap cache does not drop to 0, even though I've disabled all swap using swapoff. Maybe those are tmpfs pages. The other interesting thing I tried was running swapoff after a cgroup went over its limit; the swapoff succeeded, but I see strange numbers for free swap. I'll start another thread after investigating a bit more.

>> The swap
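The "already been accounted" route discussed in this message can be pictured with a small userspace model. This is only a sketch under stated assumptions: struct model_cgroup, struct model_page and model_charge() are hypothetical names invented for illustration, not the memory controller's real structures or functions, and the real code also takes locks and reference counts that are left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a cgroup and a page; not the kernel's structures. */
struct model_cgroup {
	const char *name;
	long charged_pages;
};

struct model_page {
	struct model_cgroup *owner;	/* NULL until the page is first charged */
};

/*
 * First-touch charging: whoever charges the page first keeps it.
 * Returns 1 if this call charged the page, 0 if it was already accounted.
 */
static int model_charge(struct model_page *page, struct model_cgroup *cg)
{
	if (page->owner)
		return 0;	/* the "already been accounted" route */
	page->owner = cg;
	cg->charged_pages++;
	return 1;
}

/* Cgroup A reads the page in via readahead; cgroup B later faults on it. */
static void model_charge_demo(void)
{
	struct model_cgroup a = { "A", 0 }, b = { "B", 0 };
	struct model_page page = { NULL };

	assert(model_charge(&page, &a) == 1);	/* swap cache charge by A */
	assert(model_charge(&page, &b) == 0);	/* B's fault finds it accounted */
	assert(page.owner == &a && b.charged_pages == 0);
}
```

In this model the page stays charged to A even though B is the cgroup that actually uses it, which is the misattribution Hugh describes.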
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
On Oct 04, 2007, at 21:44:02, Eric W. Biederman wrote:
> What we want from the LSM is the ability to say -EPERM when we can
> clearly articulate that we want to disallow something.

This sort of depends on perspective; typically with security infrastructure you actually want "... the ability to return success when we can clearly articulate that we want to *ALLOW* something". File permissions work this way; we don't have a list of forbidden users attached to each file, we have an owner, a group, and a mode representing positive permissions. With that said in certain high-risk environments you need something even stronger that cannot be changed by the "owner" of the file, if we don't entirely trust them,

> SElinux is not all encompassing or it is generally incomprehensible
> I don't know which. Or someone long ago would have said a better way
> to implement containers was with a selinux ruleset, here is a selinux
> ruleset that does that. Although it is completely possible to
> implement all of the isolation with the existing LSM hooks as Serge
> showed.

The difference between SELinux and containers is that SELinux (and LSM as a whole) returns -EPERM to operations outside the scope of the subject, whereas containers return -ENOENT (because it's not even in the same namespace).

> We also have in the kernel another parallel security mechanism (for
> what is generally a different class of operations) that has been
> quite successful, and different groups get along quite well, and
> ordinary mortals can understand it. The linux firewalling code.

Well, I wouldn't go so far as the "ordinary mortals can understand it" part; it's still pretty high on the obtuse-o-meter.

> The linux firewalling code has hooks all throughout the networking
> stack, just like the LSM has hooks all throughout the rest of linux
> kernel. There is a difference however. The linux firewalling code
> in addition to hooks has tables behind those hooks that it consults.
> There is generic code to walk those tables and consult with different
> kernel modules to decide if we should drop a packet. Each of those
> kernel modules provides a different capability that can be used to
> generate a firewall.

This is almost *EXACTLY* what SELinux provides as an LSM module. The one difference is that with SELinux some compromises and restrictions have been made so that (theoretically) the resulting policy can be exhaustively analyzed to *prove* what it allows and disallows. It may be that SELinux should be split into 2 parts, one that provides the underlying table-matching and the other that uses it to provide the provability guarantees.

Here's a direct comparison:

netfilter:
  (A) Each packet has src, dst, port, etc that can be matched
  (B) Table of rules applied sequentially (MATCH => ACTION)
  (C) Rules may alter the properties of packets as they are routed/bridged/etc

selinux:
  (A) Each object has user, role, and type that can be matched
  (B) Table of rules searched by object parameters (MATCH => allow/auditallow/transition)
  (C) Rules may alter the properties of objects through transition rules.

If there are areas where people are confused about SELinux, think it may be improved, etc, we would be *GLAD* to hear it. I'm currently struggling to find the time between a hundred other things to finish a script I offered to Casey Schaufler a month and a half ago which generated an SELinux policy based on a SMACK ruleset.

> So I propose that if people want to work towards a one true linux
> solution for additional security checks, then they should look
> towards the linux firewalling code. It works and it seems to very
> nicely allow cooperation between different groups. For the people
> who will scream mixing security models causes problems, the answer
> is simple recommend users don't set up their policies that way.

Actually the one thing which really frustrates me about the Linux firewalling code is that you cannot selectively apply various transformation phases, they are automatically applied for you. I have had a couple very-transparent-routing-firewalling-bridging scenarios where I wished I could run the bridging phase, compare-and-change the result, and then run the bridging phase again to forward the packet elsewhere. For example if I was to set up a diverted ethernet port I would need to apply the bridging code, compare the destination port against the selected diverted port and change the MAC address, then reapply the bridging code. To mirror you would also need a phase which could create multiple clones of packets and conditionalize rules based on which of the copies it was.

> I'm not yet annoyed enough to go implement an iptables like interface
> to the LSM enhancing it with more generic mechanism to make the
> problem simpler, but I'm getting there. Perhaps next time I'm bored.

I think a fair amount of what we need is already
Re: SLUB performance regression vs SLAB
On Thu, 4 Oct 2007 19:43:58 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:

> So there could still be page struct contention left if multiple
> processors frequently and simultaneously free to the same slab and
> that slab is not the per cpu slab of a cpu. That could be addressed
> by optimizing the object free handling further to not touch the page
> struct even if we miss the per cpu slab.
>
> That get_partial* is far up indicates contention on the list lock
> that should be addressable by either increasing the slab size or by
> changing the object free handling to batch in some form.
>
> This is an SMP system right? 2 cores with 4 cpus each? The main loop
> is always hitting on the same slabs? Which slabs would this be? Am I
> right in thinking that one process allocates objects and then lets
> multiple other processors do work and then the allocated object is
> freed from a cpu that did not allocate the object? If neighboring
> objects in one slab are allocated on one cpu and then are almost
> simultaneously freed from a set of different cpus then this may
> explain the situation.

One of the characteristics of the application in use is the following: all cores submit IO (which means they allocate various scsi and block structures on all cpus), but only one will free it (the one the IRQ is bound to). So it's allocate-on-one-free-on-another at a high rate.

That is assuming this is the IO slab; that's a bit of an assumption obviously (it's one of the slab things that are hot, but it's a complex workload, there could be others).
Re: [BUG] kernel BUG at arch/i386/mm/highmem.c:15! on 2.6.23-rc8/rc9
Hugh Dickins wrote:
> On Thu, 4 Oct 2007, gurudas pai wrote:
>> Nick Piggin wrote:
>>>>> While running Oracle database test on x86/6GB RAM machine panics
>>>>> with following messages.
>>>> Hmm, seems like something in sys_remap_file_pages might have broken.
>>>> It's a bit hard to work out from the backtrace, though. Is it possible
>>>> you can strace to find the arguments for the remap_file_pages that
>>>> goes wrong?
>>> Ahh, I think it's just underflowing the preempt count somewhere, which
>>> is leading highmem.c:15 to just *think* it is in an interrupt. But you
>>> aren't running a preemptible kernel, which makes it unusual... it would
>>> have to be coming from interrupt code (or just random corruption).
>>> Still, preempt debugging should catch those cases as well. So, can you
>>> disregard my last message, and instead compile a kernel with
>>> CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT, and see what messages come up?
>> With CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT set I got following
>> messages on rc9.
>>
>> BUG: using smp_processor_id() in preemptible [0001] code: oracle/3631
>> caller is kunmap_atomic+0xb/0x82
>> [] debug_smp_processor_id+0xa1/0xb4
>> [] kunmap_atomic+0xb/0x82
>> [] __do_fault+0x55/0x35b
>> [] handle_mm_fault+0x4d0/0x909
>> [] follow_page+0x1d9/0x228
>> [] get_user_pages+0x250/0x332
>> [] make_pages_present+0x7b/0x90
>> [] sys_remap_file_pages+0x2de/0x330
>> [] syscall_call+0x7/0xb
>> [] ioctl_standard_call+0x209/0x2ce
>
> Very helpful, thanks. Guru, please try the appended patch, I think
> you'll find it fixes it for you (it did for me, once I'd puzzled out
> why I was failing to reproduce the problem - tests on ext3 don't work).
> Thank you so much for reporting this just in time!
>
> [PATCH] fix sys_remap_file_pages BUG at highmem.c:15!
>
> Gurudas Pai reports kernel BUG at arch/i386/mm/highmem.c:15! below
> sys_remap_file_pages, while running Oracle database test on x86 in 6GB
> RAM: kunmap thinks we're in_interrupt because the preempt count has
> wrapped.
>
> That's because __do_fault expected to unmap page_table, but one of its
> two callers do_nonlinear_fault already unmapped it: let do_linear_fault
> unmap it first too, and then there's no need to pass the page_table arg
> down.
>
> Why have we been so slow to notice this? Probably through forgetting
> that the mapping_cap_account_dirty test means that sys_remap_file_pages
> nowadays only goes the full nonlinear vma route on a few memory-backed
> filesystems like ramfs, tmpfs and hugetlbfs.
>
> Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
> ---
>
> --- 2.6.23-rc9/mm/memory.c 2007-07-26 19:49:58.0 +0100
> +++ linux/mm/memory.c 2007-10-04 15:42:20.0 +0100
> @@ -2307,13 +2307,14 @@ oom:
>   * do not need to flush old virtual caches or the TLB.
>   *
>   * We enter with non-exclusive mmap_sem (to exclude vma changes,
> - * but allow concurrent faults), and pte mapped but not yet locked.
> + * but allow concurrent faults), and pte neither mapped nor locked.
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
>  static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> -        unsigned long address, pte_t *page_table, pmd_t *pmd,
> +        unsigned long address, pmd_t *pmd,
>          pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
>  {
> +    pte_t *page_table;
>      spinlock_t *ptl;
>      struct page *page;
>      pte_t entry;
> @@ -2327,7 +2328,6 @@ static int __do_fault(struct mm_struct *
>      vmf.flags = flags;
>      vmf.page = NULL;
>
> -    pte_unmap(page_table);
>      BUG_ON(vma->vm_flags & VM_PFNMAP);
>
>      if (likely(vma->vm_ops->fault)) {
> @@ -2468,8 +2468,8 @@ static int do_linear_fault(struct mm_str
>              - vma->vm_start) >> PAGE_CACHE_SHIFT) + vma->vm_pgoff;
>      unsigned int flags = (write_access ? FAULT_FLAG_WRITE : 0);
>
> -    return __do_fault(mm, vma, address, page_table, pmd, pgoff,
> -                            flags, orig_pte);
> +    pte_unmap(page_table);
> +    return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
>  }
>
> @@ -2552,9 +2552,7 @@ static int do_nonlinear_fault(struct mm_
>      }
>
>      pgoff = pte_to_pgoff(orig_pte);
> -
> -    return __do_fault(mm, vma, address, page_table, pmd, pgoff,
> -                            flags, orig_pte);
> +    return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
>  }
>
>  /*

Yes, indeed this patch worked for me, test completed successfully!! (on preempt kernel).

Will continue testing with non-preempt kernel and update you if I hit any issue.

Thank you all for your time and effort.

Regards,
-Guru
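Why a doubled pte_unmap() ends up looking like "in interrupt" can be shown with a tiny userspace model of the preempt count. This is a sketch, not kernel code: the model_* names are hypothetical, and the bit layout (8 preempt bits, 8 softirq bits, 12 hardirq bits) is an assumption based on the default layout of kernels from that period.

```c
#include <assert.h>

/* Assumed layout: bits 0-7 preempt depth, 8-15 softirq, 16-27 hardirq. */
#define MODEL_SOFTIRQ_MASK 0x0000ff00u
#define MODEL_HARDIRQ_MASK 0x0fff0000u

struct model_cpu {
	unsigned int preempt_count;
};

/* Mapping a highmem pte bumps the count; unmapping drops it again. */
static void model_pte_map(struct model_cpu *cpu)
{
	cpu->preempt_count++;
}

static void model_pte_unmap(struct model_cpu *cpu)
{
	cpu->preempt_count--;	/* unsigned arithmetic: 0 - 1 wraps to 0xffffffff */
}

static int model_in_interrupt(const struct model_cpu *cpu)
{
	return (cpu->preempt_count &
		(MODEL_SOFTIRQ_MASK | MODEL_HARDIRQ_MASK)) != 0;
}

/*
 * One fault: the pte is mapped once, then unmapped by the caller,
 * by the callee, or by both. Returns what in_interrupt() would report.
 */
static int model_fault(int caller_unmaps, int callee_unmaps)
{
	struct model_cpu cpu = { 0 };

	model_pte_map(&cpu);
	if (caller_unmaps)
		model_pte_unmap(&cpu);
	if (callee_unmaps)
		model_pte_unmap(&cpu);
	return model_in_interrupt(&cpu);
}
```

Here model_fault(1, 1) corresponds to the old nonlinear path (both do_nonlinear_fault and __do_fault unmap) and reports "in interrupt"; model_fault(0, 1) is the old linear path and model_fault(1, 0) is either path after the patch, both of which leave the count balanced.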
Re: [PATCH] isofs: add +w bit for non-RR discs
On Tue, Oct 02, 2007 at 08:00:26PM +0200, Jan Engelhardt wrote:
> Add %S_IWUGO bit for files on ISO-9660 filesystems without RockRidge

Looks to me like you've added S_IWUSR, not S_IWUGO.

> -    popt->mode = S_IRUGO | S_IXUGO; /*
> +    popt->mode = S_IRUGO | S_IWUSR | S_IXUGO;

> -    inode->i_mode = S_IRUGO | S_IXUGO | S_IFDIR;
> +    inode->i_mode = S_IRUGO | S_IWUSR | S_IXUGO | S_IFDIR;

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step."
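The difference between the two masks is easy to check in userspace. The sketch below is only an illustration: the MODEL_* macros mirror the kernel's S_IRUGO, S_IWUGO and S_IXUGO convenience masks (which userspace <sys/stat.h> does not provide), and the two function names are hypothetical.

```c
#include <assert.h>
#include <sys/stat.h>

/* Userspace equivalents of the kernel-only "UGO" convenience masks. */
#define MODEL_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#define MODEL_S_IWUGO (S_IWUSR | S_IWGRP | S_IWOTH)
#define MODEL_S_IXUGO (S_IXUSR | S_IXGRP | S_IXOTH)

/* Default mode as the patch body sets it: writable by the owner only. */
static unsigned int model_mode_from_patch(void)
{
	return MODEL_S_IRUGO | S_IWUSR | MODEL_S_IXUGO;
}

/* Default mode as the changelog describes it: writable by everyone. */
static unsigned int model_mode_from_changelog(void)
{
	return MODEL_S_IRUGO | MODEL_S_IWUGO | MODEL_S_IXUGO;
}

static void model_mode_demo(void)
{
	assert(model_mode_from_patch() == 0755);
	assert(model_mode_from_changelog() == 0777);
}
```

So the code in the patch yields 0755 while the changelog wording would imply 0777, which is the mismatch being pointed out.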
Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown
On Thu, 04 Oct 2007 18:43:32 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add
> ptep_test_and_clear_{dirty,young}" has introduced an SMP race which
> affects the Xen pv-ops backend.

y'know, I think it's been several years since I saw a report of an honest to goodness, genuine SMP race in core kernel. We used to be infested by them, but the term has fallen into disuse. Interesting, but OT.

> It seems to me that there are a few ways to fix this:
>
>    1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled. This
>       will clearly work, but is pretty blunt.
>    2. Make test_and_clear_pte_flags a new paravirt-op, which can be
>       implemented in Xen as a hypercall, and as a raw test_and_clear_bit
>       for everyone else. The downside is adding yet another pv-op.
>    3. Restructure the pagetable setup code so that the mm is not added
>       to the prio tree until after arch_dup_mmap has been called (and
>       the converse for exit_mmap). This is arguably cleaner, but I
>       haven't looked to see how much trouble this would be.
>
> Thoughts anyone? Does making the pagetables visible "early" cause
> problems for anyone else?

I expect that 2) has the maximum niceness*suitable-for-2.6.23 product.

That's if you actually care much about kernel.org major releases - do many people run kernel.org kernels on Xen? If "not many" then we could perhaps do something more elaborate for 2.6.23.1.

But adding ever more pvops as core kernel evolves was always expected.
Re: SLUB performance regression vs SLAB
I just spent some time looking at the functions that you see high in the list. The trouble is that I have to speculate and that I have nothing to verify my thoughts. If you could give me the hitlist for each of the 3 runs then this would help to check my thinking. I could be totally off here.

It seems that we miss the per cpu slab frequently on slab_free(), which leads to the calling of __slab_free(), which in turn needs to take a lock on the page (in the page struct). Typically the page lock is uncontended, which seems not to be the case here; otherwise it would not be that high up.

The per cpu patch in mm should reduce the contention on the page struct by not touching the page struct on alloc and on free. Does not seem to work all the way though. slab_free() still has to touch the page struct if the free is not to the currently active cpu slab.

So there could still be page struct contention left if multiple processors frequently and simultaneously free to the same slab and that slab is not the per cpu slab of a cpu. That could be addressed by optimizing the object free handling further to not touch the page struct even if we miss the per cpu slab.

That get_partial* is far up indicates contention on the list lock that should be addressable by either increasing the slab size or by changing the object free handling to batch in some form.

This is an SMP system right? 2 cores with 4 cpus each? The main loop is always hitting on the same slabs? Which slabs would this be? Am I right in thinking that one process allocates objects and then lets multiple other processors do work and then the allocated object is freed from a cpu that did not allocate the object? If neighboring objects in one slab are allocated on one cpu and then are almost simultaneously freed from a set of different cpus then this may explain the situation.
Re: [kbuild-devel] A bit of kconfig rewrite (Re: [PATCH] 9p: fix compile error if !CONFIG_SYSCTL)
Hi,

On Mon, 1 Oct 2007, Oleg Verych wrote:

> Today's kconfig was proposed and accepted in a very unpleasant
> circumstances, has very poor design, development and no working
> alternative (for 5+ years now).

If you want to make such statements, you have to offer a little more than the hot air you're producing right now...

If you want to improve the design, you're more than welcome. I'm the first one to admit that there's still lots of room for improvement, but if you want to claim this can only be done via a rewrite, then you have to be a lot more specific about what's wrong with the current design and why it's unfixable. Quite some thought has been put into this design and if you were a little more specific, I could actually tell you why it is this way and maybe how to improve it incrementally instead of trying to reinvent everything.

> + shell-like[0] (not like CML1, which is just shell) scripting, allowing
>   to extend easily (if there is no one available) capabilities,
>   config values or actions for particular sub-system or compilation
>   unit,

Just to pick this one point as an example: I like scripting and maybe I should just update the swig wrapper script I already have and merge it, which would make it easier to play with the kconfig database in whatever language you like. OTOH due to the necessary build dependencies I don't see this becoming a mandatory feature, so unless there is a compelling reason a certain set of base functions will remain in C.

bye, Roman
Re: [PATCH 2.6.23-rc4] qconf ("make xconfig") Search Dialog Enhancement (rev8)
Hi,

On Thu, 20 Sep 2007, Shlomi Fish wrote:

> Which specific problems do you see with the coding style of the patch? Can
> you comment on it?

Mostly whitespace around any braces, please keep it close to the other source.

> > I would also prefer to move more of the search functionality into the
> > generic code, so it can be used by other front ends as well, otherwise a
> > lot of this had to be duplicated.
>
> That would be a good idea, but I cannot use Qt there, which makes my job
> harder.

Where is the problem with implementing it in C? Just try to keep it simple at first.

> > I think a filter function makes it maybe a bit too flexible, if a front
> > end wants to do some weird filtering, it can still access the symbol
> > data base directly.
>
> A filter function would still be convenient in this context, as the symbol
> data base API may change, and the filter function has a little logic in it.

This API is not really fixed at the moment, so it's not really a problem.

> > So what I have in mind is something like this:
> >
> > struct symbol **sym_generic_search(const char *pattern, unsigned int flags);
> >
> > This means the back end provides a basic search facility for the most
> > common search operations. The flags would specify what to search (e.g.
> > symbol name, help text, prompts) and how to do it.
>
> I suggest we don't call it sym_generic_search, as generic implies it is a
> generic filter. We can call it "sym_string_search" or whatever. Then, I
> suggest we have separate arguments for every parameter (i.e: search type,
> case sensitivity, what to search, etc.).

I don't care much about the name, but please keep it as a simple flag, which is a lot easier to extend.

bye, Roman
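To make the "single flags argument" idea concrete, here is a minimal userspace sketch. Everything in it is an assumption for illustration: struct model_symbol, the MODEL_SEARCH_* bits and model_sym_search() are hypothetical and are not the kconfig API. Unlike the proposed prototype, the sketch takes the symbol array explicitly instead of walking a global symbol table, and it only does case-sensitive substring matching.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag bits selecting which fields are searched. */
#define MODEL_SEARCH_NAME   0x1u
#define MODEL_SEARCH_PROMPT 0x2u
#define MODEL_SEARCH_HELP   0x4u

/* Hypothetical, much reduced stand-in for a config symbol. */
struct model_symbol {
	const char *name;
	const char *prompt;
	const char *help;	/* may be NULL */
};

static int model_field_matches(const char *field, const char *pattern)
{
	return field != NULL && strstr(field, pattern) != NULL;
}

/*
 * Returns a NULL-terminated, malloc'ed array of matching symbols
 * (the caller frees it), or NULL if the allocation fails.
 */
static struct model_symbol **model_sym_search(struct model_symbol *syms,
					       size_t nsyms,
					       const char *pattern,
					       unsigned int flags)
{
	struct model_symbol **found;
	size_t i, count = 0;

	assert(pattern != NULL);
	found = calloc(nsyms + 1, sizeof(*found));
	if (!found)
		return NULL;

	for (i = 0; i < nsyms; i++) {
		struct model_symbol *sym = &syms[i];

		if (((flags & MODEL_SEARCH_NAME) &&
		     model_field_matches(sym->name, pattern)) ||
		    ((flags & MODEL_SEARCH_PROMPT) &&
		     model_field_matches(sym->prompt, pattern)) ||
		    ((flags & MODEL_SEARCH_HELP) &&
		     model_field_matches(sym->help, pattern)))
			found[count++] = sym;
	}
	return found;
}
```

Adding a new search target or option (say, a hypothetical case-insensitive bit) then only needs a new flag value, which is the extensibility argument made above; separate arguments per option would change the prototype each time.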
Re: [RFC][PATCH] New message-logging API (kprint)
On Thursday 04 October 2007 3:17:03 pm Randy Dunlap wrote:
> On Thu, 04 Oct 2007 22:04:07 +0200 Vegard Nossum wrote:
> > Description: This patch largely implements the kprint API as previously
> > posted to the LKML and described in Documentation/kprint.txt (see patch).
> >
> > The main purpose of this change is provide a unified logging API to the
> > kernel and at the same time make it easy to add extensions, now and
> > later.
> >
> > My changes and additions are as follows:
>
> $ diffstat -p1 -w70 kprint.patch
...
> 40 files changed, 1660 insertions(+), 72 deletions(-)

I started this thread by posting an idea I had for shrinking the kernel by allowing more code to be configured out. The API change was exactly one new parameter, with a direct 1->1 mapping from the old API to the new one, which was trivial to convert and which the compiler would catch if you missed one.

The result of the discussion is a patch adding 1600 lines to the kernel, without removing anything. Last I checked, the current printk() worked just fine. Why is this _not_ the dreaded "infrastructure in search of a use"? What exactly can we _not_ do with the current code? What does this allow us to remove and simplify?

I'm confused about what people are trying to accomplish here...

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.
Re: Vague maybe ppp-related panic report for 2.6.23-rc9
Just as a quick update -- I seem to only be able to reproduce this crash when my ppp session drops, which seems associated with marginal signal. And unfortunately I have great coverage at home so I haven't been able to reproduce this again today. Maybe on the train tomorrow I can crash my laptop...

 - R.
Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown
On Thu, 04 Oct 2007 18:43:32 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> It seems to me that there are a few ways to fix this:
>
>    1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled. This
>       will clearly work, but is pretty blunt.
>    2. Make test_and_clear_pte_flags a new paravirt-op, which can be
>       implemented in Xen as a hypercall, and as a raw
>       test_and_clear_bit for everyone else. The downside is adding yet
>       another pv-op.

Either of these two would work.

Another alternative could be to let test_and_clear_pte_flags have an exception table entry, where we jump right to the next instruction if the instruction clearing the flag fails.

That is essentially the variant you need for Xen, except the fast path is still exactly the same as it is when running on native hardware.

--
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
Linus Torvalds <[EMAIL PROTECTED]> writes:

> To get back to security: I didn't want pluggable security because I
> thought that was a technically good solution. No, the reason Linux has LSM
> (and yes, I was the one who pushed hard for the whole thing, even if I
> didn't actually write any of it) was because the problem wasn't technical
> to begin with.
>
> It was social/political and administrative.
>
> See? Another fundamental difference between schedulers and security
> modules.
>
> But no, that's not really why we have LSM. I'd have *much* preferred to
> have one unified security module setup that we could all agree on, and no
> pluggable security modules. It was not to be - and the reason we have LSM
> is not because "it makes more sense than a CPU scheduler", but simply
> because "people didn't actually get anything done at all, because they
> just argued about what to do".
>
> In the CPU schedulers, Ingo still gets work done, even though people argue
> about it. So we haven't needed to go to the extreme of an "LSM for CPU
> schedulers", because the arguments don't actually hold up the work.
>
> And THAT is what matters in the end.

Sounds good.

I want to inject some fresh ideas into this discussion from a completely different viewpoint, who knows I might get lucky and make things better.

All you can do with the LSM is return -EPERM when the normal unix permissions would not have allowed an operation. I don't see where there is any magic or mystery in that, or any need for deep understanding.

What we want from the LSM is the ability to say -EPERM when we can clearly articulate that we want to disallow something.

SElinux is not all encompassing or it is generally incomprehensible I don't know which. Or someone long ago would have said a better way to implement containers was with a selinux ruleset, here is a selinux ruleset that does that. Although it is completely possible to implement all of the isolation with the existing LSM hooks as Serge showed.

It is a legitimate criticism of the LSM that we are not improving our in-kernel abstractions to allow better concepts to base decisions upon when to return -EPERM. My first dealing with selinux and the lsm was when I fixed a security issue in /proc, fixed the abstractions we were using, and the default selinux security policy had a fit. If we don't have good concepts in /proc/pid/xxx, which is heavily used, it would not surprise me at all if there are lots of other places in the kernel where our abstractions have holes that have not yet been shored up.

We also have in the kernel another parallel security mechanism (for what is generally a different class of operations) that has been quite successful, and different groups get along quite well, and ordinary mortals can understand it. The linux firewalling code.

The linux firewalling code has hooks all throughout the networking stack, just like the LSM has hooks all throughout the rest of linux kernel. There is a difference however. The linux firewalling code in addition to hooks has tables behind those hooks that it consults. There is generic code to walk those tables and consult with different kernel modules to decide if we should drop a packet. Each of those kernel modules provides a different capability that can be used to generate a firewall.

Meanwhile composition of a policy using code from different clients of the LSM hooks is impossible, and thus cooperation or wider use of the LSM hooks is difficult.

So I propose that if people want to work towards a one true linux solution for additional security checks, then they should look towards the linux firewalling code. It works and it seems to very nicely allow cooperation between different groups. For the people who will scream mixing security models causes problems, the answer is simple recommend users don't set up their policies that way.

I know we can't solve human problems with technical measures but perhaps a technical suggestion can open the way to the solution to some human problems.

I'm not yet annoyed enough to go implement an iptables like interface to the LSM enhancing it with more generic mechanism to make the problem simpler, but I'm getting there. Perhaps next time I'm bored.

Eric
race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown
David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add ptep_test_and_clear_{dirty,young}" has introduced an SMP race which affects the Xen pv-ops backend. In Xen, pagetables are normally kept RO so that the hypervisor can mediate all updates to them. If Xen sees a write to an active pagetable (one currently pointed to by cr3) or a pinned pagetable (one currently inactive but registered), it will trap the write fault and emulate the instruction making the update; this means that most pagetable-modifying code doesn't need to know or care that pagetables are RO. When a pagetable is first created (either in execve or fork), the Xen paravirt backend pins the pagetable, and conversely, on exit it is unpinned; this is done via the arch_dup_mmap() and activate_mm() hooks. Pinning is done in two phases: first the pagetable pages are marked RO, and then the pagetable is registered with Xen; unpinning is the opposite. This works assuming that the pagetable is not in use, and not yet visible to other cpus. The race on pagetable creation is this: kernel/fork.c:dup_mmap() copies the old pagetable into the new one, and registers each vma with the rmap prio tree. Once everything is copied, it calls arch_dup_mmap(), which ends up doing the Xen pagetable pin. However, because the pagetable is visible to other cpus via the prio tree, pagetable modifications (specifically, clearing the accessed bit) can race with pinning. If a modification hits after the pagetable pages have been made RO but before they're registered with Xen, the write to the flags will fault, and Xen won't know to do the fixup. The converse is also true in exit_mmap(): arch_exit_mmap() is called before removing the vmas from the prio tree, so clearing the accessed bit can race with unpinning.
The specific oops I'm seeing is this: BUG: unable to handle kernel paging request at virtual address c5b023e8 printing eip: c016d3f2 *pdpt = 4bc1a001 Oops: 0003 [#1] PREEMPT SMP Modules linked in: CPU:1 EIP:0061:[]Not tainted VLI EFLAGS: 00010202 (2.6.23-rc9-paravirt #1656) EIP is at page_referenced_one+0xb8/0x12a eax: c0401b17 ebx: c5b023e8 ecx: c2398000 edx: c044ceca esi: 0087d000 edi: c5660688 ebp: c2399af4 esp: c2399acc ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0069 Process cc1 (pid: 31474, ti=c2398000 task=c2dc9000 task.ti=c2398000) Stack: c04014a7 c040f47a 011e c03697fe c2399b1c c5eb4500 c113e87c c116b1c8 c5660688 c13aa890 c2399b2c c016d4d8 c7917340 0008 c13aa8c4 0005 c116b1c8 0001 c0473940 Call Trace: [] show_trace_log_lvl+0x1a/0x2f [] show_stack_log_lvl+0x9d/0xa5 [] show_registers+0x1f7/0x336 [] die+0x11b/0x23b [] do_page_fault+0x758/0x838 [] error_code+0x72/0x78 [] page_referenced_file+0x74/0xa0 [] page_referenced+0xbd/0xd0 [] shrink_active_list+0x170/0x3a3 [] shrink_zone+0xb9/0xf8 [] try_to_free_pages+0x13c/0x208 [] __alloc_pages+0x197/0x290 [] __do_page_cache_readahead+0xd4/0x1d7 [] do_page_cache_readahead+0x4b/0x56 [] filemap_fault+0x1b7/0x3de [] __do_fault+0x79/0x407 [] handle_mm_fault+0x27e/0xca0 [] do_page_fault+0x391/0x838 [] error_code+0x72/0x78 === Code: 0c fe 97 36 c0 c7 44 24 08 1e 01 00 00 c7 44 24 04 7a f4 40 c0 c7 04 24 a7 14 40 c0 e8 d4 e5 fb ff e8 29 c9 f9 ff f6 03 20 74 27 0f ba 33 05 19 c0 85 c0 74 1c 8b 07 89 f2 89 d9 8d b6 00 00 EIP: [] page_referenced_one+0xb8/0x12a SS:ESP 0069:c2399acc It all worked OK before David's change, because asm-generic/pgtable.h uses set_pte_at(), which ends up making a hypercall to update the pagetable, which always works regardless of the state of the pagetable pages. It seems to me that there are a few ways to fix this: 1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled. This will clearly work, but is pretty blunt. 2. 
Make test_and_clear_pte_flags a new paravirt-op, which can be implemented in Xen as a hypercall, and as a raw test_and_clear_bit for everyone else. The downside is adding yet another pv-op. 3. Restructure the pagetable setup code so that the mm is not added to the prio tree until after arch_dup_mmap has been called (and the converse for exit_mmap). This is arguably cleaner, but I haven't looked to see how much trouble this would be. Thoughts anyone? Does making the pagetables visible "early" cause problems for anyone else? Thanks, J - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
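For discussion, option 2 could look roughly like the following user-space model. This is only a sketch under stated assumptions: pv_mmu_ops_sketch, the two implementations and the fake_hypercalls counter are all made up here, the "Xen-like" variant merely counts where a hypercall would go, and none of this is the actual paravirt_ops layout. The accessed bit is modelled as bit 5, as on x86.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define PTE_ACCESSED (UINT32_C(1) << 5)   /* x86 "accessed" bit */

typedef _Atomic uint32_t pte_t;

/* Hypothetical ops table; the real paravirt_ops layout is not shown here. */
struct pv_mmu_ops_sketch {
	int (*test_and_clear_pte_flags)(pte_t *ptep, uint32_t flags);
};

/* Native flavour: atomically clear the bits directly in the entry. */
static int native_test_and_clear_pte_flags(pte_t *ptep, uint32_t flags)
{
	uint32_t old = atomic_fetch_and(ptep, ~flags);

	return (old & flags) != 0;
}

/* Counts how often the modelled backend would have to ask the hypervisor. */
static unsigned int fake_hypercalls;

/*
 * Mediated flavour: never relies on the entry being directly writable.
 * In this model the update is a plain store; a real backend would issue
 * a hypercall here and the hypervisor would serialise the update.
 */
static int xen_like_test_and_clear_pte_flags(pte_t *ptep, uint32_t flags)
{
	uint32_t old = atomic_load(ptep);

	if (!(old & flags))
		return 0;                 /* nothing to clear, no update */
	fake_hypercalls++;
	atomic_store(ptep, old & ~flags);
	return 1;
}

static const struct pv_mmu_ops_sketch native_ops = {
	native_test_and_clear_pte_flags
};
static const struct pv_mmu_ops_sketch xen_like_ops = {
	xen_like_test_and_clear_pte_flags
};

/* What page_referenced_one() wants: "was it young?", with the bit cleared. */
static int demo_clear_young(const struct pv_mmu_ops_sketch *ops,
			    uint32_t initial)
{
	pte_t pte;
	int was_young;

	atomic_init(&pte, initial);
	was_young = ops->test_and_clear_pte_flags(&pte, PTE_ACCESSED);
	assert((atomic_load(&pte) & PTE_ACCESSED) == 0);
	return was_young;
}
```

The caller sees the same result either way; only the mediated variant stays correct when the pagetable page is read-only but not yet registered, which is the window described above.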
Re: [NFS] What's slated for inclusion in 2.6.24-rc1 from the NFS client git tree...
On Thu, 2007-10-04 at 12:59 -0700, Andrew Morton wrote: > On Thu, 04 Oct 2007 15:16:03 -0400 > Trond Myklebust <[EMAIL PROTECTED]> wrote: > > > > > > > > > That would be perfect. It can even be in non-legacy mode by default, > > > > just as long as you can go back to the old behaviour when/if you run > > > > into a non-LFS application. > > > > > > > > > > Wouldn't a mount option be better? > > > > I suppose that might be OK if you know that the 32-bit legacy > > applications will only touch one or two servers, but that sounds like a > > niche thing. > > > > On the downside, forcing all those people who have portable 64-bit aware > > applications to upgrade their version of mount just in order to have > > stat64() work correctly seems unnecessarily complicated. I'd prefer not > > to have to do that unless someone comes up with a good reason why we > > must. > > Confused. You don't need to modify mount(8) when adding a new mount option? Prior to 2.6.22, the 'mount' program used a binary blob for passing the NFS mount options to the kernel. It is only very recently that we have started doing in-kernel parsing of text strings, and in order to make use of that, people will need to upgrade to the latest version of nfs-utils. Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] signal(i386): alternative signal stack wraparound occurs
Mikael Pettersson wrote:
> On Thu, 4 Oct 2007 21:47:30 +0900, KAMEZAWA Hiroyuki wrote:
>> On Thu, 04 Oct 2007 21:33:12 +0900
>> Shi Weihua <[EMAIL PROTECTED]> wrote:
>>> KAMEZAWA Hiroyuki wrote:
>>>> On Thu, 04 Oct 2007 20:56:14 +0900
>>>> Shi Weihua <[EMAIL PROTECTED]> wrote:
>>>>> stack.ss_sp = addr + pagesize;
>>>>> stack.ss_flags = 0;
>>>>> stack.ss_size = pagesize;
>>>> This is wrong. Use:
>>>> stack.ss_sp = addr;
>>>> stack.ss_flags = 0;
>>>> stack.ss_size = pagesize * 2;
>>> [What the test code wants to do]
>>> addr+pagesize*2 - addr+pagesize -> sigaltstack
>>> addr+pagesize   - addr          -> protected region
>>> The code wants to catch the overflow when %esp enters the protected region.
>> You have to protect the top of the *registered* sigaltstack.
>> The reason for the wraparound is that %esp is set to the bottom of the
>> sigaltstack if it is not within the sigaltstack area when a signal is delivered.
>> What you have to do is protect the top of the registered sigaltstack.
>> If %esp is within the range of the registered sigaltstack at SEGV, the wraparound will stop.
> Exactly right. You mprotect or munmap the end of the altstack, not the area beyond it.

So we tell users, "Even if you protect only half of the mmap'ed space, you must still register all of it with the kernel"?

Here is a picture of my test code's results:

                   No patch          Patched
┌───────────┐
│           │ ← 1  ┌ ← 3           ← 1
│ A         │      │ (wraparound)
│           │      │
│           │ ← 2  │               ← 2
│           │      │
├───────────┤      │
│▒▒▒▒▒▒▒▒▒▒▒│ ← 3  ┘               ← 3
│B▒▒▒▒▒▒▒▒▒▒│                      (caught)
│▒protected▒│
│▒▒▒▒▒▒▒▒▒▒▒│
│▒▒▒▒▒▒▒▒▒▒▒│
└───────────┘
A+B: mmap's space
A:   sigaltstack
B:   protected

I agree that if A+B is registered with the kernel, the wraparound will stop. But if only A is registered, why shouldn't the kernel do something?

Thanks
Shi Weihua

> /Mikael
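For reference, the layout KAMEZAWA and Mikael describe (register the whole mapping, protect its lowest page) can be sketched in user space as follows. This is a minimal sketch, assuming a downward-growing stack and an anonymous private mapping; the helper names setup_guarded_altstack() and altstack_is() are made up for the example, and it does not by itself demonstrate the wraparound.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `pages` pages, make the lowest one inaccessible, and register the
 * whole mapping (guard page included) as the alternate signal stack.
 * Returns the base address of the mapping, or NULL on failure.
 */
static void *setup_guarded_altstack(size_t pages)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	size_t len;
	void *addr;
	stack_t ss;

	assert(pages >= 2);
	if (pagesize <= 0)
		return NULL;
	len = pages * (size_t)pagesize;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return NULL;

	/* The stack grows down, so the guard is the lowest page of the region. */
	if (mprotect(addr, (size_t)pagesize, PROT_NONE) != 0)
		goto fail;

	ss.ss_sp = addr;      /* register from the very start of the mapping */
	ss.ss_flags = 0;
	ss.ss_size = len;     /* ... including the protected page */
	if (sigaltstack(&ss, NULL) != 0)
		goto fail;
	return addr;

fail:
	munmap(addr, len);
	return NULL;
}

/* Report whether `addr` is the currently registered altstack base. */
static int altstack_is(const void *addr, size_t *size_out)
{
	stack_t cur;

	if (sigaltstack(NULL, &cur) != 0)
		return 0;
	*size_out = cur.ss_size;
	return cur.ss_sp == addr;
}
```

With this layout an overflowing handler faults while %esp is still inside the registered range, which is the case the quoted explanation says stops the wraparound.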
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 02:12:30 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > > > I don't think I understand that. Sure, it _shouldn't_ be a problem. But it > > _is_. That's what we're trying to fix, isn't it? > > The problem, I believe is in the memory allocation code, not in fuse. fuse is trying to do something which page reclaim was not designed for. Stuff broke. > In the example, memory allocation may be blocking indefinitely, > because we have 4MB under writeback, even though 28MB can still be > made available. And that _should_ be fixable. Well yes. But we need to work out how, without re-breaking the thing which throttle_vm_writeout() fixed. > > > So the only thing the kernel should be careful about, is not to block > > > on an allocation if not strictly necessary. > > > > > > Actually a trivial fix for this problem could be to just tweak the > > > thresholds, so to make the above scenario impossible. Although I'm > > > still not convinced, this patch is perfect, because the dirty > > > threshold can actually change in time... > > > > > > Index: linux/mm/page-writeback.c > > > === > > > --- linux.orig/mm/page-writeback.c 2007-10-05 00:31:01.0 > > > +0200 > > > +++ linux/mm/page-writeback.c 2007-10-05 00:50:11.0 +0200 > > > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask > > > for ( ; ; ) { > > > get_dirty_limits(_thresh, _thresh, NULL, > > > NULL); > > > > > > + /* > > > +* Make sure the theshold is over the hard limit of > > > +* dirty_thresh + ratelimit_pages * nr_cpus > > > +*/ > > > + dirty_thresh += ratelimit_pages * num_online_cpus(); > > > + > > > /* > > > * Boost the allowable dirty threshold a bit for page > > > * allocators so they don't get DoS'ed by heavy writers > > > > I can probably kind of guess what you're trying to do here. But if > > ratelimit_pages * num_online_cpus() exceeds the size of the offending zone > > then things might go bad. 
> > I think the admin can do quite a bit of other damage, by setting > dirty_ratio too high. > > Maybe this writeback throttling should just have a fixed limit of 80% > ZONE_NORMAL, and limit dirty_ratio to something like 50%. Bear in mind that the same problem will occur for the 16MB ZONE_DMA, and we cannot limit the system-wide dirty-memory threshold to 12MB. iow, throttle_vm_writeout() needs to become zone-aware. Then it only throttles when, say, 80% of ZONE_FOO is under writeback. Except I don't think that'll fix the problem 100%: if your fuse kernel component somehow manages to put 80% of ZONE_FOO under writeback (and remember this might be only 12MB on a 16GB machine) then we get stuck again - the fuse server process (is that the correct terminology, btw?) ends up waiting upon itself. I'll think about it a bit.
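As a back-of-the-envelope illustration of the zone-aware check, here is a small user-space sketch. The struct zone_sketch type and both helpers are hypothetical and only model the arithmetic; they are not the kernel's struct zone or any existing function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone counters, in pages. */
struct zone_sketch {
	unsigned long present_pages;
	unsigned long writeback_pages;
};

/*
 * Throttle only when more than `pct` percent of this zone is under
 * writeback. Cross-multiplying avoids the rounding of a division;
 * a real implementation would have to watch for overflow on 32-bit.
 */
static bool zone_should_throttle(const struct zone_sketch *z, unsigned int pct)
{
	assert(pct <= 100);
	return z->writeback_pages * 100UL > z->present_pages * pct;
}

/* Convert a size in megabytes to a page count for a given page size. */
static unsigned long mb_to_pages(unsigned long mb, unsigned long page_size)
{
	return mb * 1024UL * 1024UL / page_size;
}
```

With 4KB pages, a 16MB zone is 4096 pages, so an 80% limit trips at the 3277th page under writeback. That is a very small absolute amount next to the memory of a large machine, which is why the concern about getting stuck again still applies.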
Re: Vague maybe ppp-related panic report for 2.6.23-rc9
On Thu, Oct 04, 2007 at 01:51:13PM -0700, David Miller wrote: > > I don't want to jump the gun on the analysis but it just might > be the packet sharing fixes Herbert put in a short time ago. I think the only change of mine that could affect ppp over a serial line is this one. I couldn't see anything obvious in it but maybe someone else can. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- 2a38b775b77f99308a4e571c13d908df78ac5e57 diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c index 7e21342..4b49d0e 100644 --- a/drivers/net/ppp_generic.c +++ b/drivers/net/ppp_generic.c @@ -1525,7 +1525,7 @@ ppp_input_error(struct ppp_channel *chan, int code) static void ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch) { - if (skb->len >= 2) { + if (pskb_may_pull(skb, 2)) { #ifdef CONFIG_PPP_MULTILINK /* XXX do channel-level decompression here */ if (PPP_PROTO(skb) == PPP_MP) @@ -1577,7 +1577,7 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb) if (ppp->vj == 0 || (ppp->flags & SC_REJ_COMP_TCP)) goto err; - if (skb_tailroom(skb) < 124) { + if (skb_tailroom(skb) < 124 || skb_cloned(skb)) { /* copy to a new sk_buff with more tailroom */ ns = dev_alloc_skb(skb->len + 128); if (ns == 0) { @@ -1648,23 +1648,29 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb) /* check if the packet passes the pass and active filters */ /* the filter instructions are constructed assuming a four-byte PPP header on each packet */ - *skb_push(skb, 2) = 0; - if (ppp->pass_filter - && sk_run_filter(skb, ppp->pass_filter, -ppp->pass_len) == 0) { - if (ppp->debug & 1) - printk(KERN_DEBUG "PPP: inbound frame not passed\n"); - kfree_skb(skb); - return; - } - if (!(ppp->active_filter - && sk_run_filter(skb, ppp->active_filter, - ppp->active_len) == 0)) - ppp->last_recv = jiffies; - 
skb_pull(skb, 2); -#else - ppp->last_recv = jiffies; + if (ppp->pass_filter || ppp->active_filter) { + if (skb_cloned(skb) && + pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) + goto err; + + *skb_push(skb, 2) = 0; + if (ppp->pass_filter + && sk_run_filter(skb, ppp->pass_filter, +ppp->pass_len) == 0) { + if (ppp->debug & 1) + printk(KERN_DEBUG "PPP: inbound frame " + "not passed\n"); + kfree_skb(skb); + return; + } + if (!(ppp->active_filter + && sk_run_filter(skb, ppp->active_filter, + ppp->active_len) == 0)) + ppp->last_recv = jiffies; + __skb_pull(skb, 2); + } else #endif /* CONFIG_PPP_FILTER */ + ppp->last_recv = jiffies; if ((ppp->dev->flags & IFF_UP) == 0 || ppp->npmode[npi] != NPMODE_PASS) { @@ -1762,7 +1768,7 @@ ppp_receive_mp_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch) struct channel *ch; int mphdrlen = (ppp->flags & SC_MP_SHORTSEQ)? MPHDRLEN_SSN: MPHDRLEN; - if (!pskb_may_pull(skb, mphdrlen) || ppp->mrru == 0) + if (!pskb_may_pull(skb, mphdrlen + 1) || ppp->mrru == 0) goto err; /* no good, throw it away */ /* Decode sequence number and begin/end bits */ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
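For readers not familiar with why the first hunk replaces the skb->len test with pskb_may_pull(): skb->len counts all data in the packet, including paged fragments, while the header is read through the linear area only. The toy model below shows the distinction. It is a deliberately simplified stand-in (struct toy_skb and the toy_* helpers are invented here) and not the kernel's struct sk_buff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a packet buffer; NOT the kernel's struct sk_buff. */
struct toy_skb {
	unsigned char head[64];     /* linear area, directly addressable */
	size_t head_len;            /* bytes currently in the linear area */
	const unsigned char *frag;  /* paged data, not reachable via head[] */
	size_t frag_len;
};

/* Total length, linear plus paged (the analogue of skb->len). */
static size_t toy_len(const struct toy_skb *skb)
{
	return skb->head_len + skb->frag_len;
}

/* Make sure at least n bytes are linear, copying from the fragment if needed. */
static bool toy_may_pull(struct toy_skb *skb, size_t n)
{
	size_t need;

	if (n <= skb->head_len)
		return true;
	if (n > toy_len(skb) || n > sizeof(skb->head))
		return false;
	need = n - skb->head_len;
	memcpy(skb->head + skb->head_len, skb->frag, need);
	skb->head_len += need;
	skb->frag += need;
	skb->frag_len -= need;
	return true;
}

/* Read a 2-byte big-endian protocol field only once both bytes are linear. */
static int toy_proto(struct toy_skb *skb)
{
	if (!toy_may_pull(skb, 2))
		return -1;
	return (skb->head[0] << 8) | skb->head[1];
}
```

A buffer with one linear byte and two paged bytes passes a "len >= 2" test, yet reading head[1] directly would touch data that is not there; the pull step is what makes the read safe.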
Re: 2.6.23-rc9-git2: Known regressions from 2.6.22
On Friday, 5 October 2007 02:11, H. Peter Anvin wrote: > Rafael J. Wysocki wrote: > > > > Subject:vga text console not working on 2.6.23-rc8 > > Submitter: Santiago Garcia Mantinan <[EMAIL PROTECTED]> > > References: http://lkml.org/lkml/2007/9/28/342 > > http://bugzilla.kernel.org/show_bug.cgi?id=9099 > > Handled-By: H. Peter Anvin <[EMAIL PROTECTED]> > > Antonino A. Daplas <[EMAIL PROTECTED]> > > > > This one was user error. Not a regression. OK, will drop. Thanks, Rafael
Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES
On Fri, 5 Oct 2007, Paul Mackerras wrote: > Linus Torvalds writes: > > > > Well, since others definitely don't see this, including me, and I can do > > things like 62MB exec arrays: > > > > [EMAIL PROTECTED] linux]$ echo $(find /home/torvalds/) | wc > > 1 883304 63000962 > > That wouldn't actually do an exec, assuming you're using bash, since > echo is a shell builtin in bash. You'd need to do /bin/echo. Right you are, silly me. But yes, it works for me even with that (and since I downloaded the gcc source tree, it now has six more megs of arguments). I also tested that "ulimit -s" seems to do the right thing for me. I'm also assuming Mathieu is running x86 (or x86-64): HP-PA has a stack that grows upwards, and that has traditionally been exciting. IA64 also has some strange things for the register backing store. Linus
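Anyone trying to reproduce this may want to look at the two numbers involved from a program rather than from the shell. The sketch below uses only sysconf(_SC_ARG_MAX) and getrlimit(RLIMIT_STACK); the helper names are made up, and how the kernel relates the argument limit to the stack limit once MAX_ARG_PAGES is gone is an assumption here, not something this code verifies.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <unistd.h>

/* Upper bound on argv+envp size as reported by the C library, or -1. */
static long arg_max_bytes(void)
{
	return sysconf(_SC_ARG_MAX);
}

/* Soft stack limit in bytes; returns 0 on success and -1 on failure. */
static int stack_limit_bytes(rlim_t *out)
{
	struct rlimit rl;

	assert(out != NULL);
	if (getrlimit(RLIMIT_STACK, &rl) != 0)
		return -1;
	*out = rl.rlim_cur;   /* may be RLIM_INFINITY */
	return 0;
}
```

Printing both values before and after changing "ulimit -s", and then running /bin/echo (not the builtin) with a large argument list, is one way to check the behaviour described above.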
Re: [PATCH] remove throttle_vm_writeout()
> > > This is a somewhat general problem: a userspace process is in the IO > > > path. > > > Userspace block drivers, for example - pretty much anything which involves > > > kernel->userspace upcalls for storage applications. > > > > > > I solved it once in the past by marking the userspace process as > > > PF_MEMALLOC and I beleive that others have implemented the same hack. > > > > > > I suspect that what we need is a general solution, and that the solution > > > will involve explicitly telling the kernel that this process is one which > > > actually cleans memory and needs special treatment. > > > > > > Because I bet there will be other corner-cases where such a process needs > > > kernel help, and there might be optimisation opportunities as well. > > > > > > Problem is, any such mark-me-as-special syscall would need to be > > > privileged, and FUSE servers presently don't require special perms (do > > > they?) > > > > No, and that's a rather important feature, that I'd rather not give > > up. > > Can fuse do it? Perhaps the fs can diddle the server's task_struct at > registration time? No, it's futile. What if another process is involved (ssh in case of sshfs), etc. > > But with the dirty limiting, the memory cleaning really shouldn't > > be a problem, as there is plenty of memory _not_ used for dirty file > > data, that the filesystem can use during the writeback. > > I don't think I understand that. Sure, it _shouldn't_ be a problem. But it > _is_. That's what we're trying to fix, isn't it? The problem, I believe is in the memory allocation code, not in fuse. In the example, memory allocation may be blocking indefinitely, because we have 4MB under writeback, even though 28MB can still be made available. And that _should_ be fixable. > > So the only thing the kernel should be careful about, is not to block > > on an allocation if not strictly necessary. 
> > > > Actually a trivial fix for this problem could be to just tweak the > > thresholds, so to make the above scenario impossible. Although I'm > > still not convinced, this patch is perfect, because the dirty > > threshold can actually change in time... > > > > Index: linux/mm/page-writeback.c > > === > > --- linux.orig/mm/page-writeback.c 2007-10-05 00:31:01.0 +0200 > > +++ linux/mm/page-writeback.c 2007-10-05 00:50:11.0 +0200 > > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask > > for ( ; ; ) { > > get_dirty_limits(_thresh, _thresh, NULL, > > NULL); > > > > + /* > > +* Make sure the theshold is over the hard limit of > > +* dirty_thresh + ratelimit_pages * nr_cpus > > +*/ > > + dirty_thresh += ratelimit_pages * num_online_cpus(); > > + > > /* > > * Boost the allowable dirty threshold a bit for page > > * allocators so they don't get DoS'ed by heavy writers > > I can probably kind of guess what you're trying to do here. But if > ratelimit_pages * num_online_cpus() exceeds the size of the offending zone > then things might go bad. I think the admin can do quite a bit of other damage, by setting dirty_ratio too high. Maybe this writeback throttling should just have a fixed limit of 80% ZONE_NORMAL, and limit dirty_ratio to something like 50%. Miklos - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23-rc9-git2: Known regressions from 2.6.22
Rafael J. Wysocki wrote: Subject:vga text console not working on 2.6.23-rc8 Submitter: Santiago Garcia Mantinan <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/9/28/342 http://bugzilla.kernel.org/show_bug.cgi?id=9099 Handled-By: H. Peter Anvin <[EMAIL PROTECTED]> Antonino A. Daplas <[EMAIL PROTECTED]> This one was user error. Not a regression. -hpa
Re: vm86.c audit_syscall_exit() call trashes registers
On 10/04/2007 07:58 PM, William Cattey wrote: > > Sadly, the effect of the patch is the same as the most recent candidate > patch from Jeremy Fitzhardinge: The EDID transfer still comes up all > zeros. > I think maybe a better question is: why does read_edid still work? The X server might be making some invalid assumption about system state. Comparing the code the two programs use could provide some clues.
Re: 2.6.23-rc7-mm1 -- powerpc rtas panic
On 10/2/07, Tony Breeds <[EMAIL PROTECTED]> wrote: > On Wed, Oct 03, 2007 at 10:30:16AM +1000, Michael Ellerman wrote: > > > I realise it'll make the patch bigger, but this doesn't seem like a > > particularly good name for the variable anymore. > > Sure, what about? > > Clarify when RTAS logging is enabled. > > Signed-off-by: Tony Breeds <[EMAIL PROTECTED]> For what it's worth, on a different ppc64 box, this resolves a similar panic for me. Tested-by: Nishanth Aravamudan <[EMAIL PROTECTED]> Thanks, Nish
Re: vm86.c audit_syscall_exit() call trashes registers
Thanks very much for thinking about this and providing a revised candidate patch. Sadly, the effect of the patch is the same as the most recent candidate patch from Jeremy Fitzhardinge: The EDID transfer still comes up all zeros. This is very perplexing to me. If I take the code that appears in 2.6.18's vm86.c, and simply put #if 0 around the call to audit_syscall_exit I get good data. If this is indeed a correct minimal correction to the audit_syscall_exit code, then perhaps there's some other condition being exercised. I guess my next step is to take the whole pt_regs patch (commit 49d26b6eaa8e970c8cf6e299e6ccba2474191bf5) from kernel.org and see if that has a beneficial effect. -Bill William Cattey Linux Platform Coordinator MIT Information Services & Technology N42-040M, 617-253-0140, [EMAIL PROTECTED] http://web.mit.edu/wdc/www/ On Oct 2, 2007, at 12:44 PM, Chuck Ebbert wrote: On 09/25/2007 07:38 PM, William Cattey wrote: I'd feel a lot more confident we were on the right track if I could just correctly patch Fitzhardinge's cleanup into the test setup I have now. I think you need to zero both registers if you're using 2.6.16, and force %eax as the source so it doesn't choose %ebp? 
--- a/arch/i386/kernel/vm86.c
+++ b/arch/i386/kernel/vm86.c
@@ -306,19 +334,19 @@ static void do_sys_vm86(struct kernel_vm86_struct *info, struct task_struct *tsk
 	tsk->thread.screen_bitmap = info->screen_bitmap;
 	if (info->flags & VM86_SCREEN_BITMAP)
 		mark_screen_rdonly(tsk->mm);
-	__asm__ __volatile__("xorl %eax,%eax; movl %eax,%fs; movl %eax,%gs \n\t");
-	__asm__ __volatile__("movl %%eax, %0\n" :"=r"(eax));
 
 	/*call audit_syscall_exit since we do not exit via the normal paths */
 	if (unlikely(current->audit_context))
-		audit_syscall_exit(AUDITSC_RESULT(eax), eax);
+		audit_syscall_exit(AUDITSC_RESULT(0), 0);
 
 	__asm__ __volatile__(
 		"movl %0,%%esp\n\t"
 		"movl %1,%%ebp\n\t"
+		"mov %2, %%fs\n\t"
+		"mov %2, %%gs\n\t"
 		"jmp resume_userspace"
 		: /* no outputs */
-		:"r" (&info->regs), "r" (task_thread_info(tsk)));
+		:"r" (&info->regs), "r" (task_thread_info(tsk)), "a" (0));
 	/* we never return here */
 }
Re: SLUB performance regression vs SLAB
On 10/04/2007 07:39 PM, David Schwartz wrote: > But this is just a preposterous position to put him in. If there's no > reproducible test case, then why should he care that one program he can't > even see works badly? If you care, you fix it. > People have been trying for years to make reproducible test cases for huge and complex workloads. It doesn't work. The tests that do work take weeks to run and need to be carefully validated before they can be officially released. The open source community can and should be working on similar tests, but they will never be simple.
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 01:26:12 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > This is a somewhat general problem: a userspace process is in the IO path. > > Userspace block drivers, for example - pretty much anything which involves > > kernel->userspace upcalls for storage applications. > > > > I solved it once in the past by marking the userspace process as > > PF_MEMALLOC and I beleive that others have implemented the same hack. > > > > I suspect that what we need is a general solution, and that the solution > > will involve explicitly telling the kernel that this process is one which > > actually cleans memory and needs special treatment. > > > > Because I bet there will be other corner-cases where such a process needs > > kernel help, and there might be optimisation opportunities as well. > > > > Problem is, any such mark-me-as-special syscall would need to be > > privileged, and FUSE servers presently don't require special perms (do > > they?) > > No, and that's a rather important feature, that I'd rather not give > up. Can fuse do it? Perhaps the fs can diddle the server's task_struct at registration time? > But with the dirty limiting, the memory cleaning really shouldn't > be a problem, as there is plenty of memory _not_ used for dirty file > data, that the filesystem can use during the writeback. I don't think I understand that. Sure, it _shouldn't_ be a problem. But it _is_. That's what we're trying to fix, isn't it? > So the only thing the kernel should be careful about, is not to block > on an allocation if not strictly necessary. > > Actually a trivial fix for this problem could be to just tweak the > thresholds, so to make the above scenario impossible. Although I'm > still not convinced, this patch is perfect, because the dirty > threshold can actually change in time... 
> > Index: linux/mm/page-writeback.c > === > --- linux.orig/mm/page-writeback.c 2007-10-05 00:31:01.0 +0200 > +++ linux/mm/page-writeback.c 2007-10-05 00:50:11.0 +0200 > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask > for ( ; ; ) { > get_dirty_limits(_thresh, _thresh, NULL, > NULL); > > + /* > +* Make sure the theshold is over the hard limit of > +* dirty_thresh + ratelimit_pages * nr_cpus > +*/ > + dirty_thresh += ratelimit_pages * num_online_cpus(); > + > /* > * Boost the allowable dirty threshold a bit for page > * allocators so they don't get DoS'ed by heavy writers I can probably kind of guess what you're trying to do here. But if ratelimit_pages * num_online_cpus() exceeds the size of the offending zone then things might go bad. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
On Thu, Oct 04, 2007 at 07:18:47PM -0400, Chuck Ebbert wrote: > > I ran firefox setuid to a different (not my main user), uid+gid, gave > > my main account that gid as a supplemental group, and gave that uid > > access to the X magic cookie. > > You need to use runxas to get any kind of real security. Interesting script - sad how everyone reinvents equivalent things. I had been experimenting with running the whole lot under Xnest, with two extra users - one for the Xnest which had the main X cookie, and another for the browser. But I found that it was just too awkward (since I use multiple browser windows as well as tabs). So I ended up trading a small security gain for usability. The other thing I started playing with was the NX version of Xnest, since it allows for a rootless server... DF
RE: SLUB performance regression vs SLAB
David Miller wrote: > Using an unpublishable benchmark, whose results even cannot be > published, really stretches the limits of "reasonable" don't you > think? > > This "SLUB isn't ready yet" bullshit is just a shamans dance which > distracts attention away from the real problem, which is that a > reproducable, publishable test case, is not being provided to the > developer so he can work on fixing the problem. > > I can tell you this thing would be fixed overnight if a proper test > case had been provided by now. I would just like to echo what you said just a bit angrier. This is the same as someone asking him to fix a bug that they can only see with a binary-only kernel module. I think he's perfectly justified in simply responding "the bug is as likely to be in your code as mine". Now, just because he's justified in doing that doesn't mean he should. I presume he has an honest desire to improve his own code and if they've found a real problem, I'm sure he'd love to fix it. But this is just a preposterous position to put him in. If there's no reproduceable test case, then why should he care that one program he can't even see works badly? If you care, you fix it. Matthew Wilcox wrote: > Yet here we stand. Christoph is aggressively trying to get slab removed > from the tree. There is a testcase which shows slub performing worse > than slab. It's not my fault I can't publish it. And just because I > can't publish it doesn't mean it doesn't exist. It means it may or may not exist. All we have is your word that slub is the problem. If I said I found a bug in the Linux kernel that caused it to panic but I could only reproduce it with the nVidia driver, I'd be laughed at. It may even be that slub is better, your benchmark simply interprets this as worse. Without the details of your benchmark, we can't know. 
For example, I've seen benchmarks that (usually unintentionally) actually do a *variable* amount of work and details of the implementation may result in the benchmark actually doing *more* work, so it taking longer does not mean it ran slower. DS
Re: [patch 1/2] getattr - fill the size of pipes
> Cute feature, but it is (I assume) a Linux-specific extension and is > something which we'll need to maintain for ever and it invites Actually it used to work on the old old Linux pipe code. > unportability to older Linuxes and other OSes and it introduces some risk > of breakage of existing applications. And it slows down fstat on a pipe. Most Sys5-based boxes happen to put the right value there, but not everyone does, and it's not guaranteed in the slightest. > > Given that the info can already be obtained via ioctl(FIONREAD) anyway, I > don't think that (gain > pain)? Nor me - any application trying to reduce the syscall count would just do a very large read and get the data and size in one go.
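For completeness, the existing way to ask how much data is sitting in a pipe looks like this. It is a small self-contained sketch: pipe_pending_bytes() is a made-up helper name, and the example assumes the data fits in the pipe buffer so that write() does not block.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Write n bytes into a fresh pipe and return what FIONREAD reports, or -1. */
static int pipe_pending_bytes(const char *data, size_t n)
{
	int fds[2];
	int pending = -1;

	assert(data != NULL);
	if (pipe(fds) != 0)
		return -1;
	if (write(fds[1], data, n) == (ssize_t)n) {
		/* FIONREAD stores the number of unread bytes in an int. */
		if (ioctl(fds[0], FIONREAD, &pending) != 0)
			pending = -1;
	}
	close(fds[0]);
	close(fds[1]);
	return pending;
}
```

This costs one ioctl() instead of one fstat(), so the proposed st_size change would not save a syscall for callers that only want the byte count.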
Re: [PATCH 2/3] Prepare pid_nr() etc functions to work with not-NULL pids
On Thu, Oct 04, 2007 at 12:54:17PM +0400, Pavel Emelyanov wrote:
> Matt Mackall wrote:
> > On Wed, Oct 03, 2007 at 06:20:43PM +0400, Pavel Emelyanov wrote:
> >> Just make the __pid_nr() etc functions that expect the argument
> >> to always be not NULL.
> >>
> >> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>
> >
> >>  static inline pid_t pid_nr(struct pid *pid)
> >>  {
> >>  	pid_t nr = 0;
> >>  	if (pid)
> >> -		nr = pid->numbers[0].nr;
> >> +		nr = __pid_nr(pid);
> >>  	return nr;
> >>  }
> >
> > Is there a patch that removes these inlines? Otherwise this looks good
> > to me.
>
> Not yet. Some of them are uninlined already, but others are not. I'd like
> to do some testing before uninlining them.

I was asking about the whole function, actually, not the keyword. Is this function not equivalent to __pid_nr now?

-- 
Mathematics is the supreme nostalgia of our time.
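To make the question above concrete, here is a minimal userspace sketch of the two helpers. It assumes simplified stand-in structures (the real struct pid has more fields) and a hypothetical dummy_pid with id 0, as proposed in the cover letter of this series. It is not the kernel code, only an illustration of why the NULL check in pid_nr() becomes redundant once every task points at a non-NULL pid.

```c
#include <stddef.h>
#include <sys/types.h>

/* Userspace stand-ins for the kernel structures; NOT the real layout. */
struct upid {
	pid_t nr;
};

struct pid {
	struct upid numbers[1];
};

/* Variant that expects a non-NULL argument, as in the quoted patch. */
static inline pid_t __pid_nr(struct pid *pid)
{
	return pid->numbers[0].nr;
}

/* NULL-tolerant wrapper, mirroring the quoted hunk. */
static inline pid_t pid_nr(struct pid *pid)
{
	pid_t nr = 0;

	if (pid)
		nr = __pid_nr(pid);
	return nr;
}

/* Hypothetical dummy pid with id 0, assigned instead of NULL. */
static struct pid dummy_pid = { .numbers = { { .nr = 0 } } };
```

With the dummy in place, __pid_nr(&dummy_pid) returns the same 0 that pid_nr(NULL) would, so the two functions only differ for callers that can still pass NULL.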
Re: sata_sil24 broken since 2.6.23-rc4-mm1
On Thu, Oct 04, 2007 at 07:32:52AM +0200, Torsten Kaiser wrote:
> On 10/3/07, Matt Mackall <[EMAIL PROTECTED]> wrote:
> > Well I can see no reason why the vma we just got to by the mm->mmap
> > would have a vm_mm != mm, but I've certainly been wrong before.
> >
> > Try changing it to:
> >
> > 	for (vma = mm->mmap; vma; vma = vma->vm_next)
> > 		if (!is_vm_hugetlb_page(vma)) {
> > 			if (vma->vm_mm != mm)
> > 				printk("WTF: vma->vm_mm %p mm %p\n",
> > 				       vma->vm_mm, mm);
> > 			walk_page_range(vma->vm_mm, vma->vm_start,
> > 					vma->vm_end,
> > 					_refs_walk, vma);
> > 		}
>
> You were right.
> I was able to trigger the error with above printk added, but nothing
> was written to the syslog.
>
> So now I'm rather out of ideas what to test... :(

I'd give your previous bisect step another try.

Looking back at the thread a bit, anything that requires the machine to be off for more than a couple seconds to manifest stops looking like software and firmware and starts looking like a heat-related electrical or mechanical issue. Make sure your backups are current.

-- 
Mathematics is the supreme nostalgia of our time.
Re: [PATCH] remove throttle_vm_writeout()
> This is a somewhat general problem: a userspace process is in the IO path.
> Userspace block drivers, for example - pretty much anything which involves
> kernel->userspace upcalls for storage applications.
>
> I solved it once in the past by marking the userspace process as
> PF_MEMALLOC and I believe that others have implemented the same hack.
>
> I suspect that what we need is a general solution, and that the solution
> will involve explicitly telling the kernel that this process is one which
> actually cleans memory and needs special treatment.
>
> Because I bet there will be other corner-cases where such a process needs
> kernel help, and there might be optimisation opportunities as well.
>
> Problem is, any such mark-me-as-special syscall would need to be
> privileged, and FUSE servers presently don't require special perms (do
> they?)

No, and that's a rather important feature that I'd rather not give up.

But with the dirty limiting, the memory cleaning really shouldn't be a problem, as there is plenty of memory _not_ used for dirty file data that the filesystem can use during the writeback.

So the only thing the kernel should be careful about is not to block on an allocation if not strictly necessary.

Actually, a trivial fix for this problem could be to just tweak the thresholds so as to make the above scenario impossible. Although I'm still not convinced this patch is perfect, because the dirty threshold can actually change over time...

Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c	2007-10-05 00:31:01.0 +0200
+++ linux/mm/page-writeback.c	2007-10-05 00:50:11.0 +0200
@@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	for ( ; ; ) {
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
+		/*
+		 * Make sure the threshold is over the hard limit of
+		 * dirty_thresh + ratelimit_pages * nr_cpus
+		 */
+		dirty_thresh += ratelimit_pages * num_online_cpus();
+
 		/*
 		 * Boost the allowable dirty threshold a bit for page
 		 * allocators so they don't get DoS'ed by heavy writers
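The effect of the threshold tweak can be checked with a small userspace model. This is only a sketch under stated assumptions: the function name and the `patched` flag are made up for illustration, the units are KiB rather than pages, and the figures (3M threshold, 1M ratelimit_pages, 4M under writeback, 10% boost) are the ones from the deadlock example given elsewhere in this thread, with a single CPU assumed.

```c
#include <stdbool.h>

/*
 * Model of the exit condition of the throttle loop: the caller may
 * proceed once the amount under writeback is at or below the dirty
 * threshold plus a 10% boost. With `patched` set, the threshold is
 * first raised by ratelimit_pages * nr_cpus, as in the patch above.
 * All sizes are in KiB here; the kernel counts pages.
 */
bool writeout_may_proceed(unsigned long nr_writeback,
			  unsigned long dirty_thresh,
			  unsigned long ratelimit_pages,
			  unsigned int nr_cpus,
			  bool patched)
{
	if (patched)
		dirty_thresh += ratelimit_pages * nr_cpus;

	/* Boost the allowable threshold by 10%. */
	dirty_thresh += dirty_thresh / 10;

	return nr_writeback <= dirty_thresh;
}
```

With 4096 KiB under writeback, the unpatched limit is 3072 + 307 = 3379 KiB, so the caller keeps waiting. With the patch the limit becomes (3072 + 1024) + 409 = 4505 KiB and the caller gets through. The model says nothing about the threshold changing over time, which is the caveat raised above.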
Re: [patch 1/2] getattr - fill the size of pipes
On Tue, 2 Oct 2007 19:54:53 +0200 (CEST) Jan Engelhardt <[EMAIL PROTECTED]> wrote:

> [PATCH]: Fill the size of pipes
>
> Instead of reporting 0 in size when stating() a pipe, we give the number of
> queued bytes. This might avoid using ioctl(FIONREAD) to get this information.
>
> References and derived from: http://lkml.org/lkml/2007/4/2/138
> Cc: Eric Dumazet <[EMAIL PROTECTED]>
> Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]>

Cute feature, but it is (I assume) a Linux-specific extension and is something which we'll need to maintain for ever and it invites unportability to older Linuxes and other OSes and it introduces some risk of breakage of existing applications. And it slows down fstat on a pipe.

Given that the info can already be obtained via ioctl(FIONREAD) anyway, I don't think that (gain > pain)?
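For reference, the existing ioctl(FIONREAD) route mentioned above looks roughly like this from userspace. The helper names are hypothetical and error handling is kept minimal; it is a sketch of the interface the patch would duplicate, not part of the patch.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Return the number of bytes currently queued in the pipe whose read
 * end is `fd`, or -1 on error.
 */
int pipe_queued_bytes(int fd)
{
	int n = 0;

	if (ioctl(fd, FIONREAD, &n) < 0)
		return -1;
	return n;
}

/*
 * Demo helper: create a pipe, write `len` bytes from `msg` into it and
 * report how many bytes the read end says are queued.
 */
int demo_pipe_queue(const char *msg, size_t len)
{
	int fds[2];
	int queued = -1;

	if (pipe(fds) < 0)
		return -1;
	if (write(fds[1], msg, len) == (ssize_t)len)
		queued = pipe_queued_bytes(fds[0]);
	close(fds[0]);
	close(fds[1]);
	return queued;
}
```

This costs one extra syscall per query, which is the overhead the patch description hopes to avoid by reporting the same number through stat().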
[patch] reiser4: do not allocate struct file on stack
Edward Shishkin wrote: Dave Hansen wrote: ... I think that stack allocation is a pretty nasty trick for a structure that's supposed to be pretty persistent and dynamically allocated, and is certainly something that needs to get fixed up in a proper way. agreed. This works around the problem for now, but this could potentially cause more bugs any time we add some member to 'struct file' and depend on its state being sane anywhere in the VFS. If there's a list anywhere of merge-stopper reiser4 bugs around, this should probably go in there. will be fixed. The promised fixup is attached. Andrew, please apply. Thanks, Edward. Do not allocate struct file on stack, pass the persistent one instead. Signed-off-by: Edward Shishkin <[EMAIL PROTECTED]> --- linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.c| 35 -- linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.h|2 linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/tail_conversion.c | 23 ++ 3 files changed, 26 insertions(+), 34 deletions(-) --- linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.c.orig +++ linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.c @@ -566,23 +566,18 @@ * items or add them to represent a hole at the end of file. The caller has to * obtain exclusive access to the file. 
*/ -static int truncate_file_body(struct inode *inode, loff_t new_size) +static int truncate_file_body(struct inode *inode, struct iattr *attr) { int result; + loff_t new_size = attr->ia_size; if (inode->i_size < new_size) { /* expanding truncate */ - struct dentry dentry; - struct file file; - struct unix_file_info *uf_info; + struct file * file = attr->ia_file; + struct unix_file_info *uf_info = unix_file_inode_data(inode); + + assert("edward-1532", attr->ia_valid & ATTR_FILE); - dentry.d_inode = inode; - file.f_dentry = - file.private_data = NULL; - file.f_pos = new_size; - file.private_data = NULL; - file.f_vfsmnt = NULL; - uf_info = unix_file_inode_data(inode); result = find_file_state(inode, uf_info); if (result) return result; @@ -615,19 +610,19 @@ return result; } } - result = reiser4_write_extent(, NULL, 0, + result = reiser4_write_extent(file, NULL, 0, _size); if (result) return result; uf_info->container = UF_CONTAINER_EXTENTS; } else { if (uf_info->container == UF_CONTAINER_EXTENTS) { -result = reiser4_write_extent(, NULL, 0, +result = reiser4_write_extent(file, NULL, 0, _size); if (result) return result; } else { -result = reiser4_write_tail(, NULL, 0, +result = reiser4_write_tail(file, NULL, 0, _size); if (result) return result; @@ -636,10 +631,10 @@ } BUG_ON(result > 0); INODE_SET_FIELD(inode, i_size, new_size); - file_update_time(); + file_update_time(file); result = reiser4_update_sd(inode); BUG_ON(result != 0); - reiser4_free_file_fsdata(); + reiser4_free_file_fsdata(file); } else result = shorten_file(inode, new_size); return result; @@ -2092,7 +2087,7 @@ * first item is formatting item, therefore there was * incomplete extent2tail conversion. 
Complete it */ - result = extent2tail(unix_file_inode_data(inode)); + result = extent2tail(file, unix_file_inode_data(inode)); else result = -EIO; @@ -2372,7 +2367,7 @@ uf_info->container == UF_CONTAINER_EXTENTS && !should_have_notail(uf_info, inode->i_size) && !rofs_inode(inode)) { - result = extent2tail(uf_info); + result = extent2tail(file, uf_info); if (result != 0) { warning("nikita-3233", "Failed (%d) to convert in %s (%llu)", @@ -2638,7 +2633,7 @@ if (result == 0) result = safe_link_add(inode, SAFE_TRUNCATE); if (result == 0) - result = truncate_file_body(inode, attr->ia_size); + result = truncate_file_body(inode, attr); if (result) warning("vs-1588", "truncate_file failed: oid %lli, " "old size %lld, new size %lld, retval %d", @@ -2724,7 +2719,7 @@ /* truncate file bogy first */ uf_info = unix_file_inode_data(inode); get_exclusive_access(uf_info); - result = truncate_file_body(inode, 0 /* size */ ); + result = shorten_file(inode, 0 /* size */ ); drop_exclusive_access(uf_info); if (result) --- linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.h.orig +++ linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.h @@ -237,7 +237,7 @@ #define WRITE_GRANULARITY 32 int tail2extent(struct unix_file_info *); -int extent2tail(struct unix_file_info *); +int extent2tail(struct file *, struct unix_file_info *); int goto_right_neighbor(coord_t *, lock_handle *); int find_or_create_extent(struct page *); --- linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/tail_conversion.c.orig +++ linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/tail_conversion.c @@ -546,7 +546,7 @@ /* for every page of file: read page, cut part of extent pointing to this page, put data of page tree by tail item */ -int
[patch] reiserfs: do not repair wrong journal params
Jan Engelhardt wrote: On Aug 23 2007 15:59, Martin Vogt wrote: ... Even if knoppix should not be used as a rescue/live CD, then the reiserfs module should not try to correct something, this should be done by another tool.(fsck.reiserfs or a module option...) The attached patch fixes this badness. Thanks, Edward. When mounting a file system with wrong journal params do not try to repair them, suggest fsck instead. Signed-off-by: Edward Shishkin <[EMAIL PROTECTED]> --- linux-2.6.23-rc8-mm2/fs/reiserfs/journal.c | 100 - 1 files changed, 57 insertions(+), 43 deletions(-) --- linux-2.6.23-rc8-mm2/fs/reiserfs/journal.c.orig +++ linux-2.6.23-rc8-mm2/fs/reiserfs/journal.c @@ -2649,6 +2649,61 @@ return result; } +/** + * When creating/tuning a file system user can assign some + * journal params within boundaries which depend on the ratio + * blocksize/standard_blocksize. + * + * For blocks >= standard_blocksize transaction size should + * be not less then JOURNAL_TRANS_MIN_DEFAULT, and not more + * then JOURNAL_TRANS_MAX_DEFAULT. + * + * For blocks < standard_blocksize these boundaries should be + * decreased proportionally. + */ +#define REISERFS_STANDARD_BLKSIZE (4096) + +static int check_advise_trans_params(struct super_block *p_s_sb, + struct reiserfs_journal *journal) +{ +if (journal->j_trans_max) { + /* Non-default journal params. + Do sanity check for them. */ + int ratio = 1; + if (p_s_sb->s_blocksize < REISERFS_STANDARD_BLKSIZE) + ratio = REISERFS_STANDARD_BLKSIZE / p_s_sb->s_blocksize; + + if (journal->j_trans_max > JOURNAL_TRANS_MAX_DEFAULT / ratio || + journal->j_trans_max < JOURNAL_TRANS_MIN_DEFAULT / ratio || + SB_ONDISK_JOURNAL_SIZE(p_s_sb) / journal->j_trans_max < + JOURNAL_MIN_RATIO) { + reiserfs_warning(p_s_sb, + "sh-462: bad transaction max size (%u). 
FSCK?", + journal->j_trans_max); + return 1; + } + if (journal->j_max_batch != (journal->j_trans_max) * + JOURNAL_MAX_BATCH_DEFAULT/JOURNAL_TRANS_MAX_DEFAULT) { + reiserfs_warning(p_s_sb, +"sh-463: bad transaction max batch (%u). FSCK?", +journal->j_max_batch); + return 1; + } + } else { + /* Default journal params. + The file system was created by old version + of mkreiserfs, so some fields contain zeros, + and we need to advise proper values for them */ + if (p_s_sb->s_blocksize != REISERFS_STANDARD_BLKSIZE) + reiserfs_panic(p_s_sb, "sh-464: bad blocksize (%u)", + p_s_sb->s_blocksize); + journal->j_trans_max = JOURNAL_TRANS_MAX_DEFAULT; + journal->j_max_batch = JOURNAL_MAX_BATCH_DEFAULT; + journal->j_max_commit_age = JOURNAL_MAX_COMMIT_AGE; + } + return 0; +} + /* ** must be called once on fs mount. calls journal_read for you */ @@ -2744,49 +2799,8 @@ le32_to_cpu(jh->jh_journal.jp_journal_max_commit_age); journal->j_max_trans_age = JOURNAL_MAX_TRANS_AGE; - if (journal->j_trans_max) { - /* make sure these parameters are available, assign it if they are not */ - __u32 initial = journal->j_trans_max; - __u32 ratio = 1; - - if (p_s_sb->s_blocksize < 4096) - ratio = 4096 / p_s_sb->s_blocksize; - - if (SB_ONDISK_JOURNAL_SIZE(p_s_sb) / journal->j_trans_max < - JOURNAL_MIN_RATIO) - journal->j_trans_max = - SB_ONDISK_JOURNAL_SIZE(p_s_sb) / JOURNAL_MIN_RATIO; - if (journal->j_trans_max > JOURNAL_TRANS_MAX_DEFAULT / ratio) - journal->j_trans_max = - JOURNAL_TRANS_MAX_DEFAULT / ratio; - if (journal->j_trans_max < JOURNAL_TRANS_MIN_DEFAULT / ratio) - journal->j_trans_max = - JOURNAL_TRANS_MIN_DEFAULT / ratio; - - if (journal->j_trans_max != initial) - reiserfs_warning(p_s_sb, - "sh-461: journal_init: wrong transaction max size (%u). 
Changed to %u", - initial, journal->j_trans_max); - - journal->j_max_batch = journal->j_trans_max * - JOURNAL_MAX_BATCH_DEFAULT / JOURNAL_TRANS_MAX_DEFAULT; - } - - if (!journal->j_trans_max) { - /*we have the file system was created by old version of mkreiserfs - so this field contains zero value */ - journal->j_trans_max = JOURNAL_TRANS_MAX_DEFAULT; - journal->j_max_batch = JOURNAL_MAX_BATCH_DEFAULT; - journal->j_max_commit_age = JOURNAL_MAX_COMMIT_AGE; - - /* for blocksize >= 4096 - max transaction size is 1024. For block size < 4096 - trans max size is decreased proportionally */ - if (p_s_sb->s_blocksize < 4096) { - journal->j_trans_max /= (4096 / p_s_sb->s_blocksize); - journal->j_max_batch = (journal->j_trans_max) * 9 / 10; - } - } - + if (check_advise_trans_params(p_s_sb, journal) != 0) + goto free_and_return; journal->j_default_max_commit_age = journal->j_max_commit_age; if (commit_max_age != 0) {
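The acceptance rule introduced by check_advise_trans_params() can be modelled in userspace as below. This is a sketch only: the function name is made up, and the constant values (1024, 256, 900, 2) are assumptions standing in for JOURNAL_TRANS_MAX_DEFAULT, JOURNAL_TRANS_MIN_DEFAULT, JOURNAL_MAX_BATCH_DEFAULT and JOURNAL_MIN_RATIO; the real values live in the reiserfs headers.

```c
#include <stdbool.h>

/* Assumed values; check the reiserfs headers for the real ones. */
#define STANDARD_BLKSIZE   4096u
#define TRANS_MAX_DEFAULT  1024u
#define TRANS_MIN_DEFAULT   256u
#define MAX_BATCH_DEFAULT   900u
#define MIN_RATIO             2u

/*
 * Return true if the on-disk journal parameters would be accepted at
 * mount time, false if the user would be pointed at fsck instead.
 * Nothing is repaired here, matching the intent of the patch.
 */
bool journal_params_ok(unsigned int blocksize, unsigned int journal_size,
		       unsigned int trans_max, unsigned int max_batch)
{
	unsigned int ratio = 1;

	/* Zero means "created by an old mkreiserfs": defaults are used,
	 * which is only valid for the standard block size. */
	if (trans_max == 0)
		return blocksize == STANDARD_BLKSIZE;

	/* Bounds shrink proportionally for small block sizes. */
	if (blocksize < STANDARD_BLKSIZE)
		ratio = STANDARD_BLKSIZE / blocksize;

	if (trans_max > TRANS_MAX_DEFAULT / ratio ||
	    trans_max < TRANS_MIN_DEFAULT / ratio ||
	    journal_size / trans_max < MIN_RATIO)
		return false;

	/* The batch size must keep the default batch/trans proportion. */
	return max_batch == trans_max * MAX_BATCH_DEFAULT / TRANS_MAX_DEFAULT;
}
```

Under these assumed constants, a 1024-byte block size gives ratio 4, so trans_max must lie between 64 and 256 and max_batch must equal trans_max * 900 / 1024.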
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
On 10/04/2007 06:56 PM, Derek Fawcus wrote:

> I ran firefox setuid to a different (not my main user), uid+gid, gave
> my main account that gid as a supplemental group, and gave that uid
> access to the X magic cookie.

You need to use runxas to get any kind of real security.
[IRQ map] VIA C7 CN700 2.6.23-rc9-git USB IRQs disabled
Booting git snapshot of about 6 hours ago, getting the following: USB Universal Host Controller Interface driver v3.0 ACPI: PCI Interrupt Link [ALKB] enabled at IRQ 21 ACPI: PCI Interrupt :00:10.0[A] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 18 ACPI: PCI interrupt for device :00:10.0 disabled uhci_hcd :00:10.0: init :00:10.0 fail, -16 uhci_hcd: probe of :00:10.0 failed with error -16 ACPI: PCI Interrupt :00:10.1[A] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 18 ACPI: PCI interrupt for device :00:10.1 disabled uhci_hcd :00:10.1: init :00:10.1 fail, -16 uhci_hcd: probe of :00:10.1 failed with error -16 ACPI: PCI Interrupt :00:10.2[B] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 18 ACPI: PCI interrupt for device :00:10.2 disabled uhci_hcd :00:10.2: init :00:10.2 fail, -16 uhci_hcd: probe of :00:10.2 failed with error -16 ACPI: PCI Interrupt :00:10.3[B] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 18 ACPI: PCI interrupt for device :00:10.3 disabled uhci_hcd :00:10.3: init :00:10.3 fail, -16 uhci_hcd: probe of :00:10.3 failed with error -16 ACPI: PCI Interrupt :00:10.4[C] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 18 ACPI: PCI interrupt for device :00:10.4 disabled ehci_hcd :00:10.4: init :00:10.4 fail, -16 ehci_hcd: probe of :00:10.4 failed with error -16 With "pci=routeirq" it is the same, but then it's "IRQ 17" instead of 18, and the line ACPI: PCI Interrupt Link [ALKB] enabled at IRQ 21 is missing. Works with Debian etch default 2.6.18. 
/proc/interrupts under .23-rc9-...: $ cat /proc/interrupts CPU0 0: 31756 IO-APIC-edge timer 1: 2 IO-APIC-edge i8042 8: 1 IO-APIC-edge rtc 9: 0 IO-APIC-fasteoi acpi 12: 4 IO-APIC-edge i8042 16: 2627 IO-APIC-fasteoi sata_via 19:472 IO-APIC-fasteoi eth0 Under 2.6.18: ACPI: PCI Interrupt Link [ALKB] enabled at IRQ 21 ACPI: PCI Interrupt :00:10.0[A] -> Link [ALKB] -> GSI 21 (level, low) -> IRQ 177 PCI: VIA IRQ fixup for :00:10.0, from 10 to 1 uhci_hcd :00:10.0: UHCI Host Controller uhci_hcd :00:10.0: new USB bus registered, assigned bus number 1 uhci_hcd :00:10.0: irq 177, io base 0xf900 Thanks Guennadi --- Guennadi Liakhovetski - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 09/12] fuse: add list of writable files to fuse_inode
> hm. At no point in this patch series does anything actually get added to
> these lists, so this patch is presently a no-op.
>
> I'll assume that it will get used later. But it is a bit odd to add
> infrastructure in a patch series, then not use it. Why not hold the patch
> back and include it in the patch series which actually uses these lists for
> something?

My stupidity. I somehow thought the patch does actually do something interesting when including it in this series, instead of holding it back for the writable-mmap series.

Miklos
Re: [patch 12/12] fuse: add blksize field to fuse_attr
> > From: Miklos Szeredi <[EMAIL PROTECTED]>
> >
> > Allow the userspace filesystem to supply a blksize value to be
> > returned by stat() and friends. If the field is zero, it defaults to
> > the old PAGE_CACHE_SIZE value.
> >
>
> Why does fuse need this feature?

There are cases when the filesystem will be passed the buffer from a single read or write call, namely:

1) in 'direct-io' mode (not O_DIRECT), read/write requests don't go through the page cache, but go directly to the userspace fs

2) currently buffered writes are done with single page requests, but if Nick's ->perform_write() patch goes in, it will be possible to do larger write requests. But only if the original write() was also bigger than a page.

In these cases the filesystem might want to give a hint to the app about the optimal I/O size.

Miklos
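On the application side, the hint would be picked up through the st_blksize field that stat() already returns. The following sketch assumes a hypothetical helper name and an arbitrary 4096-byte fallback; it only shows how an application might consume the value, not how fuse fills it in.

```c
#include <stddef.h>
#include <sys/stat.h>

/* Arbitrary fallback for this sketch when no usable hint is available. */
#define FALLBACK_IO_SIZE 4096

/*
 * Return the I/O size the filesystem suggests for `path`, taken from
 * st_blksize, or FALLBACK_IO_SIZE if stat() fails or reports nothing
 * useful.
 */
size_t preferred_io_size(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0 || st.st_blksize <= 0)
		return FALLBACK_IO_SIZE;
	return (size_t)st.st_blksize;
}
```

An application would then size its read()/write() buffer to the returned value, which matters in the two cases listed above because the userspace filesystem sees requests of exactly that size.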
Re: [patch 1/3] Trace code and documentation
Andi Kleen wrote: On Thu, Oct 04, 2007 at 12:19:35PM -0700, David Wilder wrote: Andi Kleen wrote: "David J. Wilder" <[EMAIL PROTECTED]> writes: @@ -0,0 +1,160 @@ +Trace Setup and Control +=== +In the kernel, the trace interface provides a simple mechanism for +starting and managing data channels (traces) to user space. Wasn't relayfs supposed to do that already? Why do you need another wrapper around it? The code in trace is exactly what all the current users of relay do. Therefor trace reduces the duplication of code. If everybody does this then the code should be just put into relayfs? I disagree, I keeping the code separate (layering if you will) makes it easer to use and maintain. Is this also really still faster than a printk below log level (without console driver overhead). If not then why not just use printk? Are you arguing against relayfs or trace? Trace just makes relayfs easer to use. I think relayfs can stand up for it's self. I'm arguing against complicated trace mechanisms that are not fast. What makes trace complicated? It is just, open ,start/stop, close. I can't see how an trace API could be any simpler. At some point when I looked at relayfs it seemed to be reasonably fast (per cpu buffers; not much locking, over head per call roughtly like putchar()), but that might have regressed. No regression has occurred. According the relay documentation if you use global bufferers you must use locking. If you don't want to use locking use per-cpu bufferers. Your example module with its lock definitely looks very slow and I don't approve of it. If you don't approve of the locking then use per-cpu bufferers. The example will do ether. The example shows a way to create an ASCII data layer. ASCII layers don't make much sense imho -- these should just use printk. So the only way I should pass ASCII to user space is using printk? I don't understand that. Again nothing in trace limits you to ASCII data. 
Fast dedicated binary log channels make sense though; but you don't seem really to be very concentrated on that. I impose no restriction on what type of data you can pass over trace's fast dedicated channels. True, to make trace "fast" you need a data layer that can handle the requirements of per-cpu buffers. However there are still advantages of trace over printk even when using global bufferers: selectable bufferer sizes, printk has selectable buffer sizes too. "Long term we probably want more complex tracing based on lttng, but I'm a big fan of starting out simple and doing incremental changes." It's just that relayfs + another not simple layer are definitely not simple. For a simple logger I'm thinking more like something like SGI's old ktrace module (which undoubtedly many other people have recreated many times for specific debugging scenarios) But that all only makes sense if the overhead is really kept low and i don't see that in your approach. Is your complaint with the overhead of setting up a trace channel or the overhead of writing to a trace channel? For the later, trace adds almost no overhead on top of relay. One advantage of the trace approach is separating control and data layers, therefor trace can support multiple data layers to fit multiple requirements. I have my ideas on how to develop data layer, others may have their own ideas and I welcome the input. relayfs was supposed to be that data layer. I am using the layer definitions described in trace.txt. In this definition relay is a buffering layer. PS: Systemtap has been criticized for introducing out-of-tree kernel code. A clear direction from the community is to move re-usable code in-tree where it can be maintained. Trace is a move in that direction. I'm all for that. I believe a simple fast efficient no frills logger would serve systemtap just fine too. But the approach here seems to be more to add all kinds of knobs and whizzles until you end up with something as slow with printk. 
And since we already have printk another one just doesn't seem to make much sense. If by knobs you mean the trace controls. The only one that has any effect on the "speed" of tracing is the control to start and stop tracing. And that had been designed to impose the minimal impact possible (one "if" in the tracing path). -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Thu, 4 Oct 2007 16:40:44 -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote:

> On Oct 04, 2007 13:12 -0700, Andrew Morton wrote:
> > On Mon, 01 Oct 2007 17:35:46 -0700
> > > ext2: Avoid rec_len overflow with 64KB block size
> > >
> > > into 16 bits we have for entry length. So we store 0x instead and
> > > convert value when read from / written to disk.
> >
> > This patch clashes in non-trivial ways with
> > ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> > already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> > patch, please.
>
> If the rec_len overflow patch isn't going to make it, then we also need
> to revert the EXT*_MAX_BLOCK_SIZE change to 65536. It would be possible
> to allow this to be up to 32768 w/o the rec_len overflow fix however.
>

Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and ext2-support-large-blocksize-up-to-pagesize.patch.
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 00:39:16 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps fixing
> > that would fix your deadlock. That's doubtful, but I don't know anything
> > about your deadlock so I cannot say.
>
> No, doing the throttling per-zone won't in itself fix the deadlock.
>
> Here's a deadlock example:
>
> Total memory = 32M
> /proc/sys/vm/dirty_ratio = 10
> dirty_threshold = 3M
> ratelimit_pages = 1M
>
> Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap on
> a fuse fs. Page balancing is called which turns all these into
> writeback pages.
>
> Then userspace filesystem gets a write request, and tries to allocate
> memory needed to complete the writeout.
>
> That will possibly trigger direct reclaim, and throttle_vm_writeout()
> will be called. That will block until nr_writeback goes below 3.3M
> (dirty_threshold + 10%). But since all 4M of writeback is from the
> fuse fs, that will never happen.
>
> Does that explain it better?
>

yup, thanks.

This is a somewhat general problem: a userspace process is in the IO path. Userspace block drivers, for example - pretty much anything which involves kernel->userspace upcalls for storage applications.

I solved it once in the past by marking the userspace process as PF_MEMALLOC and I believe that others have implemented the same hack.

I suspect that what we need is a general solution, and that the solution will involve explicitly telling the kernel that this process is one which actually cleans memory and needs special treatment.

Because I bet there will be other corner-cases where such a process needs kernel help, and there might be optimisation opportunities as well.

Problem is, any such mark-me-as-special syscall would need to be privileged, and FUSE servers presently don't require special perms (do they?)
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
On Wed, Oct 03, 2007 at 01:12:46AM +0100, Alan Cox wrote:

> The value of SELinux (or indeed any system compartmentalising access and
> limiting damage) comes into play when you get breakage - eg via a web
> browser exploit.

Well, being sick of the number of times one has to upgrade the browser for exploits, I addressed it in a different way.

I ran firefox setuid to a different (not my main user), uid+gid, gave my main account that gid as a supplemental group, and gave that uid access to the X magic cookie.

... which only changes the nature of any exploit that might occur - any injected code would have to go via X to attack my main account.

DF
Re: [patch 12/12] fuse: add blksize field to fuse_attr
On Tue, 02 Oct 2007 17:50:38 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> From: Miklos Szeredi <[EMAIL PROTECTED]>
>
> Allow the userspace filesystem to supply a blksize value to be
> returned by stat() and friends. If the field is zero, it defaults to
> the old PAGE_CACHE_SIZE value.
>

Why does fuse need this feature?
Re: [patch 02/12] fuse: fix race between getattr and write
> > @@ -228,6 +243,7 @@ static struct dentry *fuse_lookup(struct
> >  	struct fuse_conn *fc = get_fuse_conn(dir);
> >  	struct fuse_req *req;
> >  	struct fuse_req *forget_req;
> > +	u64 attr_version;
> >
> >  	if (entry->d_name.len > FUSE_NAME_MAX)
> >  		return ERR_PTR(-ENAMETOOLONG);
> > @@ -242,6 +258,10 @@ static struct dentry *fuse_lookup(struct
> >  		return ERR_PTR(PTR_ERR(forget_req));
> >  	}
> >
> > +	spin_lock(&fc->lock);
> > +	attr_version = fc->attr_version;
> > +	spin_unlock(&fc->lock);
>
> You might want to do this (oft-repeated) operation in a little helper
> function.
>
> Because I suspect that the lock isn't needed if CONFIG_64BIT=y.

You're perfectly right, although fuse is not yet at the stage where I'd bother too much with scalability optimizations like that ;)

But it's a good cleanup, and I'll do an incremental patch on top of this if that's OK.

Miklos
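The suggested helper might look roughly like the following userspace sketch. The struct and function names are hypothetical, and a pthread mutex stands in for the kernel spinlock; this is not the incremental patch mentioned above, just an illustration of the shape it could take.

```c
#include <pthread.h>
#include <stdint.h>

/* Userspace stand-in for the connection object; NOT struct fuse_conn. */
struct conn_sketch {
	pthread_mutex_t lock;
	uint64_t attr_version;
};

/*
 * Read the 64-bit attribute version under the lock. On a 32-bit
 * machine a 64-bit load may be split into two accesses, so the lock
 * keeps a concurrent writer from producing a torn value. On a 64-bit
 * machine an aligned load is a single access, which is presumably why
 * the lock was suspected to be unnecessary there.
 */
uint64_t get_attr_version(struct conn_sketch *fc)
{
	uint64_t version;

	pthread_mutex_lock(&fc->lock);
	version = fc->attr_version;
	pthread_mutex_unlock(&fc->lock);
	return version;
}
```

Keeping the read in one helper means a later 64-bit fast path would only need to change this one function rather than every call site.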
Re: [patch 09/12] fuse: add list of writable files to fuse_inode
On Tue, 02 Oct 2007 17:50:35 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > From: Miklos Szeredi <[EMAIL PROTECTED]> > > Each WRITE request must carry a valid file descriptor. When a page is > written back from a memory mapping, the file through which the page > was dirtied is not available, so a new mechanism is needed to find a > suitable file in ->writepage(s). > > A list of fuse_files is added to fuse_inode. The file is removed from > the list in fuse_release(). > > This patch is in preparation for writable mmap support. > > Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]> > --- > > Index: linux/fs/fuse/file.c > === > --- linux.orig/fs/fuse/file.c 2007-10-01 22:42:26.0 +0200 > +++ linux/fs/fuse/file.c 2007-10-01 22:42:27.0 +0200 > @@ -56,6 +56,7 @@ struct fuse_file *fuse_file_alloc(void) > kfree(ff); > ff = NULL; > } > + INIT_LIST_HEAD(&ff->write_entry); > atomic_set(&ff->count, 0); > } > return ff; > @@ -150,12 +151,18 @@ int fuse_release_common(struct inode *in > { > struct fuse_file *ff = file->private_data; > if (ff) { > + struct fuse_conn *fc = get_fuse_conn(inode); > + > fuse_release_fill(ff, get_node_id(inode), file->f_flags, > isdir ? FUSE_RELEASEDIR : FUSE_RELEASE); > > /* Hold vfsmount and dentry until release is finished */ > ff->reserved_req->vfsmount = mntget(file->f_path.mnt); > ff->reserved_req->dentry = dget(file->f_path.dentry); > + > + spin_lock(&fc->lock); > + list_del(&ff->write_entry); > + spin_unlock(&fc->lock); > /* > * Normally this will send the RELEASE request, > * however if some asynchronous READ or WRITE requests > Index: linux/fs/fuse/fuse_i.h > === > --- linux.orig/fs/fuse/fuse_i.h 2007-10-01 22:42:24.0 +0200 > +++ linux/fs/fuse/fuse_i.h 2007-10-01 22:43:15.0 +0200 > @@ -70,6 +70,9 @@ struct fuse_inode { > > /** Version of last attribute change */ > u64 attr_version; > + > + /** Files usable in writepage. Protected by fc->lock */ > + struct list_head write_files; > }; > > /** FUSE specific file data */ > @@ -82,6 +85,9 @@ struct fuse_file { > > /** Refcount */ > atomic_t count; > + > + /** Entry on inode's write_files list */ > + struct list_head write_entry; > }; > > /** One input argument of a request */ > Index: linux/fs/fuse/inode.c > === > --- linux.orig/fs/fuse/inode.c 2007-10-01 22:42:24.0 +0200 > +++ linux/fs/fuse/inode.c 2007-10-01 22:42:27.0 +0200 > @@ -56,6 +56,7 @@ static struct inode *fuse_alloc_inode(st > fi->i_time = 0; > fi->nodeid = 0; > fi->nlookup = 0; > + INIT_LIST_HEAD(&fi->write_files); > fi->forget_req = fuse_request_alloc(); > if (!fi->forget_req) { > kmem_cache_free(fuse_inode_cachep, inode); > @@ -68,6 +69,7 @@ static struct inode *fuse_alloc_inode(st > static void fuse_destroy_inode(struct inode *inode) > { > struct fuse_inode *fi = get_fuse_inode(inode); > + BUG_ON(!list_empty(&fi->write_files)); > if (fi->forget_req) > fuse_request_free(fi->forget_req); > kmem_cache_free(fuse_inode_cachep, inode); Hm. At no point in this patch series does anything actually get added to these lists, so this patch is presently a no-op. I'll assume that it will get used later. But it is a bit odd to add infrastructure in a patch series, then not use it. Why not hold the patch back and include it in the patch series which actually uses these lists for something?
[OOPS] AXIS 700 Lite (VIA C7 CPU) BUG with 2.6.23-rc9-git (i2c)
Hi Got an AXIS 700 Lite thin client with a C7 CPU and CN700 chipset in it, compiled today's git snapshot, and it Oopses in i2c_viapro: BUG: unable to handle kernel paging request at virtual address 016c0555 printing eip: c01a60ed *pde = Oops: [#1] PREEMPT Modules linked in: i2c_viapro i2c_dev i2c_core loop CPU:0 EIP:0060:[]Not tainted VLI EFLAGS: 00010282 (2.6.23-rc9-g804b3f9a #5) EIP is at sysfs_create_group+0x1d/0xe0 eax: f889b828 ebx: ecx: f7d764b0 edx: 016c0555 esi: 016c0555 edi: ebp: f7d71dc4 esp: f7d71da8 ds: 007b es: 007b fs: gs: 0033 ss: 0068 Process modprobe (pid: 1214, ti=f7d7 task=c18f3560 task.ti=f7d7) Stack: c03ceba0 f7d7647c f7d71dd8 c01a392b f7d764b0 f889f3a0 f7d71ddc c0248b38 f889b828 f889b7c0 f7d71df4 c0248c0c f889b7c0 f889b864 f7d71e20 c02491ee f889b828 c0367b56 f889b864 Call Trace: [] show_trace_log_lvl+0x1c/0x40 [] show_stack_log_lvl+0x9a/0xc0 [] show_registers+0x1dc/0x340 [] die+0x102/0x210 [] do_page_fault+0x266/0x600 [] error_code+0x6a/0x70 [] device_add_groups+0x28/0x60 [] device_add_attrs+0x5c/0xb0 [] device_add+0xfe/0x330 [] device_register+0x12/0x20 [] i2c_register_adapter+0xbd/0x170 [i2c_core] [] i2c_add_adapter+0x7a/0x80 [i2c_core] [] vt596_probe+0x145/0x370 [i2c_viapro] [] pci_call_probe+0xd/0x10 [] __pci_device_probe+0x4f/0x60 [] pci_device_probe+0x29/0x50 [] really_probe+0x94/0x140 [] driver_probe_device+0x40/0x60 [] __driver_attach+0x7a/0x80 [] bus_for_each_dev+0x54/0x70 [] driver_attach+0x19/0x20 [] bus_add_driver+0x77/0x130 [] driver_register+0x75/0x80 [] __pci_register_driver+0x4a/0x80 [] i2c_vt596_init+0x17/0x19 [i2c_viapro] [] sys_init_module+0xe2/0x140 [] sysenter_past_esp+0x5f/0x85 === Code: e8 8d b6 00 00 00 00 8d bc 27 00 00 00 00 55 89 e5 56 89 d6 53 83 ec 14 85 c0 0f 84 b2 00 00 00 8b 48 30 85 c9 0f 84 a7 00 00 00 <8b> 12 85 d2 0f 85 89 00 00 00 89 4d f4 8b 5d f4 85 db 74 0b 8b EIP: [] sysfs_create_group+0x1d/0xe0 SS:ESP 0068:f7d71da8 Thanks Guennadi --- Guennadi Liakhovetski - To unsubscribe from this list: send the line 
"unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: JBD-DEBUG /proc/sys entry (again)
On Thu, 04 Oct 2007 16:28:07 +0400 Rusev <[EMAIL PROTECTED]> wrote: > All that should be moved to DEBUGFS under /sys/kernel/debug and so on > - that's right, a bit other issue > is of interest for me. > > My suggestion is that a few other problems with PROCFS exist: > > From my observation there are two major issues are involved: > > 1. /proc/sys entry has very specific readdir operation (vs. other > entries such as /proc/drivers and others) > entries to /proc/sys is likely is to be performed by means of other API. > (quick search found thet API explanation for 2.4.xx > http://www.opentech.at/papers/embedded_proc/node33.html, > yet it looks to be a little change in 2.6) > > 2. function xlate_proc_name behaves not the way it specified in it's > header comment: > [1] minor issue is that proc_match wolud likely return "equals" > as result of comparison of "sys" and "sysvipc" > [2] more significant issue is that it can't properly walk long > paths such as /proc/sys/jbd/jbd-debug, >but only paths likes /proc/sys/jbd-debug (just one step > down, path walking is broken). > > This way we can't add not only /proc/sys/jbd/jbd-debug but any path > likes /proc/aaa/bbb/xxx-debug at one step. > The entry /proc/sys is still specific, because even if fixing > xlate_proc_name we can't see /proc/sys/jbd/jbd-debug > in userspace and successfully see /proc/aaa/bbb/jbd-debug. This patch is wrong. xlate_pro_name() is meant to check if a given path is valid, creating new directory entries is something that need to be handle by the code that's creating the entries. Also note that xlate_pro_name() is also called by remove_proc_entry() so if I call it with a bogus path, this patch will end up creating new directory entries which is not the intended result. > That's because /proc/sys specific operator readdir blocks such PROCFS > entries that they are NOTproperly registersd > with CTL_TABLE. 
> > Yet I think that we have a general problem with > adding-long-paths-in-one-step, which is addressed by the following patch: This should not be done is user one step and for good reason. If you blindly create multiple directory entries in /proc, how are you going to keep track of all the created entries when its time to remove them (module unloading for example)? If you enter an invalid path the original code is doing the right thing by returning -ENOENT. > > > > diff -uprN linux-2.6.21.orig/fs/proc/generic.c > linux-2.6.21/fs/proc/generic.c > --- linux-2.6.21.orig/fs/proc/generic.c 2007-09-13 15:36:07.0 +0400 > +++ linux-2.6.21/fs/proc/generic.c 2007-10-03 22:12:57.0 +0400 > @@ -298,6 +298,7 @@ static int xlate_proc_name(const char *n > int len; > int rtn = 0; > > + White space damage. > spin_lock(_subdir_lock); > de = _root; > while (1) { > @@ -305,24 +306,52 @@ static int xlate_proc_name(const char *n > if (!next) > break; > > - len = next - cp; > - for (de = de->subdir; de ; de = de->next) { > - if (proc_match(len, cp, de)) > - break; > - } > - if (!de) { > - rtn = -ENOENT; > - goto out; > - } > - cp += len + 1; > - } > +++next; > + > + > +len = next - cp; > + > +if(de->subdir == NULL){ > + /* directory "de" is empty, add myself to it now */ > + char* my_name = kzalloc( (len - 1) + 1, GFP_KERNEL); You did not check if kzalloc was successfully. If the allocation fails, bad things will happen here. Need to check the return status of my_name and return -ENOMEM if the allocation fails. This would of course mean an API change and you would need make sure that all the callers of xlate_proc_name handle the new return code correctly. 
> + memcpy(my_name, cp, len - 1); > + proc_mkdir(my_name,de); > + kfree(my_name); > +} > + > + > +struct proc_dir_entry *parent_de = de; > +for (de = parent_de->subdir; de ; de = de->next) { > + if (proc_match(len - 1, cp, de)) > +break; > + > +} > + > +if(de == NULL){ > + /* we found no appropriate subdirectory, well create > it now */ 1. Email client cut the line. Disable line wrapping. 2. Line too long - Documentation/CodingStyle > + char* my_name = kzalloc( (len - 1) + 1, GFP_KERNEL); Again, check for kzalloc return status. > + memcpy(my_name, cp, len - 1); > + de = proc_mkdir(my_name,parent_de); > + kfree(my_name); > +} > + > + > + White space damage.
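On the kzalloc() point raised in the review above, here is a minimal userspace sketch of what a checked allocation could look like. It is not code from fs/proc/generic.c: copy_component() is a hypothetical helper name, calloc() stands in for kzalloc(), and the -ENOMEM return is the convention the reviewer asks for, which the callers would still have to be taught to handle.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not in the kernel tree): copy the first `len`
 * bytes of `cp` into a freshly allocated, NUL-terminated string.
 * calloc() stands in for kzalloc(). Returns 0 on success or -ENOMEM.
 */
static int copy_component(const char *cp, size_t len, char **out)
{
	char *name;

	assert(cp && out);

	name = calloc(len + 1, 1);
	if (!name)
		return -ENOMEM;	/* the check the patch is missing */

	memcpy(name, cp, len);	/* trailing NUL comes from calloc() */
	*out = name;
	return 0;
}
```

The caller owns the returned string and frees it once the entry has been created, as the patch already does with kfree().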
Re: [patch 02/12] fuse: fix race between getattr and write
On Tue, 02 Oct 2007 17:50:28 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > @@ -228,6 +243,7 @@ static struct dentry *fuse_lookup(struct > struct fuse_conn *fc = get_fuse_conn(dir); > struct fuse_req *req; > struct fuse_req *forget_req; > + u64 attr_version; > > if (entry->d_name.len > FUSE_NAME_MAX) > return ERR_PTR(-ENAMETOOLONG); > @@ -242,6 +258,10 @@ static struct dentry *fuse_lookup(struct > return ERR_PTR(PTR_ERR(forget_req)); > } > > + spin_lock(&fc->lock); > + attr_version = fc->attr_version; > + spin_unlock(&fc->lock); You might want to do this (oft-repeated) operation in a little helper function. Because I suspect that the lock isn't needed if CONFIG_64BIT=y.
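A rough userspace sketch of the helper suggested above, for illustration only. The struct, the helper name conn_get_attr_version() and the use of a pthread mutex in place of the fuse spinlock are all assumptions; the thread does not show the eventual kernel helper. The 64-bit branch relies on the common kernel assumption that a naturally aligned 64-bit load is a single access on 64-bit machines, which is what the CONFIG_64BIT remark is about.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Stand-in for struct fuse_conn; only the two fields used here. */
struct conn {
	pthread_mutex_t lock;
	uint64_t attr_version;
};

/* Hypothetical helper name; the thread does not name one. */
static uint64_t conn_get_attr_version(struct conn *c)
{
	uint64_t v;

	assert(c);
#if UINTPTR_MAX > 0xffffffffu
	/* 64-bit build: assume an aligned u64 is read in one load. */
	v = *(volatile uint64_t *)&c->attr_version;
#else
	/* 32-bit build: two loads could straddle an update, so lock. */
	pthread_mutex_lock(&c->lock);
	v = c->attr_version;
	pthread_mutex_unlock(&c->lock);
#endif
	return v;
}
```

Keeping the read in one place means the locking decision is made once instead of at every call site.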
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Oct 04, 2007 13:12 -0700, Andrew Morton wrote: > On Mon, 01 Oct 2007 17:35:46 -0700 > > ext2: Avoid rec_len overflow with 64KB block size > > > > into 16 bits we have for entry length. So we store 0x instead and > > convert value when read from / written to disk. > > This patch clashes in non-trivial ways with > ext2-convert-to-new-aops-fix.patch and perhaps other things which are > already queued for 2.6.24 inclusion, so I'll need to ask for an updated > patch, please. If the rec_len overflow patch isn't going to make it, then we also need to revert the EXT*_MAX_BLOCK_SIZE change to 65536. It would be possible to allow this to be up to 32768 w/o the rec_len overflow fix however. Yes, this does imply that those patches were in the wrong order in the patch series, and I apologize for that, even if it isn't my fault. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
Re: [PATCH] remove throttle_vm_writeout()
> None of the above. > > [PATCH] vm: pageout throttling > > With silly pageout testcases it is possible to place huge amounts of > memory > under I/O. With a large request queue (CFQ uses 8192 requests) it is > possible to place _all_ memory under I/O at the same time. > > This means that all memory is pinned and unreclaimable and the VM gets > upset and goes oom. > > The patch limits the amount of memory which is under pageout writeout to > be > a little more than the amount of memory at which balance_dirty_pages() > callers will synchronously throttle. > > This means that heavy pageout activity can starve heavy writeback activity > completely, but heavy writeback activity will not cause starvation of > pageout. Because we don't want a simple `dd' to be causing excessive > latencies in page reclaim. > > afaict that problem is still there. It is possible to get all of > ZONE_NORMAL dirty on a highmem machine. With a large queue (or lots of > queues), vmscan can then place all of ZONE_NORMAL under IO. > > It could be that we've fixed this problem via other means in the interim, > but from a quick peek it seems to me that the scanner will still do a 100% > CPU burn when all of a zone's pages are under writeback. Ah, OK. I did read the changelog, but you added quite a bit of translation ;) > throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps fixing > that would fix your deadlock. That's doubtful, but I don't know anything > about your deadlock so I cannot say. No, doing the throttling per-zone won't in itself fix the deadlock. Here's a deadlock example: Total memory = 32M /proc/sys/vm/dirty_ratio = 10 dirty_threshold = 3M ratelimit_pages = 1M Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap on a fuse fs. Page balancing is called which turns all these into writeback pages. Then userspace filesystem gets a write request, and tries to allocate memory needed to complete the writeout. That will possibly trigger direct reclaim, and throttle_vm_writeout() will be called. That will block until nr_writeback goes below 3.3M (dirty_threshold + 10%). But since all 4M of writeback is from the fuse fs, that will never happen. Does that explain it better? Miklos
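The numbers in the deadlock example above can be checked with a few lines of arithmetic. This is only a sketch of the example as stated: the struct and function names are made up for illustration, sizes are in KiB, and dirty_threshold and ratelimit_pages are taken as given rather than derived from total memory.

```c
#include <assert.h>

/* All sizes in KiB; values are the ones quoted in the example. */
struct vm_example {
	long dirty_threshold;
	long ratelimit_pages;
};

/* Amount a task can dirty before page balancing is called. */
static long max_dirtied(const struct vm_example *e)
{
	return e->dirty_threshold + e->ratelimit_pages;
}

/* Writeback level below which the throttled task stops waiting:
 * dirty_threshold plus 10%, as stated in the example. */
static long throttle_limit(const struct vm_example *e)
{
	return e->dirty_threshold + e->dirty_threshold / 10;
}

/* Nonzero if the writeback owned by the throttled filesystem alone
 * already exceeds the limit, so the wait can never end. */
static int would_deadlock(const struct vm_example *e, long fuse_writeback)
{
	assert(e);
	return fuse_writeback > throttle_limit(e);
}
```

With dirty_threshold = 3072 and ratelimit_pages = 1024, max_dirtied() gives 4096 (4M) and throttle_limit() gives 3379 (about 3.3M), so 4M of fuse-owned writeback stays above the limit until the userspace filesystem makes progress, and it is the userspace filesystem that is being throttled.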
[PATCH 0/1] ia64: Convert cpu_sibling_map to a per_cpu data array FIX
The previous version of this patch missed a code path in inserting the boot cpu into the cpu sibling and core maps. This fix corrects that omission.
[PATCH 1/1] ia64: Convert cpu_sibling_map to a per_cpu data array FIX
There are two versions of per_cpu_init() for ia64. This patch corrects the problem that one of the versions did not insert the boot cpu into the cpu sibling and core maps. Signed-off-by: Mike Travis <[EMAIL PROTECTED]> --- arch/ia64/kernel/setup.c |8 arch/ia64/mm/contig.c|6 -- 2 files changed, 8 insertions(+), 6 deletions(-) --- linux.orig/arch/ia64/kernel/setup.c 2007-10-04 14:38:53.0 -0700 +++ linux/arch/ia64/kernel/setup.c 2007-10-04 14:51:46.289055433 -0700 @@ -873,6 +873,14 @@ cpu_init (void) void *cpu_data; cpu_data = per_cpu_init(); + /* +* insert boot cpu into sibling and core mapes +* (must be done after per_cpu area is setup) +*/ + if (smp_processor_id() == 0) { + cpu_set(0, per_cpu(cpu_sibling_map, 0)); + cpu_set(0, cpu_core_map[0]); + } /* * We set ar.k3 so that assembly code in MCA handler can compute --- linux.orig/arch/ia64/mm/contig.c2007-10-04 14:38:53.0 -0700 +++ linux/arch/ia64/mm/contig.c 2007-10-04 14:50:12.699513748 -0700 @@ -212,12 +212,6 @@ per_cpu_init (void) cpu_data += PERCPU_PAGE_SIZE; per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu]; } - /* -* cpu_sibling_map is now a per_cpu variable - it needs to -* be accessed after per_cpu_init() sets up the per_cpu area. -*/ - cpu_set(0, per_cpu(cpu_sibling_map, 0)); - cpu_set(0, cpu_core_map[0]); } return __per_cpu_start + __per_cpu_offset[smp_processor_id()]; } -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH for testing] Re: Decreasing stime running confuses top
On Friday, 5 October 2007, Chuck Ebbert wrote: > On 10/04/2007 05:10 PM, Christian Borntraeger wrote: > > > > > Alternative patch: > > procfs: Don't read runtime twice when computing task's stime > > Current code reads p->se.sum_exec_runtime twice and goes through > multiple type conversions to calculate stime. Read it once and > skip some of the conversions. > > Signed-off-by: Chuck Ebbert <[EMAIL PROTECTED]> Looks better and makes the code nicer. s390 and power should work as well, since CONFIG_VIRT_CPU_ACCOUNTING is unaffected. If Frans successfully tests this patch, feel free to add Acked-by: Christian Borntraeger <[EMAIL PROTECTED]> Christian
Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES
Linus Torvalds writes: > Well, since others definitely don't see this, including me, and I can do > things like 62MB exec arrays: > > [EMAIL PROTECTED] linux]$ echo $(find /home/torvalds/) | wc > 1 883304 63000962 That wouldn't actually do an exec, assuming you're using bash, since echo is a shell builtin in bash. You'd need to do /bin/echo. Paul. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 03:07:18PM -0700, David Miller wrote: > From: Chuck Ebbert <[EMAIL PROTECTED]> Date: Thu, 04 Oct 2007 17:47:48 > -0400 > > > On 10/04/2007 05:11 PM, David Miller wrote: > > > From: Chuck Ebbert <[EMAIL PROTECTED]> Date: Thu, 04 Oct 2007 17:02:17 > > > -0400 > > > > > >> How do you simulate reading 100TB of data spread across 3000 disks, > > >> selecting 10% of it using some criterion, then sorting and summarizing > > >> the result? > > > > > > You repeatedly read zeros from a smaller disk into the same amount of > > > memory, and sort that as if it were real data instead. > > > > You've just replaced 3000 concurrent streams of data with a single stream. > > That won't test the memory allocator's ability to allocate memory to many > > concurrent users very well. > > You've kindly removed my "thinking outside of the box" comment. > > The point was not that my specific suggestion would be perfect, but that > if you used your creativity and thought in similar directions you might find > a way to do it. > > People are too narrow minded when it comes to these things, and that's the > problem I want to address. And it's a good point, too, because often problems to one person are a no-brainer to someone else. Creating lots of "fake" disks is trivial to do, IMO. Use loopback on sparse files containing sparse files, use ramdisks containing sparse files or write a sparse dm target for sparse block device mapping, etc. I'm sure there's more than the few I just threw out... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
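As a small illustration of the sparse-file idea mentioned above, the following sketch creates a file with a large apparent size and no data blocks written. make_sparse_file() is a hypothetical name, and whether the result is really sparse on disk depends on the filesystem, which this code does not check.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or truncate) `path` and extend it to `size` bytes without
 * writing any data, so the file is one big hole where the filesystem
 * supports holes. Returns 0 on success, -1 on error (errno is set by
 * the failing call).
 */
static int make_sparse_file(const char *path, off_t size)
{
	int fd;

	assert(path && size >= 0);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

Calling this in a loop with different file names would give many backing files that could then be attached to loop devices, which is one way to read the "loopback on sparse files" suggestion.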
2.6.23-rc9-rt2
We are pleased to announce the 2.6.23-rc9-rt2 tree, which can be downloaded from the new location: http://www.kernel.org/pub/linux/kernel/projects/rt/ Changes since 2.6.23-rc9-rt1 - x86_64 disable IST for debug (Andi Kleen) - Better handling of dynticks going bad in RCU (Steven Rostedt) - Preempt RCU boosting (Steven Rostedt based on Paul E. McKenney's stuff) Again, this still holds experimental code. But I've been running it on a few boxes already (and even the box I'm writing this on). To build a 2.6.23-rc9-rt2 tree, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.22.tar.bz2 http://www.kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.23-rc9.bz2 http://www.kernel.org/pub/linux/kernel/projects/rt/patch-2.6.23-rc9-rt2.bz2 The broken out patches are also available. -- Steve
Re: SLUB performance regression vs SLAB
From: Chuck Ebbert <[EMAIL PROTECTED]> Date: Thu, 04 Oct 2007 17:47:48 -0400 > On 10/04/2007 05:11 PM, David Miller wrote: > > From: Chuck Ebbert <[EMAIL PROTECTED]> > > Date: Thu, 04 Oct 2007 17:02:17 -0400 > > > >> How do you simulate reading 100TB of data spread across 3000 disks, > >> selecting 10% of it using some criterion, then sorting and > >> summarizing the result? > > > > You repeatedly read zeros from a smaller disk into the same amount of > > memory, and sort that as if it were real data instead. > > You've just replaced 3000 concurrent streams of data with a single > stream. That won't test the memory allocator's ability to allocate > memory to many concurrent users very well. You've kindly removed my "thinking outside of the box" comment. The point was not that my specific suggestion would be perfect, but that if you used your creativity and thought in similar directions you might find a way to do it. People are too narrow minded when it comes to these things, and that's the problem I want to address.
Re: [PATCH for testing] Re: Decreasing stime running confuses top
On 10/04/2007 05:10 PM, Christian Borntraeger wrote: > Alternative patch: procfs: Don't read runtime twice when computing task's stime Current code reads p->se.sum_exec_runtime twice and goes through multiple type conversions to calculate stime. Read it once and skip some of the conversions. Signed-off-by: Chuck Ebbert <[EMAIL PROTECTED]> --- linux-2.6.23-rc6-dell.orig/fs/proc/array.c +++ linux-2.6.23-rc6-dell/fs/proc/array.c @@ -334,39 +334,38 @@ static cputime_t task_stime(struct task_ return p->stime; } #else -static cputime_t task_utime(struct task_struct *p) +static clock_t __task_utime(struct task_struct *p, u64 runtime) { clock_t utime = cputime_to_clock_t(p->utime), total = utime + cputime_to_clock_t(p->stime); - u64 temp; /* * Use CFS's precise accounting: */ - temp = (u64)nsec_to_clock_t(p->se.sum_exec_runtime); - if (total) { - temp *= utime; - do_div(temp, total); + runtime *= utime; + do_div(runtime, total); } - utime = (clock_t)temp; + return (clock_t)runtime; +} - return clock_t_to_cputime(utime); +static cputime_t task_utime(struct task_struct *p) +{ + u64 runtime = (u64)nsec_to_clock_t(p->se.sum_exec_runtime); + + return clock_t_to_cputime(__task_utime(p, runtime)); } static cputime_t task_stime(struct task_struct *p) { - clock_t stime; + u64 runtime = (u64)nsec_to_clock_t(p->se.sum_exec_runtime); /* * Use CFS's precise accounting. (we subtract utime from * the total, to make sure the total observed by userspace * grows monotonically - apps rely on that): */ - stime = nsec_to_clock_t(p->se.sum_exec_runtime) - - cputime_to_clock_t(task_utime(p)); - - return clock_t_to_cputime(stime); + return clock_t_to_cputime(runtime - __task_utime(p, runtime)); } #endif - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
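To make the arithmetic in the patch above easier to follow, here is a userspace sketch of the same split using plain integers in clock ticks. The function names are illustrative, and the cputime and clock_t conversions from the patch are left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale `runtime` by utime / (utime + stime), as __task_utime() does
 * in the patch. If both tick counters are zero, the whole runtime is
 * reported as user time.
 */
static uint64_t scaled_utime(uint64_t utime, uint64_t stime, uint64_t runtime)
{
	uint64_t total = utime + stime;

	if (total)
		runtime = runtime * utime / total;
	return runtime;
}

/*
 * System time is whatever remains of the same runtime snapshot, so
 * the two values always add up to `runtime`.
 */
static uint64_t scaled_stime(uint64_t utime, uint64_t stime, uint64_t runtime)
{
	uint64_t u = scaled_utime(utime, stime, runtime);

	assert(u <= runtime);
	return runtime - u;
}
```

Both functions take the runtime as an argument, which mirrors the point of the patch: the value is read once and reused, so the subtraction is done against the same snapshot that was used for the scaling.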
Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES
On Thu, Oct 04, 2007 at 07:17:50PM +0200, Peter Zijlstra wrote: > what happens if you up the stack limit to say 128M ? > > Also, do you happen to have execve syscall audit stuff enabled? Actually, you were right: not only is it enabled, it's also the culprit. If I stop it, all is well... Sorry for the noise. -- Mathieu Chouquet-Stringer [EMAIL PROTECTED] The sun itself sees not till heaven clears. -- William Shakespeare --
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 04 Oct 2007 14:25:22 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > From: Miklos Szeredi <[EMAIL PROTECTED]> > > By relying on the global dirty limits, this can cause a deadlock when > devices are stacked. > > If the stacking is done through a fuse filesystem, the __GFP_FS, > __GFP_IO tests won't help: the process doing the allocation doesn't > have any special flag. This description of the bug-which-is-being-fixed is nowhere near adequate for a reviewer to understand the problem. This makes it hard to suggest alternative fixes. > So why exactly does this function exist? That's described in the changelog for the patch which added throttle_vm_writeout(). Unsurprisingly ;) > Direct reclaim does not _increase_ the number of dirty pages in the > system, so rate limiting it seems somewhat pointless. > > There are two cases: > > 1) File backed pages -> file > > dirty + writeback count remains constant > > 2) Anonymous pages -> swap > > writeback count increases, dirty balancing will hold back file > writeback in favor of swap > > So the real question is: does case 2 need rate limiting, or is it OK > to let the device queue fill with swap pages as fast as possible? None of the above. [PATCH] vm: pageout throttling With silly pageout testcases it is possible to place huge amounts of memory under I/O. With a large request queue (CFQ uses 8192 requests) it is possible to place _all_ memory under I/O at the same time. This means that all memory is pinned and unreclaimable and the VM gets upset and goes oom. The patch limits the amount of memory which is under pageout writeout to be a little more than the amount of memory at which balance_dirty_pages() callers will synchronously throttle. This means that heavy pageout activity can starve heavy writeback activity completely, but heavy writeback activity will not cause starvation of pageout. Because we don't want a simple `dd' to be causing excessive latencies in page reclaim. afaict that problem is still there. It is possible to get all of ZONE_NORMAL dirty on a highmem machine. With a large queue (or lots of queues), vmscan can then place all of ZONE_NORMAL under IO. It could be that we've fixed this problem via other means in the interim, but from a quick peek it seems to me that the scanner will still do a 100% CPU burn when all of a zone's pages are under writeback. throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps fixing that would fix your deadlock. That's doubtful, but I don't know anything about your deadlock so I cannot say.
2.6.23-rc9-git2: Known regressions from 2.6.22
Hi, This message contains a list of some known regressions from 2.6.22 for which there are no fixes in the mainline that I know of. If any of them have been fixed already, please let me know. If you know of any other unresolved regressions from 2.6.22, please let me know either and I'll add them to the list. Subject:zd1211 device is no longer configured Submitter: Oliver Neukum <[EMAIL PROTECTED]> References: http://marc.info/?l=linux-usb-devel=118854967709322=2 http://bugzilla.kernel.org/show_bug.cgi?id=8972 Caused-By: Daniel Drake <[EMAIL PROTECTED]> commit 74553aedd46b3a2cae986f909cf2a3f99369decc Subject:Oops while modprobing phy fixed module Submitter: Gabriel C <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/7/14/63 http://bugzilla.kernel.org/show_bug.cgi?id=9060 Handled-By: Satyam Sharma <[EMAIL PROTECTED]> Vitaly Bordug <[EMAIL PROTECTED]> Tejun Heo <[EMAIL PROTECTED]> Patch: http://lkml.org/lkml/2007/7/18/506 Subject:ACPI problems: 2.6.22-git17 working, 2.6.23-rc1* is not Submitter: Danny ter Haar <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/7/27/298 http://lkml.org/lkml/2007/7/29/371 http://bugzilla.kernel.org/show_bug.cgi?id=9061 Handled-By: Len Brown <[EMAIL PROTECTED]> Subject:empty suspend stopped working around 2.6.23-rc4 Submitter: Pavel Machek <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/9/11/326 http://bugzilla.kernel.org/show_bug.cgi?id=9075 Subject:umount triggers a warning in jfs and takes almost a minute Submitter: Oliver Neukum <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/9/4/73 http://bugzilla.kernel.org/show_bug.cgi?id=9076 Handled-By: Dave Kleikamp <[EMAIL PROTECTED]> Patch: http://bugzilla.kernel.org/attachment.cgi?id=13023=view Subject:build #301 failed for 2.6.23-rc6-g0d4cbb5 in linux/drivers/net/wireless/libertas/ Submitter: Toralf Förster <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/9/11/150 http://bugzilla.kernel.org/show_bug.cgi?id=9077 Handled-By: Randy Dunlap <[EMAIL 
PROTECTED]> Patch: http://bugzilla.kernel.org/attachment.cgi?id=12963=view Subject:NETDEV WATCHDOG: eth0: transmit timed out Submitter: Karl Meyer <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/8/13/737 http://bugzilla.kernel.org/show_bug.cgi?id=9079 Handled-By: Francois Romieu <[EMAIL PROTECTED]> Subject:Weird network problems with 2.6.23-rc2 Submitter: Shish <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/8/11/40 http://bugzilla.kernel.org/show_bug.cgi?id=9080 Subject:powersaving degradation, (time spend in C0 goes up after a while) Submitter: Christian Leber <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/9/2/142 http://lkml.org/lkml/2007/9/2/207 http://bugzilla.kernel.org/show_bug.cgi?id=9081 Subject:vga text console not working on 2.6.23-rc8 Submitter: Santiago Garcia Mantinan <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/9/28/342 http://bugzilla.kernel.org/show_bug.cgi?id=9099 Handled-By: H. Peter Anvin <[EMAIL PROTECTED]> Antonino A. Daplas <[EMAIL PROTECTED]> Subject:kernel oops when unplugging usb mouse, sometimes hardlock when moving mouse Submitter: o. meijer <[EMAIL PROTECTED]> References: http://bugzilla.kernel.org/show_bug.cgi?id=9111 Handled-By: Dmitry Torokhov <[EMAIL PROTECTED]> Subject:2.6.23-rc9 boot failure (megaraid?) Submitter: Burton Windle <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/10/2/243 http://bugzilla.kernel.org/show_bug.cgi?id=9113 Handled-By: Adrian Bunk <[EMAIL PROTECTED]> FUJITA Tomonori <[EMAIL PROTECTED]> Caused-By: FUJITA Tomonori <[EMAIL PROTECTED]> commit 3f6270ef76f2ce5c134615a470685d6c2a66c07e [SCSI] megaraid_old: convert to use the data buffer accessors Patch: http://lkml.org/lkml/2007/10/4/294 Subject:kernel BUG at arch/i386/mm/highmem.c:15! 
on 2.6.23-rc8/rc9 Submitter: gurudas pai <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/10/4/61 http://bugzilla.kernel.org/show_bug.cgi?id=9122 Handled-By: Nick Piggin <[EMAIL PROTECTED]> Hugh Dickins <[EMAIL PROTECTED]> Patch: http://lkml.org/lkml/2007/10/4/256 Subject:2.6.23-rcX SG_GET_SCSI_ID regression? Submitter: Joerg Platte <[EMAIL PROTECTED]> References: http://lkml.org/lkml/2007/10/3/101 http://bugzilla.kernel.org/show_bug.cgi?id=9123 For details, please follow the links
Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES
On Thu, Oct 04, 2007 at 05:50:00PM -0400, Chuck Ebbert wrote: > On 10/04/2007 01:05 PM, Mathieu Chouquet-Stringer wrote: > > In the kernel source tree, if I run a stupid find | xargs ls, I now get > > this: > > xargs: ls: Argument list too long > > > > Can you strace it to see what syscall is failing? Sure: 25789 <... execve resumed> )= -1 E2BIG (Argument list too long) I'm going to reboot to a kernel that has Linus' printks... -- Mathieu Chouquet-Stringer [EMAIL PROTECTED] The sun itself sees not till heaven clears. -- William Shakespeare -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
pm qos infrastructure and interface
The following patch is a generalization of the latency.c implementation done by Arjan last year. It provides infrastructure for more than one parameter, and exposes a user mode interface for processes to register pm_qos expectations of processes. This interface provides a kernel and user mode interface for registering performance expectations by drivers, subsystems and user space applications on one of the parameters. Currently we have {cpu_dma_latency, network_latency, network_throughput} as the initial set of pm_qos parameters. The infrastructure exposes multiple misc device nodes one per implemented parameter. The set of parameters implement is defined by pm_qos_power_init() and pm_qos_params.h. This is done because having the available parameters being runtime configurable or changeable from a driver was seen as too easy to abuse. For each parameter a list of performance requirements is maintained along with an aggregated target value. The aggregated target value is updated with changes to the requirement list or elements of the list. Typically the aggregated target value is simply the max or min of the requirement values held in the parameter list elements. >From kernel mode the use of this interface is simple: pm_qos_add_requirement(param_id, name, target_value): Will insert a named element in the list for that identified PM_QOS parameter with the target value. Upon change to this list the new target is recomputed and any registered notifiers are called only if the target value is now different. pm_qos_update_requirement(param_id, name, new_target_value): Will search the list identified by the param_id for the named list element and then update its target value, calling the notification tree if the aggregated target is changed. with that name is already registered. 
pm_qos_remove_requirement(param_id, name): Will search the identified list for the named element and remove it. After removal it will update the aggregate target and call the notification tree if the target was changed as a result of removing the named requirement. From user mode: Only processes can register a pm_qos requirement. To provide for automatic cleanup for a process, the interface requires the process to register its parameter requirements in the following way: To register the default pm_qos target for the specific parameter, the process must open one of /dev/[cpu_dma_latency, network_latency, network_throughput]. As long as the device node is held open that process has a registered requirement on the parameter. The name of the requirement is "process_" derived from the current->pid from within the open system call. To change the requested target value the process needs to write a s32 value to the open device node. This translates to a pm_qos_update_requirement call. To remove the user mode request for a target value simply close the device node. --mgross Signed-off-by: mark gross <[EMAIL PROTECTED]> --- diff -urN -X linux-2.6.23-rc8/Documentation/dontdiff linux-2.6.23-rc8/Documentation/pm_qos_interface.txt linux-2.6.23-rc8-qos/Documentation/pm_qos_interface.txt --- linux-2.6.23-rc8/Documentation/pm_qos_interface.txt 1969-12-31 16:00:00.0 -0800 +++ linux-2.6.23-rc8-qos/Documentation/pm_qos_interface.txt 2007-10-04 14:26:58.0 -0700 @@ -0,0 +1,59 @@ +PM quality of Service interface. + +This interface provides a kernel and user mode interface for registering +performance expectations by drivers, subsystems and user space applications on +one of the parameters. + +Currently we have {cpu_dma_latency, network_latency, network_throughput} as the +initial set of pm_qos parameters. + +The infrastructure exposes multiple misc device nodes one per implemented +parameter. The set of parameters implement is defined by pm_qos_power_init() +and pm_qos_params.h. 
This is done because having the available parameters +being runtime configurable or changeable from a driver was seen as too easy to +abuse. + +For each parameter a list of performance requirements is maintained along with +an aggregated target value. The aggregated target value is updated with +changes to the requirement list or elements of the list. Typically the +aggregated target value is simply the max or min of the requirement values held +in the parameter list elements. + +From kernel mode the use of this interface is simple: +pm_qos_add_requirement(param_id, name, target_value): +Will insert a named element in the list for that identified PM_QOS parameter +with the target value. Upon change to this list the new target is recomputed +and any registered notifiers are called only if the target value is now +different. + +pm_qos_update_requirement(param_id, name, new_target_value): +Will search the list identified by the param_id for the named list element and +then update its target value, calling the notification tree if the aggregated +target is changed. with that name is already registered. +
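For illustration only (this is not part of the patch): a minimal user-space sketch of the open / write / close sequence the description gives for the user mode interface. The helper names pm_qos_open(), pm_qos_set_target() and pm_qos_release() are hypothetical, O_RDWR is an assumed open mode, and the unit of the target value is not stated in the description, so none is implied here.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: open one of the pm_qos device nodes, e.g.
 * "/dev/cpu_dma_latency". Per the description, the requirement stays
 * registered for as long as the returned descriptor is held open.
 * Returns the descriptor, or -1 on failure.
 */
int pm_qos_open(const char *node_path)
{
    return open(node_path, O_RDWR); /* open mode is an assumption */
}

/*
 * Hypothetical helper: change the requested target by writing one
 * s32 value to the open node. Returns 0 on success, -1 on failure.
 */
int pm_qos_set_target(int fd, int32_t target)
{
    ssize_t n = write(fd, &target, sizeof(target));

    return n == (ssize_t)sizeof(target) ? 0 : -1;
}

/*
 * Hypothetical helper: closing the node removes the user mode request.
 */
int pm_qos_release(int fd)
{
    return close(fd);
}
```

A caller would open the node once, write a new value whenever its requirement changes, and close the descriptor (or simply exit) to drop the request, which is what makes cleanup automatic when a process dies.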
Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES
On 10/04/2007 01:05 PM, Mathieu Chouquet-Stringer wrote: > In the kernel source tree, if I run a stupid find | xargs ls, I now get > this: > xargs: ls: Argument list too long > Can you strace it to see what syscall is failing?
Re: SLUB performance regression vs SLAB
On 10/04/2007 05:11 PM, David Miller wrote: > From: Chuck Ebbert <[EMAIL PROTECTED]> > Date: Thu, 04 Oct 2007 17:02:17 -0400 > >> How do you simulate reading 100TB of data spread across 3000 disks, >> selecting 10% of it using some criterion, then sorting and >> summarizing the result? > > You repeatedly read zeros from a smaller disk into the same amount of > memory, and sort that as if it were real data instead. You've just replaced 3000 concurrent streams of data with a single stream. That won't test the memory allocator's ability to allocate memory to many concurrent users very well.
Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES
On Thu, Oct 04, 2007 at 07:17:50PM +0200, Peter Zijlstra wrote: > /me tries > > yep works like a charm, and that is a tree with a full git repo and > several build dirs in it. Well, what can I say? ;-) > what happens if you up the stack limit to say 128M ? It's unlimited. > Also, do you happen to have execve syscall audit stuff enabled? Nope. -- Mathieu Chouquet-Stringer [EMAIL PROTECTED] The sun itself sees not till heaven clears. -- William Shakespeare --
Re: 2.6.23-rc9: Oops in cache_alloc_refill() mm/slab.c
On Thu, 2007-10-04 at 18:13 +0200, Valerie Clement wrote: > While running ffsb tests on my ext4 filesystem, I got an Oops in > cache_alloc_refill(). > I turned on SLAB debugging and here is the message I got: > > slab: Internal list corruption detected in cache 'buffer_head'(30), > slabp 81007e100100(1515870810). Hexdump: slabp->inuse = 1515870810 looks bogus. Is this easily reproducible ? What tests are you running through ffsb ? > 000: 5a 5a 5a 5a 5a 5a 5a 5a b8 23 34 7e 00 81 ff ff > 010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 020: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 030: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 040: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 050: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 060: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a a5 > 070: c0 88 56 63 c5 56 41 d8 f1 37 4a 80 ff ff ff ff > 080: c0 88 56 63 c5 56 41 d8 80 33 53 7d 00 81 ff ff > 090: e8 25 60 7d 00 81 ff ff 68 cb 3b 01 00 81 ff ff > 0a0: 18 68 50 7d 00 81 ff ff > [ cut here ] > kernel BUG at /home/clementv/src/linux-2.6.23-rc9/mm/slab.c:2923! 
> invalid opcode: [1] SMP > CPU 2 > Modules linked in: qla2xxx > Pid: 4041, comm: ffsb Not tainted 2.6.23-rc9 #2 > RIP: 0010:[] [] check_slabp+0xb5/0xc1 > RSP: 0018:8100774bb958 EFLAGS: 00010096 > RAX: 0001 RBX: 81007e100100 RCX: 6d20 > RDX: RSI: 0046 RDI: 81007e347280 > RBP: 00a8 R08: 0005 R09: 8060bb10 > R10: 000ae468 R11: 00050002 R12: 00a8 > R13: 81007e347280 R14: 81007e347280 R15: 0002 > FS: 41802950(0063) GS:81007e0c4728() knlGS: > CS: 0010 DS: ES: CR0: 8005003b > CR2: 5f83d00c CR3: 78149000 CR4: 06e0 > DR0: DR1: DR2: > DR3: DR6: 0ff0 DR7: 0400 > Process ffsb (pid: 4041, threadinfo 8100774ba000, task 81007dbdc7a0) > Stack: 000d 000e 81007e100100 81007e342398 > 81007e078488 80277069 8050 81007e347280 > 8050 0246 80299539 f000 > Call Trace: > [] cache_alloc_refill+0xc8/0x23f > [] alloc_buffer_head+0x14/0x45 > [] kmem_cache_alloc+0x94/0xe9 > [] alloc_buffer_head+0x14/0x45 > [] alloc_page_buffers+0x38/0xd5 > [] create_empty_buffers+0x14/0x9b > [] __block_prepare_write+0x7c/0x45b > [] ext4_get_block+0x0/0x139 > [] block_prepare_write+0x1a/0x25 > [] ext4_prepare_write+0xaf/0x175 > [] generic_file_buffered_write+0x288/0x631 > [] __generic_file_aio_write_nolock+0x33f/0x3a9 > [] enqueue_entity+0x17c/0x1a3 > [] generic_file_aio_write+0x61/0xc1 > [] __check_preempt_curr_fair+0x56/0x76 > [] ext4_file_write+0x16/0x91 > [] do_sync_write+0xc9/0x10c > [] file_move+0x1d/0x4c > [] autoremove_wake_function+0x0/0x2e > [] do_filp_open+0x2a/0x38 > [] poison_obj+0x26/0x30 > [] vfs_write+0xad/0x136 > [] sys_write+0x45/0x6e > [] system_call+0x7e/0x83 > > > Valérie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES
Thank you for getting back to me.

On Thu, Oct 04, 2007 at 10:27:52AM -0700, Linus Torvalds wrote:
> What does your "ulimit -s" say?

That's actually the first thing I checked.

mchouque - /usr/src/kernel/linux %ulimit -s
unlimited

And for the record, ulimit -a yields:

-t: cpu time (seconds)         unlimited
-f: file size (blocks)         unlimited
-d: data seg size (kbytes)     unlimited
-s: stack size (kbytes)        unlimited
-c: core file size (blocks)    0
-m: resident set size (kbytes) unlimited
-u: processes                  16375
-n: file descriptors           1024
-l: locked-in-memory size (kb) 32
-v: address space (kb)         unlimited
-x: file locks                 unlimited
-i: pending signals            16375
-q: bytes in POSIX msg queues  819200
-N 13:                         0
-N 14:                         0

> I suspect that you might hit the code that limits execve() arguments to
> one quarter of the maximum stack size.
>
> We could change that from 25% to something else (half? three quarters?),
> but if you really are hitting that limit, it sounds like you may have a
> really small stack size to begin with (ie if 25% is smaller than the old
> argument size limit of 128kB, you're running with a stack limit of less
> than half a meg, which sounds pretty dang small).
>
> So I'd like to verify that the stack limit really is the issue, and not
> something else.

Anything else you'd like me to try?

--
Mathieu Chouquet-Stringer [EMAIL PROTECTED]
The sun itself sees not till heaven clears. -- William Shakespeare --
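To make the arithmetic in the quoted explanation concrete, here is a small sketch of the "one quarter of the stack limit" rule as stated in the message. exec_arg_limit() is a hypothetical user-space helper, not kernel code, and treating an unlimited stack as "no cap from this rule" is an assumption made only for this illustration. Note that "ulimit -s" reports kbytes while rlimit values are in bytes.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: given a stack limit in bytes, return the size
 * that the quarter-of-the-stack rule from the message would allow for
 * execve() arguments.
 */
rlim_t exec_arg_limit(rlim_t stack_limit)
{
    /* Assumption for this sketch only: unlimited stack, no cap. */
    if (stack_limit == RLIM_INFINITY)
        return RLIM_INFINITY;

    return stack_limit / 4;
}
```

With a 512 kB stack limit this gives 128 kB, which matches the remark that the new rule only drops below the old 128kB argument limit when the stack limit is under half a megabyte.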
Re: [RFC/PATCH -v2] Add sysfs control to modify a user's cpu share
On Thu, 04 Oct 2007 10:54:51 +0200, Heiko Carstens said: > > echo 2048 > /sys/kernel/uids/500/cpu_share > > > > this should just work too, regardless of there not being any UID 500 > > tasks yet. Likewise, once configured, the /sys/kernel/uids/* directories > > (with the settings in them) should probably not go away either. > > Shouldn't that be done via uevents? E.g. UID x gets added to the sysfs tree, > generates a uevent and a script then figures out the cpu_share and sets it. That would tend to be a tad racy - a site may want to set limits in the hypothetical /sys/kernel/uids/NNN before the program has a chance to fork-bomb or otherwise make it difficult to set a limit from within another userspace process. It's similar to why you want a process to be launched with all its ulimits set, rather than set them after the fork/exec happens...
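The ulimit analogy can be sketched in user space as follows. This illustrates only the analogy (limits applied in the child before exec, so the new program never runs without them), not the proposed sysfs cpu_share interface. run_with_nproc_limit() is a hypothetical helper, and the choice of RLIMIT_NPROC and the exit codes 126/127 are assumptions for the example.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with its process-count limit already
 * in place when the program starts. Returns the program's exit status,
 * 126 if the limit could not be set, 127 if exec failed, -1 otherwise.
 */
int run_with_nproc_limit(char *const argv[], rlim_t max_procs)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NPROC, &rl) != 0)
            _exit(126);
        /* Only lower the hard limit; an unprivileged process cannot raise it. */
        if (max_procs < rl.rlim_max)
            rl.rlim_max = max_procs;
        rl.rlim_cur = rl.rlim_max;
        /* The limit is set before exec, so there is no unlimited window. */
        if (setrlimit(RLIMIT_NPROC, &rl) != 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Setting the limit from the parent after the child has already exec'd would leave a window in which the child could fork freely, which is the race being described.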
Re: [PATCH 4/5] writeback: remove pages_skipped accounting in __block_write_full_page()
On Tue, 02 Oct 2007 16:41:47 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote: > This patch fixes this bug. Though I'm not sure why __block_write_full_page() > is called only to do nothing and who actually issued the writeback for us. kjournald wrote the page's buffers back (ext3 in ordered-data mode). The VM didn't know about that, so we have a PageDirty page which has clean buffers. We rely upon the VFS writeback code to "discover" that this dirty page has clean buffers: the VFS will attempt to write the dirty page and will end up marking the page clean without performing any IO.
Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES
On Thu, 4 Oct 2007, Mathieu Chouquet-Stringer wrote: > > Anything else you'd like me to try? Well, since others definitely don't see this, including me, and I can do things like 62MB exec arrays: [EMAIL PROTECTED] linux]$ echo $(find /home/torvalds/) | wc 1 883304 63000962 without getting any overflows (much less just on the kernel sources, which is less than a megabyte of pathnames), I think it would be good if you were to just instrument the kernel and make it do a "printk()" when it returns E2BIG in fs/exec.c (or the NULL returns from get_arg_page()). Just to figure out *which* test fails for you but apparently nobody else. Linus
Re: [patch 1/3] Trace code and documentation
On Thu, Oct 04, 2007 at 12:19:35PM -0700, David Wilder wrote: > Andi Kleen wrote: > >"David J. Wilder" <[EMAIL PROTECTED]> writes: > >>@@ -0,0 +1,160 @@ > >>+Trace Setup and Control > >>+=== > >>+In the kernel, the trace interface provides a simple mechanism for > >>+starting and managing data channels (traces) to user space. > > > >Wasn't relayfs supposed to do that already? Why do you need another > >wrapper around it? > > The code in trace is exactly what all the current users of relay do. > Therefore trace reduces the duplication of code. If everybody does this then the code should be just put into relayfs? > > > > > >Is this also really still faster than a printk below log level > >(without console driver overhead). If not then why not just > >use printk? > > Are you arguing against relayfs or trace? Trace just makes relayfs > easier to use. I think relayfs can stand up for itself. I'm arguing against complicated trace mechanisms that are not fast. At some point when I looked at relayfs it seemed to be reasonably fast (per cpu buffers; not much locking, overhead per call roughly like putchar()), but that might have regressed. Your example module with its lock definitely looks very slow and I don't approve of it. > > > The example shows a way to create an ASCII data layer. ASCII layers don't make much sense imho -- these should just use printk. Fast dedicated binary log channels make sense though; but you don't seem really to be very concentrated on that. > True, to make trace "fast" you need a data layer that can handle the > requirements of per-cpu buffers. However there are still advantages of > trace over printk even when using global buffers: selectable buffer > sizes, printk has selectable buffer sizes too. >"Long term we probably want more complex tracing based on lttng, > but I'm a big fan of starting out simple and doing incremental > changes." It's just that relayfs + another not simple layer are definitely not simple. 
For a simple logger I'm thinking more like something like SGI's old ktrace module (which undoubtedly many other people have recreated many times for specific debugging scenarios). But that all only makes sense if the overhead is really kept low and I don't see that in your approach. > One advantage of the trace approach is separating control and data > layers, therefore trace can support multiple data layers to fit multiple > requirements. > > I have my ideas on how to develop data layer, others may have their own > ideas and I welcome the input. relayfs was supposed to be that data layer. > PS: Systemtap has been criticized for introducing out-of-tree kernel > code. A clear direction from the community is to move re-usable code > in-tree where it can be maintained. Trace is a move in that direction. I'm all for that. I believe a simple fast efficient no frills logger would serve systemtap just fine too. But the approach here seems to be more to add all kinds of knobs and whizzles until you end up with something as slow as printk. And since we already have printk another one just doesn't seem to make much sense. -Andi