Re: [PATCH] update checkpatch.pl to version 0.10

2007-10-04 Thread Ingo Molnar

here is an update wrt. the latest checkpatch.pl-next version 
(v11-to-be), about kernel/sched.c warnings:

>   size   version               # warnings
>
>   25383  checkpatch.pl.v6      5
>   26038  checkpatch.pl.v7      6
>   29603  checkpatch.pl.v8      65
>   31160  checkpatch.pl.v9      24
>   34950  checkpatch.pl.v10     28

    35948  checkpatch.pl.v11pre  11

so things are heading in the right direction :)

of those 11 warnings, 6 are correct warnings (4 will be solved via 
KERN_CONT, 1 will be solved via a proper include file, and 1 is an 
overlength line), 4 are borderline warnings (easily fixed) and only one 
is a false positive! So v11-to-be gets the "best checkpatch.pl ever" 
badge from me :)

The false positive is:

  ERROR: need consistent spacing around '*' (ctx:WxV)
  #5322: 
  +static ctl_table *sd_alloc_ctl_cpu_table(int cpu)
                      ^

i think checkpatch.pl mistook this function definition as an arithmetic 
expression?

But, there's a cleanliness bug underlying this false positive: 
'ctl_table' is a typedef, and it would be cleaner to use 'struct 
ctl_table' throughout the kernel. When running checkpatch.pl over 
include/linux/sysctl.h, it warns about the typedef:

  WARNING: do not add new typedefs
  #944: 
  +typedef struct ctl_table ctl_table;

(but mistaking that function for an arithmetic expression is still a bug 
i think.)
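
To illustrate the cleanup suggested above, here is a minimal userspace
sketch, not the kernel code: the struct fields and the helper name
alloc_ctl_table_demo are made up for illustration. Spelling out
'struct ctl_table' in the return type leaves no bare identifier in front
of the '*' that a tool would first have to recognize as a type name.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct; fields are illustrative. */
struct ctl_table {
	const char *procname;
	int mode;
};

/* The kind of typedef that checkpatch.pl warns about when newly added. */
typedef struct ctl_table ctl_table;

/* Preferred spelling: 'struct ctl_table *' is unambiguous as a type. */
static struct ctl_table *alloc_ctl_table_demo(const char *name)
{
	struct ctl_table *t = calloc(1, sizeof(*t));

	if (t) {
		t->procname = name;
		t->mode = 0644;
	}
	return t;
}
```

The caller owns the returned table and releases it with free().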

nice work Andy!

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] Make tasks always have non-zero pids

2007-10-04 Thread sukadev
Pavel Emelianov [EMAIL PROTECTED] wrote:
| Some time ago Sukadev noticed that the vmlinux size has

Cedric pointed it out to me first :-)

| grown 5Kb due to merged pid namespaces. One of the big
| problems with it was fat inline functions. The other thing
| was noticed by Matt - the checks for task's pid to be not
| NULL take place and make the kernel grow due to inlining,
| but these checks are not always needed.
| 
| In this series I introduce a static pid (dummy), according
| to Matt's proposal, which is assigned to tasks during the 
| detach_pid and transfer_pid instead of NULL. This pid lives 
| in the init pid namespace and has the id = 0, so all the 
| task_xid_xnr() calls will still return 0 on a dead task.
| 
| Places that get the struct pid from task either get it from
| the current (in this case they will never get this dummy),
| or use it to compare with some other value (so they will 
| work the same for both NULL and dummy pids).
| 
| This saves up to 340 bytes for i386 kernel with minimal 
| config  and probably more with more users of pids.
| 
| Tested on i386 and x86_64 boxes. Tasks still live and die,
| namespaces and proc still work.
| 
| Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

Acked-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
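
The dummy-pid idea described in the quoted text can be sketched in
userspace C as follows. This is only an illustration under simplified
assumptions: the types and the names pid_demo, task_demo, detach_pid_demo
and task_pid_nr_demo are hypothetical and do not match the kernel's
structures. The point is that a static sentinel with id 0 lets readers
drop the NULL check while a dead task still reports 0.

```c
/* Hypothetical, simplified types; the real kernel structures differ. */
struct pid_demo {
	int nr;			/* numeric id as seen in the init namespace */
};

struct task_demo {
	struct pid_demo *pid;
};

/* Static sentinel with id 0, used in place of NULL for detached tasks. */
static struct pid_demo dummy_pid = { .nr = 0 };

/* Detaching points the task at the sentinel instead of clearing it. */
static void detach_pid_demo(struct task_demo *t)
{
	t->pid = &dummy_pid;
}

/* No NULL check needed: a detached task simply reports id 0. */
static int task_pid_nr_demo(const struct task_demo *t)
{
	return t->pid->nr;
}
```

Code that only compares the pointer with some other pid keeps working,
since the sentinel never equals a live pid.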


Re: [PATCH 1/2] add tunable_notifier function

2007-10-04 Thread Takenori Nagano
Vivek Goyal wrote:
> On Thu, Oct 04, 2007 at 08:38:34PM +0900, Takenori Nagano wrote:
>> This patch adds new notifier function tunable_notifier_chain. Its base is
>> atomic_notifier_chain.
>>
>> Thanks,
>>
>> ---
>>
>> Signed-off-by: Takenori Nagano <[EMAIL PROTECTED]>
>>
>> ---
>> diff -uprN linux-2.6.23-rc9.orig/include/linux/notifier.h
>> linux-2.6.23-rc9/include/linux/notifier.h
>> --- linux-2.6.23-rc9.orig/include/linux/notifier.h   2007-10-02 
>> 12:24:52.0
>> +0900
>> +++ linux-2.6.23-rc9/include/linux/notifier.h2007-10-03 
>> 14:48:04.28800 +0900
>> @@ -13,6 +13,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  /*
>>   * Notifier chains are of four types:
>> @@ -53,6 +54,14 @@ struct notifier_block {
>>  int priority;
>>  };
>>
>> +struct tunable_notifier_block {
>> +struct notifier_block *nb;
>> +struct tunable_notifier_head *head;
>> +struct dentry *dir;
>> +struct dentry *pri_dentry;
>> +struct dentry *desc_dentry;
>> +};
>> +
> 
> Should this be tunable_atomic_notifier_block? I think there are two kinds
> of lists: one where handlers have to be atomic and another where handlers
> can be blocking. I think you are making the atomic one tunable. If that's
> the case, it should be reflected in the naming everywhere.

Hi Vivek,

Yes, it is based on atomic_notifier_list. I think your suggestion is reasonable.
I'll rename tunable_notifier to tunable_atomic_notifier.

Thanks,

Takenori Nagano <[EMAIL PROTECTED]>


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-04 Thread Kyle Moffett

On Oct 05, 2007, at 00:45:17, Eric W. Biederman wrote:
> Kyle Moffett <[EMAIL PROTECTED]> writes:
>> On Oct 04, 2007, at 21:44:02, Eric W. Biederman wrote:
>>> SElinux is not all encompassing or it is generally
>>> incomprehensible I don't know which.  Or someone long ago would
>>> have said a better way to implement containers was with a
>>> selinux ruleset, here is a selinux ruleset that does that.
>>> Although it is completely possible to implement all of the
>>> isolation with the existing LSM hooks as Serge showed.
>>
>> The difference between SELinux and containers is that SELinux (and
>> LSM as a whole) returns -EPERM to operations outside the scope of
>> the subject, whereas containers return -ENOENT (because it's not
>> even in the same namespace).
>
> Yes.  However if you look at what the first implementations were.
> Especially something like linux-vserver.  All they provided was
> isolation.  So perhaps you would not see every process ps but they
> all had unique pid values.
>
> I'm pretty certain Serge at least prototyped a simplified version
> of that using the LSM hooks.  Is there something I'm not remembering
> in those hooks that allows hiding of information like processes?
>
> Yes. Currently with containers we are taking that one step farther
> as that solves a wider set of problems.


IMHO, containers have a subtly different purpose from LSM even though  
both are about information hiding.  Basically a container is  
information hiding primarily for administrative reasons; either as a  
convenience to help prevent errors or as a way of describing  
administrative boundaries.  For example, even in an environment where  
all sysadmins are trusted employees, a few head-honcho sysadmins  
would get root container access, and all others would get access to  
specific containers as a way of preventing "oops" errors.  Basically  
a container is about "full access inside this box and no access  
outside".


By contrast, LSM is more strictly about providing *limited* access to  
resources.  For an accounting business all client records would be  
grouped and associated together, however those which have passed this  
year's review are read-only except by specific staff and others may  
have information restricted to some subset of the employees.


So containers are exclusive subsets of "the system" while LSM should  
be about non-exclusive information restriction.
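
The -EPERM versus -ENOENT distinction quoted earlier in this message can
be sketched as below. This is a userspace illustration under made-up
assumptions: object_demo, container_access and lsm_access are
hypothetical names, and namespaces and labels are reduced to plain
integers.

```c
#include <errno.h>

/* Hypothetical object: lives in one namespace, carries one label. */
struct object_demo {
	int ns_id;
	int label;
};

/* Container-style lookup: objects in other namespaces do not exist. */
static int container_access(int subject_ns, const struct object_demo *obj)
{
	return obj->ns_id == subject_ns ? 0 : -ENOENT;
}

/* LSM-style check: the object is visible, but access may be denied. */
static int lsm_access(int subject_label, const struct object_demo *obj)
{
	return obj->label == subject_label ? 0 : -EPERM;
}
```

The two checks answer different questions, so the same object can be
invisible to one subject and visible but forbidden to another.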



>>> We also have in the kernel another parallel security mechanism
>>> (for what is generally a different class of operations) that has
>>> been quite successful, and different groups get along quite
>>> well, and ordinary mortals can understand it.  The linux
>>> firewalling code.
>>
>> Well, I wouldn't go so far as the "ordinary mortals can understand
>> it" part; it's still pretty high on the obtuse-o-meter.
>
> True.  Probably a more accurate statement is: unix command line
> power users can and do handle it after reading the docs.  That's
> not quite ordinary mortals but it feels like it some days.  It
> might all be perception...


I have seen more *wrong* iptables firewalls than I've seen correct  
ones.  Securing TCP/IP traffic properly requires either a lot of  
training/experience or a good out-of-the-box system like Shorewall  
which structures the necessary restrictions for you based on an  
abstract description of the desired functionality.  For instance what  
percentage of admins do you think could correctly set up their  
netfilter firewalls to log christmas-tree packets, smurfs, etc  
without the help of some external tool?  Hell, I don't trust myself  
to reliably do it without a lot of reading of docs and testing, and  
I've been doing netfilter firewalls for a while.


The bottom line is that with iptables it is *CRITICAL* to have a good  
set of interface tools to take the users' "My system is set up  
like..." description in some form and turn it into the necessary set  
of efficient security rules.  The *exact* same issue applies to  
SELinux, with 2 major additional problems:


1)  Half the tools are still somewhat beta-ish and under heavy  
development.  Furthermore the semi-official reference policy is  
nowhere near comprehensive and pretty ugly to read (go back to the  
point about the tools being beta-ish).


2)  If you break your system description or translation tools then  
instead of just your network dying your entire *system* dies.



>>> The linux firewalling code has hooks all throughout the
>>> networking stack, just like the LSM has hooks all throughout the
>>> rest of linux kernel.  There is a difference however.  The linux
>>> firewalling code in addition to hooks has tables behind those
>>> hooks that it consults. There is generic code to walk those
>>> tables and consult with different kernel modules to decide if we
>>> should drop a packet.  Each of those kernel modules provides a
>>> different capability that can be used to generate a firewall.
>>
>> This is almost *EXACTLY* what SELinux provides as an LSM module.
>> The one difference is that with 

Re: [PATCH 2/2] implement new notifier function to panic_notifier_list

2007-10-04 Thread Vivek Goyal
On Thu, Oct 04, 2007 at 08:38:50PM +0900, Takenori Nagano wrote:
> This patch implements the new notifier function for panic_notifier_list. We
> can change the order of the list through debugfs.
> 
> Thanks,
> 
> ---
> 
> Signed-off-by: Takenori Nagano <[EMAIL PROTECTED]>
> 
> ---
> diff -uprN linux-2.6.23-rc9.orig/arch/alpha/kernel/setup.c
> linux-2.6.23-rc9/arch/alpha/kernel/setup.c
> --- linux-2.6.23-rc9.orig/arch/alpha/kernel/setup.c   2007-10-02
> 12:24:52.0 +0900
> +++ linux-2.6.23-rc9/arch/alpha/kernel/setup.c2007-10-04 
> 09:49:34.44000 +0900
> @@ -45,14 +45,22 @@
>  #include 
>  #include 
> 
> -extern struct atomic_notifier_head panic_notifier_list;
> +extern struct tunable_notifier_head panic_notifier_list;
>  static int alpha_panic_event(struct notifier_block *, unsigned long, void *);
> -static struct notifier_block alpha_panic_block = {
> +static struct notifier_block alpha_panic_block_base = {
>   alpha_panic_event,
>  NULL,
>  INT_MAX /* try to do it first */
>  };
> 
> +static struct tunable_notifier_block alpha_panic_block = {
> + _panic_block_base,
> + NULL,
> + NULL,
> + NULL,
> + NULL
> +};
> +
>  #include 
>  #include 
>  #include 
> @@ -522,8 +530,8 @@ setup_arch(char **cmdline_p)
>   }
> 
>   /* Register a call for panic conditions. */
> - atomic_notifier_chain_register(_notifier_list,
> - _panic_block);
> + tunable_notifier_chain_register(_notifier_list,
> + _panic_block, "alpha_panic", NULL);
> 

I think it might be a good idea to also make provision for a help string.
This help string would tell the admin what a registered user does. Ideally
it should be visible in the /sys/kernel/debug//description file.

This kind of description can help the admin decide the priority among the
various registered users without having to look at the source code.

Thanks
Vivek


Re: [PATCH 0/2] add new notifier function

2007-10-04 Thread Takenori Nagano
Vivek Goyal wrote:
> On Thu, Oct 04, 2007 at 08:38:05PM +0900, Takenori Nagano wrote:
>
> In summary, right now co-existence of kdb with kdump seems to be your pain
> point. I would prefer that kdb just puts a break point on panic() and we move
> on. If there are more candidates down the line and these can't be easily
> executed in second kernel then we can re-visit this notification list
> mechanism.

Hi Vivek,

Thank you for your comment. :-)

I'm not concerned about the kdb and kdump problem for now, because my patches
are not merged into the mainline kernel yet. Once they are merged, I will think
about how we can resolve the RAS tools problem.

>> # ls
>> ipmi_msghandler  ipmi_wdog
>> # cat ipmi_msghandler/priority
>> 200
>> # cat ipmi_wdog/priority
>> 150
>> #
>> Kernel panic - not syncing: panic
>> ipmi_msghandler : notifier calls panic_event().
>> ipmi_watchdog : notifier calls wdog_panic_handler().
>>
>> .(reboot)
>>
> 
> We also need to implement a file which can give a consolidated view. All
> the registered members and their priority.

I tried to implement it, but the change is large, and we can already get all
priority values using "ls" and "cat */priority". I'll implement it if users
strongly want it.

ex)
# cd panic_notifier_list
# ls
ipmi_msghandler  ipmi_wdog
# cat */priority
200
150
#
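
As a rough illustration of what tuning a priority implies for the call
order, here is a userspace sketch of a chain kept sorted by descending
priority, so that the entry with priority 200 runs before the one with
150 as in the panic output above. How the patch itself re-sorts the list
is not shown in this thread; nb_demo and the three helper functions are
hypothetical names, and re-inserting on a priority change is an
assumption.

```c
#include <stddef.h>

/* Hypothetical, simplified notifier entry; not the kernel's structure. */
struct nb_demo {
	const char *name;
	int priority;
	struct nb_demo *next;
};

/* Insert so that higher priority entries come first in the chain. */
static void chain_register_demo(struct nb_demo **head, struct nb_demo *n)
{
	while (*head && (*head)->priority >= n->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Remove an entry; returns 0 on success, -1 if it was not registered. */
static int chain_unregister_demo(struct nb_demo **head, struct nb_demo *n)
{
	for (; *head; head = &(*head)->next) {
		if (*head == n) {
			*head = n->next;
			n->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* Changing a priority means unlinking and re-inserting the entry. */
static int chain_set_priority_demo(struct nb_demo **head, struct nb_demo *n,
				   int priority)
{
	if (chain_unregister_demo(head, n))
		return -1;
	n->priority = priority;
	chain_register_demo(head, n);
	return 0;
}
```

A real implementation would also need locking around the list update,
which this sketch leaves out.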

Thanks,

Takenori Nagano <[EMAIL PROTECTED]>


Re: [PATCH 1/2] add tunable_notifier function

2007-10-04 Thread Vivek Goyal
On Thu, Oct 04, 2007 at 08:38:34PM +0900, Takenori Nagano wrote:
> This patch adds new notifier function tunable_notifier_chain. Its base is
> atomic_notifier_chain.
> 
> Thanks,
> 
> ---
> 
> Signed-off-by: Takenori Nagano <[EMAIL PROTECTED]>
> 
> ---
> diff -uprN linux-2.6.23-rc9.orig/include/linux/notifier.h
> linux-2.6.23-rc9/include/linux/notifier.h
> --- linux-2.6.23-rc9.orig/include/linux/notifier.h2007-10-02 
> 12:24:52.0
> +0900
> +++ linux-2.6.23-rc9/include/linux/notifier.h 2007-10-03 14:48:04.28800 
> +0900
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  /*
>   * Notifier chains are of four types:
> @@ -53,6 +54,14 @@ struct notifier_block {
>   int priority;
>  };
> 
> +struct tunable_notifier_block {
> + struct notifier_block *nb;
> + struct tunable_notifier_head *head;
> + struct dentry *dir;
> + struct dentry *pri_dentry;
> + struct dentry *desc_dentry;
> +};
> +

Should this be tunable_atomic_notifier_block? I think there are two kinds
of lists: one where handlers have to be atomic and another where handlers
can be blocking. I think you are making the atomic one tunable. If that's
the case, it should be reflected in the naming everywhere.

Thanks
Vivek


Re: Linux 2.6.23-rc9 and a heads-up for the 2.6.24 series..

2007-10-04 Thread Ingo Molnar

* Glauber de Oliveira Costa <[EMAIL PROTECTED]> wrote:

> On 10/2/07, Alistair John Strachan <[EMAIL PROTECTED]> wrote:
> > This is certainly a tool issue, but if I use Debian's kernel-image 
> > "make-kpkg"
> > wrapper around the kernel build system, it fails with:
> >
> > cp: cannot stat `arch/x86_64/boot/bzImage': No such file or directory
> >
> > Obviously, this file has moved to arch/x86/boot, but it seems like possibly
> > unnecessary breakage. I've been copying bzImage for years from
> > arch/x86_64/boot, and I'm sure there's a handful of scripts (other than
> > Debian's kernel-image) doing this too.
> 
> I believe most sane tools would be using the output of uname -m, so a 
> possible way to fix this would be fixing the data passed to userspace 
> from uname. However, that might create a new set of problems too, 
> with tools relying on the output of uname -m to determine whether 
> the machine is 32 or 64 bit, and so on.

there are two problems with the use of uname -m:

- the build machine architecture is not necessarily the same as the
  target architecture. (for example i cross-compile all my 32-bit
  kernels on a 64-bit box.)

- we kept uname -m compatible. multilib depends on it, and other pieces
  of userspace as well. So uname -m still outputs 'i386' on 32-bit and
  'x86_64' on 64-bit - not 'x86'.

a symlink looks like the best solution to me.

Ingo


Re: Linux 2.6.23-rc9 and a heads-up for the 2.6.24 series..

2007-10-04 Thread Ingo Molnar

* Alistair John Strachan <[EMAIL PROTECTED]> wrote:

> On Tuesday 02 October 2007 04:41:49 Linus Torvalds wrote:
> [snip]
> > In other words, people who know they may be affected and would want to
> > prepare can look at (for example)
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86.git x86
> >
> > and generally get ready for the switch-over.
> 
> This is certainly a tool issue, but if I use Debian's kernel-image 
> "make-kpkg" 
> wrapper around the kernel build system, it fails with:
> 
> cp: cannot stat `arch/x86_64/boot/bzImage': No such file or directory
> 
> Obviously, this file has moved to arch/x86/boot, but it seems like 
> possibly unnecessary breakage. I've been copying bzImage for years 
> from arch/x86_64/boot, and I'm sure there's a handful of scripts 
> (other than Debian's kernel-image) doing this too.
> 
> For now, I hacked the tool[1]. Maybe, if we care, a symlink could be 
> set up between arch/x86/boot and arch/$ARCH/boot ? Or would papering 
> over this be more trouble than it's worth?

yeah, a symlink is the right solution i think. Our first-step goal is to 
make the switchover seamless for all practical purposes, and a 
compatibility symlink in arch/i386/boot/ will not hurt. (we shouldn't 
worry about the really old zImage target though)

Ingo


[PATCH RFC 2/2] IRQ: Modularize the setup_irq code (2)

2007-10-04 Thread Ahmed S. Darwish
Introduce irq_desc_match_fist_irqaction() to support setup_irq() 
code modularity.

Signed-off-by: Ahmed S. Darwish <[EMAIL PROTECTED]>
---

Any ideas for a better method name ?

 manage.c |   89 ---
 1 file changed, 51 insertions(+), 38 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 6a0d778..4e96d56 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -293,6 +293,55 @@ int can_add_irqaction_on_allocated_irq(unsigned int irq, 
struct irqaction *new)
 }
 
 /*
+ * Configure the passed irq descriptor to satisfy our first newly
+ * added irqaction needs
+ * must be called with the irq_desc[irq]->lock held
+ */
+void irq_desc_match_fist_irqaction(unsigned int irq, struct irqaction *new)
+{
+   struct irq_desc *desc = irq_desc + irq;
+
+   /* We must be the first and the only irqaction */
+   BUG_ON(desc->action != new || new->next);
+
+   irq_chip_set_defaults(desc->chip);
+
+#if defined(CONFIG_IRQ_PER_CPU)
+   if (new->flags & IRQF_PERCPU)
+   desc->status |= IRQ_PER_CPU;
+#endif
+
+   /* Setup the type (level, edge polarity) if configured: */
+   if (new->flags & IRQF_TRIGGER_MASK) {
+   if (desc->chip && desc->chip->set_type)
+   desc->chip->set_type(irq,
+new->flags & IRQF_TRIGGER_MASK);
+   else
+   /*
+* IRQF_TRIGGER_* but the PIC does not support
+* multiple flow-types?
+*/
+   printk(KERN_WARNING "No IRQF_TRIGGER set_type "
+  "function for IRQ %d (%s)\n", irq,
+  desc->chip ? desc->chip->name : "unknown");
+   } else
+   compat_irq_chip_set_default_handler(desc);
+
+   desc->status &= ~(IRQ_AUTODETECT | IRQ_WAITING | IRQ_INPROGRESS);
+
+   if (!(desc->status & IRQ_NOAUTOEN)) {
+   desc->depth = 0;
+   desc->status &= ~IRQ_DISABLED;
+   if (desc->chip->startup)
+   desc->chip->startup(irq);
+   else
+   desc->chip->enable(irq);
+   } else
+   /* Undo nested disables: */
+   desc->depth = 1;
+}
+
+/*
  * Internal function to register an irqaction - typically used to
  * allocate special interrupts that are part of the architecture.
  */
@@ -352,45 +401,9 @@ int setup_irq(unsigned int irq, struct irqaction *new)
if (new->flags & IRQF_NOBALANCING)
desc->status |= IRQ_NO_BALANCING;
 
-   if (!shared) {
-   irq_chip_set_defaults(desc->chip);
+   if (!shared)
+   irq_desc_match_fist_irqaction(irq, new);
 
-#if defined(CONFIG_IRQ_PER_CPU)
-   if (new->flags & IRQF_PERCPU)
-   desc->status |= IRQ_PER_CPU;
-#endif
-
-   /* Setup the type (level, edge polarity) if configured: */
-   if (new->flags & IRQF_TRIGGER_MASK) {
-   if (desc->chip && desc->chip->set_type)
-   desc->chip->set_type(irq,
-   new->flags & IRQF_TRIGGER_MASK);
-   else
-   /*
-* IRQF_TRIGGER_* but the PIC does not support
-* multiple flow-types?
-*/
-   printk(KERN_WARNING "No IRQF_TRIGGER set_type "
-  "function for IRQ %d (%s)\n", irq,
-  desc->chip ? desc->chip->name :
-  "unknown");
-   } else
-   compat_irq_chip_set_default_handler(desc);
-
-   desc->status &= ~(IRQ_AUTODETECT | IRQ_WAITING |
- IRQ_INPROGRESS);
-
-   if (!(desc->status & IRQ_NOAUTOEN)) {
-   desc->depth = 0;
-   desc->status &= ~IRQ_DISABLED;
-   if (desc->chip->startup)
-   desc->chip->startup(irq);
-   else
-   desc->chip->enable(irq);
-   } else
-   /* Undo nested disables: */
-   desc->depth = 1;
-   }
/* Reset broken irq detection when installing new handler */
desc->irq_count = 0;
desc->irqs_unhandled = 0;

-- 
Ahmed S. Darwish
HomePage: http://darwish.07.googlepages.com
Blog: http://darwish-07.blogspot.com


[PATCH RFC 1/2] IRQ: Modularize the setup_irq code (1)

2007-10-04 Thread Ahmed S. Darwish
Hi Thomas/lkml,

The setup_irq() code is one big chunk of 130 lines that can be divided 
into several smaller functions. These 2 patches introduce those small 
functions to make setup_irq() more modular. There are no major changes 
to the code logic.

Patches can be applied cleanly over v2.6.23-rc9.

Thanks,

==> (Description for Logs)

Introduce can_add_irqaction_on_allocated_irq and warn_about_irqaction_mismatch
methods to support setup_irq() code modularity.

Signed-off-by: Ahmed S. Darwish <[EMAIL PROTECTED]>
---

 manage.c |   92 +--
 1 file changed, 55 insertions(+), 37 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 7230d91..6a0d778 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -248,6 +248,50 @@ void compat_irq_chip_set_default_handler(struct irq_desc 
*desc)
desc->handle_irq = NULL;
 }
 
+static inline void warn_about_irqaction_mismatch(unsigned int irq,
+struct irqaction *new)
+{
+#ifdef CONFIG_DEBUG_SHIRQ
+   const char *name = irq_desc[irq].action->name;
+   /* If device doesn't expect the mismatch */
+   if (!(new->flags & IRQF_PROBE_SHARED)) {
+   printk(KERN_ERR "IRQ handler type mismatch for IRQ %d\n", irq);
+   if (name)
+   printk(KERN_ERR "current handler: %s\n", name);
+   dump_stack();
+   }
+#endif
+}
+
+/*
+ * Test if an irqaction can be added to the passed allocated IRQ line
+ * Must be called with the irq_desc[irq]->lock held.
+ */
+int can_add_irqaction_on_allocated_irq(unsigned int irq, struct irqaction *new)
+{
+   struct irqaction *old = irq_desc[irq].action;
+
+   BUG_ON(!old);
+   /*
+* Can't share interrupts unless both agree to and are
+* the same type (level, edge, polarity). So both flag
+* fields must have IRQF_SHARED set and the bits which
+* set the trigger type must match.
+*/
+   if (!((old->flags & new->flags) & IRQF_SHARED) ||
+   ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK))
+   return 0;
+
+#if defined(CONFIG_IRQ_PER_CPU)
+   /* All handlers must agree on per-cpuness */
+   if ((old->flags & IRQF_PERCPU) !=
+   (new->flags & IRQF_PERCPU))
+   return 0;
+#endif
+
+   return 1;
+}
+
 /*
  * Internal function to register an irqaction - typically used to
  * allocate special interrupts that are part of the architecture.
@@ -256,7 +300,6 @@ int setup_irq(unsigned int irq, struct irqaction *new)
 {
struct irq_desc *desc = irq_desc + irq;
struct irqaction *old, **p;
-   const char *old_name = NULL;
unsigned long flags;
int shared = 0;
 
@@ -289,31 +332,18 @@ int setup_irq(unsigned int irq, struct irqaction *new)
p = >action;
old = *p;
if (old) {
-   /*
-* Can't share interrupts unless both agree to and are
-* the same type (level, edge, polarity). So both flag
-* fields must have IRQF_SHARED set and the bits which
-* set the trigger type must match.
-*/
-   if (!((old->flags & new->flags) & IRQF_SHARED) ||
-   ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK)) {
-   old_name = old->name;
-   goto mismatch;
-   }
-
-#if defined(CONFIG_IRQ_PER_CPU)
-   /* All handlers must agree on per-cpuness */
-   if ((old->flags & IRQF_PERCPU) !=
-   (new->flags & IRQF_PERCPU))
-   goto mismatch;
-#endif
-
-   /* add new interrupt at end of irq queue */
-   do {
-   p = >next;
-   old = *p;
-   } while (old);
shared = 1;
+   if (can_add_irqaction_on_allocated_irq(irq, new)) {
+   /* add new interrupt at end of irq queue */
+   do {
+   p = >next;
+   old = *p;
+   } while (old);
+   } else {
+   warn_about_irqaction_mismatch(irq, new);
+   spin_unlock_irqrestore(>lock, flags);
+   return -EBUSY;
+   }
}
 
*p = new;
@@ -372,18 +402,6 @@ int setup_irq(unsigned int irq, struct irqaction *new)
register_handler_proc(irq, new);
 
return 0;
-
-mismatch:
-#ifdef CONFIG_DEBUG_SHIRQ
-   if (!(new->flags & IRQF_PROBE_SHARED)) {
-   printk(KERN_ERR "IRQ handler type mismatch for IRQ %d\n", irq);
-   if (old_name)
-   printk(KERN_ERR "current handler: %s\n", old_name);
-   dump_stack();
-   }
-#endif
-   spin_unlock_irqrestore(>lock, flags);
-   return -EBUSY;
 }
 
 /**
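
The flag test in can_add_irqaction_on_allocated_irq() can be tried out
in isolation with the userspace sketch below. The DEMO_* values are
defined locally for illustration only (see linux/interrupt.h for the
real IRQF_* definitions), can_share_demo is a hypothetical name, and the
per-CPU check is applied unconditionally here rather than under
CONFIG_IRQ_PER_CPU.

```c
/* Illustrative flag values for this sketch only. */
#define DEMO_TRIGGER_RISING	0x01u
#define DEMO_TRIGGER_FALLING	0x02u
#define DEMO_TRIGGER_MASK	0x0fu
#define DEMO_SHARED		0x80u
#define DEMO_PERCPU		0x400u

/* Returns 1 when a new handler may share the line with the old one. */
static int can_share_demo(unsigned int old_flags, unsigned int new_flags)
{
	/* Both sides must have the shared bit set. */
	if (!((old_flags & new_flags) & DEMO_SHARED))
		return 0;
	/* XOR exposes any trigger-type bit on which the two differ. */
	if ((old_flags ^ new_flags) & DEMO_TRIGGER_MASK)
		return 0;
	/* Both must agree on per-CPU handling. */
	if ((old_flags & DEMO_PERCPU) != (new_flags & DEMO_PERCPU))
		return 0;
	return 1;
}
```

ANDing the two flag words before masking checks that the bit is set on
both sides, while XORing them before masking checks that the masked
bits are identical.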


Re: [PATCH 1/2] add tunable_notifier function

2007-10-04 Thread Takenori Nagano
Randy Dunlap wrote:
> On Thu, 04 Oct 2007 20:38:34 +0900 Takenori Nagano wrote:
>> diff -uprN linux-2.6.23-rc9.orig/kernel/sys.c linux-2.6.23-rc9/kernel/sys.c
>> --- linux-2.6.23-rc9.orig/kernel/sys.c   2007-10-02 12:24:52.0 
>> +0900
>> +++ linux-2.6.23-rc9/kernel/sys.c2007-10-03 14:48:15.16000 +0900
>> @@ -38,6 +38,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include 
>>  #include 
>> @@ -393,6 +394,234 @@ int blocking_notifier_call_chain(struct
> 
>> +/**
>> + *  tunable_notifier_chain_register - Add notifier to an tunable notifier 
>> chain
>> + *  @nh: Pointer to head of the tunable notifier chain
>> + *  @n: New entry in notifier chain
>> + *  @name: Pointer to the name of this notifier chain
> 
> Is @name the name of a notifier chain or of the new notifier entry?

Hi Randy,

@name: Pointer to the name of the new notifier entry.

I'll change the explanation.

Thanks,


Re: [PATCH 2/2] implement new notifier function to panic_notifier_list

2007-10-04 Thread Takenori Nagano
Randy Dunlap wrote:
> On Thu, 04 Oct 2007 20:38:50 +0900 Takenori Nagano wrote:
> 
>> This patch implements the new notifier function for panic_notifier_list. We
>> can change the order of the list through debugfs.
>>
>> Thanks,
>>
>> ---
>>
>> Signed-off-by: Takenori Nagano <[EMAIL PROTECTED]>
>>
>> ---
>>   * Returns seconds, approximately.  We don't need nanosecond
>>   * resolution, and we don't need to waste time with a big divide when
>> @@ -193,5 +201,6 @@ __init void spawn_softlockup_task(void)
>>  cpu_callback(_nfb, CPU_ONLINE, cpu);
>>  register_cpu_notifier(_nfb);
>>
>> -atomic_notifier_chain_register(_notifier_list, _block);
>> +tunable_notifier_chain_register(_notifier_list, _block,
>> +"softlookup", NULL);
>>  }
> 
> "softlockup"

Hi Randy,

Thank you for reviewing. :)
I'll fix it in the next version.


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-04 Thread Eric W. Biederman
Kyle Moffett <[EMAIL PROTECTED]> writes:

> On Oct 04, 2007, at 21:44:02, Eric W. Biederman wrote:
>> What we want from the LSM is the ability to say -EPERM when we can clearly
>> articulate that we want to disallow something.
>
> This sort of depends on perspective; typically with security infrastructure
> you actually want "... the ability to return success when we can clearly
> articulate that we want to *ALLOW* something".  File permissions work this
> way; we don't have a list of forbidden users attached to each file, we have
> an owner, a group, and a mode representing positive permissions.  With that
> said in certain high-risk environments you need something even stronger that
> cannot be changed by the "owner" of the file, if we don't entirely trust them,

Yes.  However last I looked at the LSM hooks we first do the normal unix
permission checks.  Then we run the hook.  So it can only increase the
number of times we say -EPERM.
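
The ordering described here can be sketched in userspace C as below. It
is an illustration only: check_access_demo and deny_on_label_mismatch
are hypothetical names, subjects and objects are plain integers, and the
error code follows the wording of this message rather than any
particular kernel path. Because the hook is consulted only after the
normal check has passed, it can turn an allow into a deny but never the
reverse.

```c
#include <errno.h>

/* The normal check runs first; the hook only sees what it allowed. */
static int check_access_demo(int dac_allows, int (*hook)(int, int),
			     int subject, int object)
{
	if (!dac_allows)
		return -EPERM;
	return hook ? hook(subject, object) : 0;
}

/* Example hook: deny whenever the subject and object labels differ. */
static int deny_on_label_mismatch(int subject, int object)
{
	return subject == object ? 0 : -EPERM;
}
```

Passing a null hook models a kernel with no security module loaded, in
which case the normal check alone decides.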

>> SElinux is not all encompassing or it is generally incomprehensible I don't
>> know which.  Or someone long ago would have said a better  way to implement
>> containers was with a selinux ruleset, here is a  selinux ruleset that does
>> that.  Although it is completely possible  to implement all of the isolation
>> with the existing LSM hooks as  Serge showed.
>
> The difference between SELinux and containers is that SELinux (and LSM as a
> whole) returns -EPERM to operations outside the scope of the  subject, whereas
> containers return -ENOENT (because it's not even in  the same namespace).

Yes.  However if you look at what the first implementations were.  Especially
something like linux-vserver.  All they provided was isolation.  So perhaps
you would not see every process ps but they all had unique pid values.

I'm pretty certain Serge at least prototyped a simplified version
of that using the LSM hooks.  Is there something I'm not remembering in
those hooks that allows hiding of information like processes?

Yes. Currently with containers we are taking that one step farther as
that solves a wider set of problems.

>> We also have in the kernel another parallel security mechanism (for what is
>> generally a different class of operations) that has been  quite successful,
>> and different groups get along quite well, and  ordinary mortals can
>> understand it.   The linux firewalling code.
>
> Well, I wouldn't go so far as the "ordinary mortals can understand it" part;
> it's still pretty high on the obtuse-o-meter.

True.  Probably a more accurate statement is: unix command line power
users can and do handle it after reading the docs.  That's not quite
ordinary mortals but it feels like it some days.  It might all be
perception...

>> The linux firewalling code has hooks all throughout the networking stack,
>> just like the LSM has hooks all throughout the rest of linux  kernel.  There
>> is a difference however.  The linux firewalling code in addition to hooks has
>> tables behind those hooks that it  consults. There is generic code to walk
>> those tables and consult with different kernel modules to decide if we should
>> drop a packet.  Each of those kernel modules provides a different capability
>> that can be used to generate a firewall.
>
> This is almost *EXACTLY* what SELinux provides as an LSM module.  The one
> difference is that with SELinux some compromises and restrictions have been
> made so that (theoretically) the resulting policy can be exhaustively
> analyzed to *prove* what it allows and disallows.  It may be that SELinux
> should be split into 2 parts, one that provides the underlying table-matching
> and the other that uses it to provide the provability guarantees.  Here's a
> direct comparison:
>
> netfilter:
>   (A) Each packet has src, dst, port, etc that can be matched
>   (B) Table of rules applied sequentially (MATCH => ACTION)
>   (C) Rules may alter the properties of packets as they are
>       routed/bridged/etc
>
> selinux:
>   (A) Each object has user, role, and type that can be matched
>   (B) Table of rules searched by object parameters
>       (MATCH => allow/auditallow/transition)
>   (C) Rules may alter the properties of objects through transition rules.

Ok.  There is something here.

However in a generic setup, at least role would be an extended match
criteria provided by the selinux module.  It would not be a core
attribute.  It would need to depend on some extra functionality being
compiled in.
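
A generic "table behind the hook" with pluggable match criteria could look roughly like the following user-space sketch. Everything here is hypothetical (`struct rule`, `walk_table`, the two match functions, the type names used as examples); it models the sequential netfilter-style walk, not the parameter-indexed search SELinux uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum verdict { VERDICT_ALLOW, VERDICT_DENY };

struct object {
	const char *user;
	const char *type;
};

/* A match "module": a predicate plus its private argument. */
struct rule {
	bool (*match)(const struct object *obj, const char *arg);
	const char *arg;
	enum verdict action;
};

static bool match_user(const struct object *obj, const char *arg)
{
	return strcmp(obj->user, arg) == 0;
}

static bool match_type(const struct object *obj, const char *arg)
{
	return strcmp(obj->type, arg) == 0;
}

/* Walk the rules in order; first match wins, otherwise use the policy. */
static enum verdict walk_table(const struct rule *rules, size_t n,
			       const struct object *obj, enum verdict policy)
{
	for (size_t i = 0; i < n; i++)
		if (rules[i].match(obj, rules[i].arg))
			return rules[i].action;
	return policy;
}
```

An extended criterion such as role would then be one more match function supplied by whichever module is compiled in, rather than a field the core walker knows about.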

>> I'm not yet annoyed enough to go implement an iptables like interface to the
>> LSM enhancing it with more generic mechanism to make the problem simpler, but
>> I'm getting there.  Perhaps next time I'm bored.
>
> I think a fair amount of what we need is already done in SELinux, and efforts
> would be better spent in figuring out what seems too complicated in SELinux
> and making it simpler.  Probably a fair amount of that just means better
> tools.

How about thinking of it another way.  

Perform the split up you 

Re: [PATCH 0/2] add new notifier function

2007-10-04 Thread Vivek Goyal
On Thu, Oct 04, 2007 at 08:38:05PM +0900, Takenori Nagano wrote:
> Hi,
> 
> These patches add a new notifier function and use it for
> panic_notifier_list.
> So far we have used a hardcoded notifier chain, but it was not flexible. The
> new notifier is very flexible, because the user can change the order of the
> list via debugfs.
> 

Hi Takenori,

There has been some more discussion regarding a configurable notifier list.
The link follows. Please go through it.

http://marc.info/?l=linux-kernel&m=118968996202991&w=2

Not everybody is too happy about it. Personally I am not against it. My take
is that after panic() there is no guarantee that all the registered notifiers
will be executed, just that the kernel will try its best. If a notifier handler
is written badly, the kernel can't do much about it. It is left to the
administrator to decide what he considers most important and to assign
priorities accordingly.

So if kdump is of utmost priority, then administrator should give highest
priority to kdump. 
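
A priority-ordered notifier list of the kind being discussed can be sketched in user-space C as below. This is an illustration only: `struct notifier`, `notifier_register`, and `notifier_call_chain` are simplified stand-ins, not the kernel's notifier API, and the assumption is that a higher priority value runs first.

```c
#include <stddef.h>

struct notifier {
	const char *name;
	int priority;            /* assumption: higher value runs first */
	void (*call)(void);      /* handler, may be NULL in this model */
	struct notifier *next;
};

/* Insert while keeping the list sorted by descending priority. */
static void notifier_register(struct notifier **head, struct notifier *n)
{
	while (*head && (*head)->priority >= n->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/*
 * Run the handlers in list order and return how many entries were visited.
 * Nothing here protects against a handler that never returns, which is
 * the "kernel will try its best" caveat.
 */
static int notifier_call_chain(struct notifier *head)
{
	int visited = 0;

	for (; head; head = head->next) {
		if (head->call)
			head->call();
		visited++;
	}
	return visited;
}
```

With the priorities from the example session (ipmi_msghandler at 200, ipmi_wdog at 150), the message handler ends up ahead of the watchdog regardless of registration order.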

Having said that, which RAS tools require this infrastructure?
Currently kdb seems to be the only candidate which needs to run in
the crashing kernel. The rest of the actions can be performed in the second
kernel. If that is the case, then it is probably better that kdb puts a
break point on panic(), as suggested by Eric, and the rest of the post-panic
actions are executed in the second kernel.

Executing the rest of the notifications in the second kernel has both pros and
cons. It makes things more reliable. At the same time it makes things a little
more complex: first, one needs to pass all the required configuration
information to the second kernel; second, all the notification handlers need
to be ready to run in two contexts. These handlers will run in the context of
the first kernel if kdump is not configured, otherwise they will need to run
in the second kernel.

In summary, right now co-existence of kdb with kdump seems to be your pain
point. I would prefer that kdb just puts a break point on panic() and we move
on. If there are more candidates down the line and these can't be easily
executed in second kernel then we can re-visit this notification list
mechanism.
   
 
> Please review, and give some comments.
> 
> Thanks,
> 
> Example)
> 
> # cd /sys/kernel/debug/
> # ls
> kprobes  pktcdvd
> # insmod ipmi_msghandler.ko
> # ls
> kprobes  panic_notifier_list  pktcdvd
> # cd panic_notifier_list/
> # ls
> ipmi_msghandler
> # insmod ipmi_watchdog.ko
> # ls
> ipmi_msghandler  ipmi_wdog
> # cat ipmi_msghandler/priority
> 200
> # cat ipmi_wdog/priority
> 150
> #
> Kernel panic - not syncing: panic
> ipmi_msghandler : notifier calls panic_event().
> ipmi_watchdog : notifier calls wdog_panic_handler().
> 
> .(reboot)
> 

We also need to implement a file which gives a consolidated view: all
the registered members and their priorities.

Thanks
Vivek


[PATCH] LOCKDEP: fix mismatched lockdep_depth/curr_chain_hash

2007-10-04 Thread Gregory Haskins
Hi Ingo,
  I am seeing a problem on the latest -rt where lockdep completely overwhelms
  the system to the point that it grinds to a halt on large (8-way+) systems.
  The problem seems to be that the class->locks_before and locks_after grow
  unbounded (I have observed over 1M+ entries in them) so a lock_acquire call
  can take over 10 seconds to finish resolving.  Related to this seems to be
  that lockdep appears to see a chain-hash miss over and over for what I would
  assume should be an established graph (for instance, in
  double_lock_balance() in an rt_overload condition).  Turning off
  PROVE_LOCKING (statically, or by setting debug_locks=0 dynamically) restores
  the system to normal behavior.

  I took some time tonight to study lockdep (it is quite an impressive body of
  code!), and came up with the following "fix".  It does improve things
  significantly by addressing what I believe is the issue with the
  cache-misses (though it would appear there are still a few more issues
  there that need addressing as some boots are still very lethargic).  I use 
  the term "fix" loosely since I am not confident that I fully understand the
  intention of your logic here so I can't say for sure if it was really
  broken, or if I have made it worse ;)

  Could you comment on what I have done here, or offer any advice on what to
  look for elsewhere?  I based the patch on pure linux-2.6.git since I see the
  same issue (by visual inspection, that is) there as well.

  Thanks in advance!
  -Greg   

--

LOCKDEP: fix mismatched lockdep_depth/curr_chain_hash

It is possible for the current->curr_chain_key to become inconsistent with the
current index if the chain fails to validate.  The end result is that future
lock_acquire() operations may inadvertently fail to find a hit in the cache
resulting in a new node being added to the graph for every acquire.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/lockdep.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 734da57..efb0d7e 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -2450,11 +2450,11 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 		chain_head = 1;
 	}
 	chain_key = iterate_chain_key(chain_key, id);
-	curr->curr_chain_key = chain_key;
 
 	if (!validate_chain(curr, lock, hlock, chain_head))
 		return 0;
 
+	curr->curr_chain_key = chain_key;
 	curr->lockdep_depth++;
 	check_chain_key(curr);
 #ifdef CONFIG_DEBUG_LOCKDEP
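
The inconsistency the changelog describes can be modelled in a few lines of user-space C. This is a sketch of the description only, not of lockdep itself: `struct task_model`, `iterate_key`, and the `valid` flag (standing in for the result of validate_chain()) are all hypothetical, and the key mixing is arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

struct task_model {
	uint64_t curr_chain_key;  /* key describing the held-lock stack */
	int depth;                /* number of held locks */
};

/* Illustrative mixing function only, not the kernel's hash. */
static uint64_t iterate_key(uint64_t key, unsigned int class_id)
{
	return key * 31 + class_id;
}

/* Pre-patch ordering: the key is stored before validation. */
static bool acquire_early_store(struct task_model *t, unsigned int id,
				bool valid)
{
	uint64_t key = iterate_key(t->curr_chain_key, id);

	t->curr_chain_key = key;
	if (!valid)
		return false;	/* key now covers a lock that is not held */
	t->depth++;
	return true;
}

/* Post-patch ordering: key and depth are only updated together. */
static bool acquire_late_store(struct task_model *t, unsigned int id,
			       bool valid)
{
	uint64_t key = iterate_key(t->curr_chain_key, id);

	if (!valid)
		return false;
	t->curr_chain_key = key;
	t->depth++;
	return true;
}
```

In the early-store variant a failed validation leaves a key that no longer matches the depth, so any later lookup keyed on it would be computed from a stale starting point.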



[PATCH] LOCKDEP: fix mismatched lockdep_depth/curr_chain_hash

2007-10-04 Thread Gregory Haskins
Doh!  I guess there should be a rule about sending patches out after midnight
;)

The original patch I worked on was written before the code was moved to
validate_chain(), so my previous posting didn't quite translate when I merged
with git HEAD.  Here is an updated patch.  Sorry for the confusion.

Regards,
-Greg


---

 kernel/lockdep.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 734da57..42ae4a5 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -1521,7 +1521,7 @@ cache_hit:
 }
 
 static int validate_chain(struct task_struct *curr, struct lockdep_map *lock,
-		struct held_lock *hlock, int chain_head)
+		struct held_lock *hlock, int chain_head, u64 chain_key)
 {
 	/*
 	 * Trylock needs to maintain the stack of held locks, but it
@@ -1534,7 +1534,7 @@ static int validate_chain(struct task_struct *curr, struct lockdep_map *lock,
 	 * graph_lock for us)
 	 */
 	if (!hlock->trylock && (hlock->check == 2) &&
-			lookup_chain_cache(curr->curr_chain_key, hlock->class)) {
+			lookup_chain_cache(chain_key, hlock->class)) {
 		/*
 		 * Check whether last held lock:
 		 *
@@ -1576,7 +1576,7 @@ static int validate_chain(struct task_struct *curr, struct lockdep_map *lock,
 #else
 static inline int validate_chain(struct task_struct *curr,
 		struct lockdep_map *lock, struct held_lock *hlock,
-		int chain_head)
+		int chain_head, u64 chain_key)
 {
 	return 1;
 }
@@ -2450,11 +2450,11 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 		chain_head = 1;
 	}
 	chain_key = iterate_chain_key(chain_key, id);
-	curr->curr_chain_key = chain_key;
 
-	if (!validate_chain(curr, lock, hlock, chain_head))
+	if (!validate_chain(curr, lock, hlock, chain_head, chain_key))
 		return 0;
 
+	curr->curr_chain_key = chain_key;
 	curr->lockdep_depth++;
 	check_chain_key(curr);
 #ifdef CONFIG_DEBUG_LOCKDEP



RE: SLUB performance regression vs SLAB

2007-10-04 Thread David Schwartz

> On 10/04/2007 07:39 PM, David Schwartz wrote:

> > But this is just a preposterous position to put him in. If there's no
> > reproducible test case, then why should he care that one program he
> > can't even see works badly? If you care, you fix it.

> People have been trying for years to make reproducible test cases
> for huge and complex workloads. It doesn't work. The tests that do
> work take weeks to run and need to be carefully validated before
> they can be officially released. The open source community can and
> should be working on similar tests, but they will never be simple.

That's true, but irrelevant. Either the test can identify a problem that
applies generally, or it's doing nothing but measuring how good the system
is at doing the test. If the former, it should be possible to create a
simple test case once you know from the complex test where the problem is.
If the latter, who cares about a supposed regression?

It should be possible to identify exactly what portion of the test shows the
regression the most and exactly what the system is doing during that moment.
The test may be great at finding regressions, but once it finds them, they
should be forever *found*.

Did you follow the recent incident when iperf found what seemed to be a
significant CFS networking regression? The only way to identify that it was
a quirk in what iperf was doing was by looking at exactly what iperf was
doing. The only efficient way was to look at iperf's source and see that
iperf's weird yielding meant it didn't replicate typical use cases like it
was supposed to.

DS




Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

2007-10-04 Thread Jeremy Fitzhardinge
Rik van Riel wrote:
> Either of these two would work.  Another alternative could be to
> let test_and_clear_pte_flags have an exception table entry, where
> we jump right to the next instruction if the instruction clearing
> the flag fails.
>
> That is essentially the variant you need for Xen, except the fast
> path is still exactly the same as when running on native
> hardware.
>   

Hm, that wouldn't end up clearing the bit.  You'd need a Xen-specific
exception handler to do that, which would turn the whole thing into
Xen-specific code, and you're back at adding a pv-op.

J



Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

2007-10-04 Thread Jeremy Fitzhardinge
Andrew Morton wrote:
> y'know, I think it's been several years since I saw a report of an
> honest to goodness, genuine SMP race in core kernel.  We used to be
> infested by them, but the term has fallen into disuse.  Interesting, but
> OT.
>   

I was a bit surprised to find myself typing it too.  I guess it could
also be a preempt race, which has been a bit more common.  Anyway, it's a
deliberately unlocked access to the pagetable structure, so not terribly
surprising.

>> It seems to me that there are a few ways to fix this:
>>
>>    1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled.  This
>>       will clearly work, but is pretty blunt.
>>    2. Make test_and_clear_pte_flags a new paravirt-op, which can be
>>       implemented in Xen as a hypercall, and as a raw test_and_clear_bit
>>       for everyone else.  The downside is adding yet another pv-op.
>>    3. Restructure the pagetable setup code so that the mm is not added
>>       to the prio tree until after arch_dup_mmap has been called (and
>>       the converse for exit_mmap).  This is arguably cleaner, but I
>>       haven't looked to see how much trouble this would be.
>>
>> Thoughts anyone?  Does making the pagetables visible "early" cause
>> problems for anyone else?
>> 
>
> I expect that 2) has the maximum niceness*suitable-for-2.6.23 product.
>   

OK, I'll whip a patch together.
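
Option 2 amounts to routing the flag test-and-clear through an ops table, so that native hardware and a hypervisor can each supply their own implementation. The user-space sketch below shows only that shape. All names (`pv_mmu_ops_model`, `native_test_and_clear`, `hv_test_and_clear`, the `hypercalls` counter) are hypothetical, the operations are not atomic, and the "hypercall" is just a counter increment.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_model_t;
#define PTE_ACCESSED ((pte_model_t)1 << 5)	/* x86 "accessed" bit */

/* One entry of a paravirt-style ops table (model only). */
struct pv_mmu_ops_model {
	bool (*test_and_clear_pte_flags)(pte_model_t *pte, pte_model_t flags);
};

/* Native: touch the pte directly (a real one would be atomic). */
static bool native_test_and_clear(pte_model_t *pte, pte_model_t flags)
{
	bool was_set = (*pte & flags) != 0;

	*pte &= ~flags;
	return was_set;
}

static unsigned int hypercalls;	/* counts simulated hypervisor requests */

/* Hypervisor: the update is requested rather than written directly. */
static bool hv_test_and_clear(pte_model_t *pte, pte_model_t flags)
{
	bool was_set = (*pte & flags) != 0;

	if (was_set) {
		hypercalls++;
		*pte &= ~flags;
	}
	return was_set;
}

/* Callers go through the table and never know which backend runs. */
static bool test_and_clear_young_model(const struct pv_mmu_ops_model *ops,
				       pte_model_t *pte)
{
	return ops->test_and_clear_pte_flags(pte, PTE_ACCESSED);
}
```

The cost of this approach is the one named in the list: every such operation becomes another entry in the ops table.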

> That's if you actually care much about kernel.org major releases - do many
> people run kernel.org kernels on Xen? 

Well, given that there hasn't been a Xen-capable kernel.org release yet,
no...  But we'll see what happens when .23 goes out the door.

>  If "not many" then we could perhaps
> do something more elaborate for 2.6.23.1.  But adding ever more pvops as
> core kernel evolves was always expected.
>   

I think we should keep it simple for now; anything significant can wait for
the brave new world of unified x86.

J



Re: Accessing 64-bit BARs

2007-10-04 Thread yogeshwar sonawane
Hello,
Thanks Rolf & Roland.

pci_iomap() does not do anything extra; it is just an abstraction that
covers both I/O-mapped and memory-mapped BARs.
I know that my BARs are MMIO, so using the ioremap() & readl()/writel()
combination should be fine.

But for the problem explained in my first mail, any
help/suggestions would be welcome.

-Yogeshwar

On 10/4/07, Roland Dreier <[EMAIL PROTECTED]> wrote:
>  > You should use pci_iomap() to get an access pointer to the BAR. After
>  > this you can access the memory with ioread*() and iowrite*(). See
>  > "man pci_iomap(9)" if you build kernel manpages.
>
> That works fine, but ioremap() and readl()/writel() is also perfectly
> fine for regions that you know are always MMIO.
>
>  - R.
>


Re: [BUG] kernel BUG at arch/i386/mm/highmem.c:15! on 2.6.23-rc8/rc9

2007-10-04 Thread gurudas pai

gurudas pai wrote:

Hugh Dickins wrote:

On Thu, 4 Oct 2007, gurudas pai wrote:

Nick Piggin wrote:

While running Oracle database test on x86/6GB RAM machine panics with
following messages.

Hmm, seems like something in sys_remap_file_pages might have broken.
It's a bit hard to work out from the backtrace, though.

Is it possible you can strace to find the arguments for the
remap_file_pages that goes wrong?

Ahh, I think it's just underflowing the preempt count somewhere, which
is leading highmem.c:15 to just *think* it is in an interrupt.

But you aren't running a preemptible kernel, which makes it unusual...
it would have to be coming from interrupt code (or just random
corruption).

Still, preempt debugging should catch those cases as well.

So, can you disregard my last message, and instead compile a kernel
with CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT, and see what
messages come up?

With CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT set I got the following
messages on rc9.

BUG: using smp_processor_id() in preemptible [0001] code: oracle/3631

caller is kunmap_atomic+0xb/0x82
 [] debug_smp_processor_id+0xa1/0xb4
 [] kunmap_atomic+0xb/0x82
 [] __do_fault+0x55/0x35b
 [] handle_mm_fault+0x4d0/0x909
 [] follow_page+0x1d9/0x228
 [] get_user_pages+0x250/0x332
 [] make_pages_present+0x7b/0x90
 [] sys_remap_file_pages+0x2de/0x330
 [] syscall_call+0x7/0xb
 [] ioctl_standard_call+0x209/0x2ce


Very helpful, thanks.  Guru, please try the appended patch, I think
you'll find it fixes it for you (it did for me, once I'd puzzled out
why I was failing to reproduce the problem - tests on ext3 don't work).
Thank you so much for reporting this just in time!


[PATCH] fix sys_remap_file_pages BUG at highmem.c:15!

Gurudas Pai reports kernel BUG at arch/i386/mm/highmem.c:15! below
sys_remap_file_pages, while running Oracle database test on x86 in 6GB
RAM:

kunmap thinks we're in_interrupt because the preempt count has wrapped.

That's because __do_fault expected to unmap page_table, but one of its
two callers, do_nonlinear_fault, already unmapped it: let do_linear_fault
unmap it first too, and then there's no need to pass the page_table arg down.

Why have we been so slow to notice this?  Probably through forgetting
that the mapping_cap_account_dirty test means that sys_remap_file_pages
nowadays only goes the full nonlinear vma route on a few memory-backed
filesystems like ramfs, tmpfs and hugetlbfs.

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>

--- 2.6.23-rc9/mm/memory.c	2007-07-26 19:49:58.0 +0100
+++ linux/mm/memory.c	2007-10-04 15:42:20.0 +0100
@@ -2307,13 +2307,14 @@ oom:
  * do not need to flush old virtual caches or the TLB.
  *
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
+ * but allow concurrent faults), and pte neither mapped nor locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
 {
+	pte_t *page_table;
 	spinlock_t *ptl;
 	struct page *page;
 	pte_t entry;
@@ -2327,7 +2328,6 @@ static int __do_fault(struct mm_struct *
 	vmf.flags = flags;
 	vmf.page = NULL;
 
-	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
 
 	if (likely(vma->vm_ops->fault)) {
@@ -2468,8 +2468,8 @@ static int do_linear_fault(struct mm_str
 			- vma->vm_start) >> PAGE_CACHE_SHIFT) + vma->vm_pgoff;
 	unsigned int flags = (write_access ? FAULT_FLAG_WRITE : 0);
 
-	return __do_fault(mm, vma, address, page_table, pmd, pgoff,
-							flags, orig_pte);
+	pte_unmap(page_table);
+	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
 
@@ -2552,9 +2552,7 @@ static int do_nonlinear_fault(struct mm_
 	}
 
 	pgoff = pte_to_pgoff(orig_pte);
-
-	return __do_fault(mm, vma, address, page_table, pmd, pgoff,
-							flags, orig_pte);
+	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
 /*
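
Why a double unmap makes kunmap believe it is in interrupt context can be shown with a small counter model. This is a user-space sketch, not kernel code: the `_MODEL` masks are assumed values modelled after the usual preempt_count bit layout (preempt depth in the low byte, softirq and hardirq counts above it), and the map/unmap helpers only adjust the counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout, modelled after the kernel's preempt_count fields. */
#define PREEMPT_MASK_MODEL  0x000000ffu
#define SOFTIRQ_MASK_MODEL  0x0000ff00u
#define HARDIRQ_MASK_MODEL  0x0fff0000u

static uint32_t preempt_count_model;

/* An atomic map bumps the count; the matching unmap drops it again. */
static void kmap_atomic_model(void)   { preempt_count_model++; }
static void kunmap_atomic_model(void) { preempt_count_model--; }

/* "In interrupt" means any softirq or hardirq bits are set. */
static bool in_interrupt_model(void)
{
	return (preempt_count_model &
		(SOFTIRQ_MASK_MODEL | HARDIRQ_MASK_MODEL)) != 0;
}
```

One map followed by two unmaps takes the counter from 0 to all ones, which sets every softirq and hardirq bit at once; the next check for interrupt context then fires even though no interrupt is being handled.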


Yes, indeed this patch worked for me; the test completed successfully (on a
preempt kernel). I will continue testing with a non-preempt kernel and
update you if I hit any issue.


Completed testing on non-preempt successfully without any issue.

Thanks,
-Guru


Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-04 Thread Fengguang Wu
On Thu, Oct 04, 2007 at 03:03:44PM +1000, David Chinner wrote:
> On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote:
> > On Wed, Oct 03, 2007 at 12:41:19PM +1000, David Chinner wrote:
> > > On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote:
> > > > On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> > > > > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > > > > > wbc.pages_skipped = 0;
> > > > > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > > > > > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > > > > > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > > > > > /* Wrote less than expected */
> > > > > > -   congestion_wait(WRITE, HZ/10);
> > > > > > -   if (!wbc.encountered_congestion)
> > > > > > +   if (wbc.encountered_congestion || wbc.more_io)
> > > > > > +   congestion_wait(WRITE, HZ/10);
> > > > > > +   else
> > > > > > break;
> > > > > > }
> > > > > 
> > > > > Why do you call congestion_wait() if there is more I/O to issue?  If
> > > > > we have a fast filesystem, this might cause the device queues to
> > > > > fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> > > > > will have trouble keeping the queues full, right?
> > > > 
> > > > You mean slow writers and fast RAID? That would be exactly the case
> > > > these patches try to improve.
> > > 
> > > I mean any writers and a fast block device (raid or otherwise).
> > > 
> > > > This patchset makes kupdate/background writeback more responsible,
> > > > so that if (avg-write-speed < device-capabilities), the dirty data are
> > > > synced timely, and we don't have to go for balance_dirty_pages().
> > > 
> > > Sure, but I'm asking about the effect of the patches on the
> > > (avg-write-speed == device-capabilities) case. I agree that
> > > they are necessary for timely syncing of data but I'm trying
> > > to understand what effect they have on the normal write case
> > > (i.e. keeping the disk at full write throughput).
> > 
> > OK, I guess it is the focus of all your questions: Why should we sleep
> > in congestion_wait() and possibly hurt the write throughput? I'll try
> > to summarize it:
> > 
> > - congestion_wait() is necessary
> > Besides device congestion, there may be other blockades we have to
> > wait on, e.g. temporary page locks, NFS/journal issues (I guess).
> 
> We skip locked pages in writeback, and if some filesystems have
> blocking issues that require non-blocking writeback waits for some
> I/O to complete before re-entering writeback, then perhaps they should be
> setting wbc->encountered_congestion to tell writeback to back off.

We have wbc->pages_skipped for that :-)

> The question I'm asking is that if more_io tells us we have more
> work to do, why do we have to sleep first if the block dev is
> able to take more I/O?

See below.

> > 
> > - congestion_wait() is called only when necessary
> > congestion_wait() will only be called when we see blockades:
> > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > congestion_wait(WRITE, HZ/10);
> > }
> > So in the normal case, it may well write 128MB of data without any waiting.
> 
> Sure, but wbc.more_io doesn't indicate a blockade - just that there
> is more work to do, right?
 
It's not wbc.more_io, but the context (wbc.pages_skipped > 0) that indicates
a blockade:

	if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
		/* all-written or blockade... */
		if (wbc.encountered_congestion || wbc.more_io) /* blockade! */
			congestion_wait(WRITE, HZ/10);
		else /* all-written! */
			break;
	}

We can also read the whole background_writeout() logic as

while (!done) {
/* sync _all_ sync-able data */
congestion_wait(100ms);
}

And an example run could be:

sync 1000MB, skipped 100MB
congestion_wait(100ms);
sync 100MB, skipped 10MB
congestion_wait(100ms);
sync 10MB, all done

Note that it's far from "wait 100ms for every 4MB" (which is merely
the worst possible case).
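
The difference between the example run and the worst case can be put into numbers with a small user-space calculation. This is a rough model only: `struct pass`, `total_wait_ms`, and `worst_case_wait_ms` are hypothetical helpers, each wait is counted as the full 100ms timeout, and the 4MB chunk size is taken from the "wait 100ms for every 4MB" remark.

```c
#include <stddef.h>

/* One pass of the writeout loop in the model. */
struct pass {
	long synced_mb;
	long skipped_mb;
};

/* Wait once after every pass that left data behind; stop when none is. */
static long total_wait_ms(const struct pass *passes, size_t n)
{
	long waited = 0;

	for (size_t i = 0; i < n; i++) {
		if (passes[i].skipped_mb == 0)
			break;		/* all written: done */
		waited += 100;		/* upper bound of one congestion wait */
	}
	return waited;
}

/* Worst case: a full wait after every chunk of chunk_mb megabytes. */
static long worst_case_wait_ms(long total_mb, long chunk_mb)
{
	long passes = (total_mb + chunk_mb - 1) / chunk_mb;

	return passes * 100;
}
```

For the example run (1000MB, 100MB, then 10MB synced) the model waits twice, 200ms in total, whereas waiting after every 4MB of the same 1110MB would add up to roughly 28 seconds.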

> > - congestion_wait() won't hurt write throughput
> > When not congested, congestion_wait() will be woken up on each write
> > completion.
> 
> What happens if the I/O we issued has already completed before we
> got back up to the congestion_wait() call? We'll spend 100ms
> sleeping when we shouldn't have and throughput goes down by 10% on
> every occurrence

Ah, I had not imagined that case. Maybe we could do with

if (wbc.more_io)
congestion_wait(WRITE, 1);

It's at least 10 times better.

> if we've got more work to do, then we should do it without an
> arbitrary, non-deterministic delay being inserted. If the delay is
> needed to prevent the system from "going mad" (whatever that 

Re: [PATCH] RCU torture update for preemption

2007-10-04 Thread Paul E. McKenney
On Wed, Oct 03, 2007 at 04:59:51PM -0400, Steven Rostedt wrote:
> Paul,
> 
> I ran your original preemption test of RCU torture, and after several
> minutes, my preempt boost patch had one Preemption stall.  I then
> disabled preemption boosting, and ran the preempt torture again, and it
> seemed to never stall.  Something seemed strange, so I took a look.
> 
> Looks like you have a single thread that will run at max prio that runs
> for 10 secs and then sleeps again. This thread seems to only push rcu
> readers around. But it doesn't seem to do much else. That is a good test
> to see if RCU readers can handle being pushed around, but it doesn't
> test preemption boosting.

Looks like I shot myself in the foot by complaining about a bug...  :-/

http://lkml.org/lkml/2007/6/10/234

With the bug, the readers weren't migrating, without it, they do.

Good catch!!!  Thank you!!!

> To do that, I modified the test to create CPUS-1 preempt boost hogs (or
> 1 if it is UP). But instead of putting it at max prio, I set it to the
> lowest RT prio of 1. This way it's still at a higher priority than the
> readers. I also switched the writers to run at 1+n where n increases for
> every fake writer there is.
> 
> Without preempt boosting, after a couple of minutes I had 83 preemption
> stalls.  When I turned my boosting back on, after several minutes (still
> running as I type this) it has no preemption stalls.
> 
> This seems to be a good test for RCU preemption boosting.

I am testing it out against my earlier patchset, with some encouraging
results -- I will incorporate into the next round of my mainline patchset.
Some questions and comments below.

> -- Steve
> 
> PS. I got rid of your rcu_preeempt_task for rcu_preempt_tasks ;-)
> 
> (No the above is _not_ a typo)

:-/

> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
> 
> Index: linux-2.6.23-rc9-rt1/kernel/rcutorture.c
> ===
> --- linux-2.6.23-rc9-rt1.orig/kernel/rcutorture.c
> +++ linux-2.6.23-rc9-rt1/kernel/rcutorture.c
> @@ -54,6 +54,7 @@ MODULE_AUTHOR("Paul E. McKenney  
>  static int nreaders = -1;/* # reader threads, defaults to 2*ncpus */
>  static int nfakewriters = 4; /* # fake writer threads */
> +static int npreempthogs = -1;/* # preempt hogs to run (defaults to ncpus-1) or 1 */
>  static int stat_interval;/* Interval between stats, in seconds. */
>   /*  Defaults to "only at end of test". */
>  static int verbose;  /* Print more debug info. */
> @@ -90,9 +91,11 @@ MODULE_PARM_DESC(torture_type, "Type of 
>  static char printk_buf[4096];
> 
>  static int nrealreaders;
> +static int nrealpreempthogs;

I made the above be a module parameter.  This OK?

>  static struct task_struct *writer_task;
>  static struct task_struct **fakewriter_tasks;
>  static struct task_struct **reader_tasks;
> +static struct task_struct **rcu_preempt_tasks;
>  static struct task_struct *stats_task;
>  static struct task_struct *shuffler_task;
> 
> @@ -264,7 +267,6 @@ static void rcu_torture_deferred_free(st
>   call_rcu(&p->rtort_rcu, rcu_torture_cb);
>  }
> 
> -static struct task_struct *rcu_preeempt_task;
>  static unsigned long rcu_torture_preempt_errors;
> 
>  static int rcu_torture_preempt(void *arg)
> @@ -274,7 +276,7 @@ static int rcu_torture_preempt(void *arg
>   time_t gcstart;
>   struct sched_param sp;
> 
> - sp.sched_priority = MAX_RT_PRIO - 1;
> + sp.sched_priority = 1;
>   err = sched_setscheduler(current, SCHED_RR, &sp);
>   if (err != 0)
>   printk(KERN_ALERT "rcu_torture_preempt() priority err: %d\n",
> @@ -297,24 +299,43 @@ static int rcu_torture_preempt(void *arg
>  static long rcu_preempt_start(void)
>  {
>   long retval = 0;
> + int i;
> 
> - rcu_preeempt_task = kthread_run(rcu_torture_preempt, NULL,
> - "rcu_torture_preempt");
> - if (IS_ERR(rcu_preeempt_task)) {
> - VERBOSE_PRINTK_ERRSTRING("Failed to create preempter");
> - retval = PTR_ERR(rcu_preeempt_task);
> - rcu_preeempt_task = NULL;
> + rcu_preempt_tasks = kzalloc(nrealpreempthogs * sizeof(rcu_preempt_tasks[0]),
> + GFP_KERNEL);
> + if (rcu_preempt_tasks == NULL) {
> + VERBOSE_PRINTK_ERRSTRING("out of memory");
> + retval = -ENOMEM;
> + goto out;
>   }
> +
> + for (i=0; i < nrealpreempthogs; i++) {
> + rcu_preempt_tasks[i] = kthread_run(rcu_torture_preempt, NULL,
> + "rcu_torture_preempt");
> + if (IS_ERR(rcu_preempt_tasks[i])) {
> + VERBOSE_PRINTK_ERRSTRING("Failed to create preempter");
> + retval = PTR_ERR(rcu_preempt_tasks[i]);
> + rcu_preempt_tasks[i] = NULL;
> + break;
> + }
> + }
> + out:
>   

Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES

2007-10-04 Thread Mathieu Chouquet-Stringer
On Thu, Oct 04, 2007 at 05:12:11PM -0700, Linus Torvalds wrote:
> I also tested that "ulimit -s" seems to do the right thing for me.
> 
> I'm also assuming Mathieu is running x86 (or x86-64): HP-PA has a stack 
> that grows upwards, and that has traditionally been exciting.

Correct, x86 it is but as I said it's this stupid auditd thing that
breaks the whole process.  I'm gonna file a bug against it.

Thanks for the help though.
-- 
Mathieu Chouquet-Stringer   [EMAIL PROTECTED]
The sun itself sees not till heaven clears.
 -- William Shakespeare --


Re: [patch take 2][Intel-IOMMU] Fix for IOMMU early crash

2007-10-04 Thread Benjamin Herrenschmidt

> Subject: [Intel-IOMMU] Fix for IOMMU early crash
> 
> pci_dev's->sysdata is highly overloaded and currently
> IOMMU is broken due to IOMMU code depending on this field.
> 
> This patch introduces new field in pci_dev's dev.archdata struct to
> hold IOMMU specific per device IOMMU private data.
> 
> Signed-off-by: Anil S Keshavamurthy <[EMAIL PROTECTED]>

Looks good. Won't break powerpc.

Acked-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
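
The idea behind the patch, giving the IOMMU its own per-device slot instead of sharing an overloaded pointer, can be shown with mock structures. This is a user-space sketch: every `_model` type and `find_domain_model` are hypothetical stand-ins that only mirror the field names used in the patch.

```c
#include <stddef.h>

/* Stand-in for the IOMMU's per-device private data. */
struct device_domain_info_model {
	int domain_id;
};

struct dev_archdata_model {
	void *iommu;	/* used by the IOMMU code only */
};

struct device_model {
	struct dev_archdata_model archdata;
};

struct pci_dev_model {
	void *sysdata;	/* shared pointer that other code may also set */
	struct device_model dev;
};

/* Look up the domain through the dedicated field; -1 means none attached. */
static int find_domain_model(const struct pci_dev_model *pdev)
{
	const struct device_domain_info_model *info = pdev->dev.archdata.iommu;

	return info ? info->domain_id : -1;
}
```

Since the lookup never reads `sysdata`, whatever another subsystem stores there cannot be mistaken for IOMMU state in this model.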

> ---
>  drivers/pci/intel-iommu.c   |   22 +++---
>  include/asm-x86_64/device.h |3 +++
>  2 files changed, 14 insertions(+), 11 deletions(-)
> 
> Index: 2.6-mm/drivers/pci/intel-iommu.c
> ===
> --- 2.6-mm.orig/drivers/pci/intel-iommu.c 2007-10-04 11:35:09.0 -0700
> +++ 2.6-mm/drivers/pci/intel-iommu.c  2007-10-04 11:47:47.0 -0700
> @@ -1348,7 +1348,7 @@
>   list_del(&info->link);
>   list_del(&info->global);
>   if (info->dev)
> - info->dev->sysdata = NULL;
> + info->dev->dev.archdata.iommu = NULL;
>   spin_unlock_irqrestore(&device_domain_lock, flags);
>  
>   detach_domain_for_dev(info->domain, info->bus, info->devfn);
> @@ -1361,7 +1361,7 @@
>  
>  /*
>   * find_domain
> - * Note: we use struct pci_dev->sysdata stores the info
> + * Note: we use struct pci_dev->dev.archdata.iommu stores the info
>   */
>  struct dmar_domain *
>  find_domain(struct pci_dev *pdev)
> @@ -1369,7 +1369,7 @@
>   struct device_domain_info *info;
>  
>   /* No lock here, assumes no domain exit in normal case */
> - info = pdev->sysdata;
> + info = pdev->dev.archdata.iommu;
>   if (info)
>   return info->domain;
>   return NULL;
> @@ -1519,7 +1519,7 @@
>   }
>   list_add(&info->link, &domain->devices);
>   list_add(&info->global, &device_domain_list);
> - pdev->sysdata = info;
> + pdev->dev.archdata.iommu = info;
>   spin_unlock_irqrestore(&device_domain_lock, flags);
>   return domain;
>  error:
> @@ -1579,7 +1579,7 @@
>  static inline int iommu_prepare_rmrr_dev(struct dmar_rmrr_unit *rmrr,
>   struct pci_dev *pdev)
>  {
> - if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
> + if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
>   return 0;
>   return iommu_prepare_identity_map(pdev, rmrr->base_address,
>   rmrr->end_address + 1);
> @@ -1595,7 +1595,7 @@
>   int ret;
>  
>   for_each_pci_dev(pdev) {
> - if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO ||
> + if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO ||
>   !IS_GFX_DEVICE(pdev))
>   continue;
>   printk(KERN_INFO "IOMMU: gfx device %s 1-1 mapping\n",
> @@ -1836,7 +1836,7 @@
>   int prot = 0;
>  
>   BUG_ON(dir == DMA_NONE);
> - if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
> + if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
>   return virt_to_bus(addr);
>  
>   domain = get_valid_domain_for_dev(pdev);
> @@ -1900,7 +1900,7 @@
>   unsigned long start_addr;
>   struct iova *iova;
>  
> - if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
> + if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
>   return;
>   domain = find_domain(pdev);
>   BUG_ON(!domain);
> @@ -1974,7 +1974,7 @@
>   size_t size = 0;
>   void *addr;
>  
> - if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
> + if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
>   return;
>  
>   domain = find_domain(pdev);
> @@ -2032,7 +2032,7 @@
>   unsigned long start_addr;
>  
>   BUG_ON(dir == DMA_NONE);
> - if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
> + if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
>   return intel_nontranslate_map_sg(hwdev, sg, nelems, dir);
>  
>   domain = get_valid_domain_for_dev(pdev);
> @@ -2234,7 +2234,7 @@
>   for (i = 0; i < drhd->devices_cnt; i++) {
>   if (!drhd->devices[i])
>   continue;
> - drhd->devices[i]->sysdata = DUMMY_DEVICE_DOMAIN_INFO;
> + drhd->devices[i]->dev.archdata.iommu = DUMMY_DEVICE_DOMAIN_INFO;
>   }
>   }
>  }
> Index: 2.6-mm/include/asm-x86_64/device.h
> ===
> --- 2.6-mm.orig/include/asm-x86_64/device.h   2007-10-04 11:35:09.0 -0700
> +++ 2.6-mm/include/asm-x86_64/device.h2007-10-04 11:49:44.0 -0700
> @@ -10,6 +10,9 @@
>  #ifdef CONFIG_ACPI
>   void*acpi_handle;
>  #endif
> +#ifdef CONFIG_DMAR
> + void *iommu; /* hook for IOMMU specific extension */
> +#endif
>  };
>  
>  #endif /* _ASM_X86_64_DEVICE_H */


Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-04 Thread Balbir Singh
Hugh Dickins wrote:
> On Thu, 4 Oct 2007, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> Well, swap control is another subject.  I guess for that you'll need
>>> to track which cgroup each swap page belongs to (rather more expensive
>>> than the current swap_map of unsigned shorts).  And I doubt it'll be
>>> swap control as such that's required, but control of rss+swap.
>> I see what you mean now, other people have recommended a per cgroup
>> swap file/device.
> 
> Sounds too inflexible, and too many swap areas to me.  Perhaps the
> right answer will fall in between: assign clusters of swap pages to
> different cgroups as needed.  But worry about that some other time.
> 

Yes, depending on the number of cgroups, we'll need to share swap
areas between them. It requires more work and thought process.

>>> But here I'm just worrying about how the existence of swap makes
>>> something of a nonsense of your rss control.
>>>
>> Ideally, pages would not reside for too long in swap cache (unless
> 
> Thinking particularly of those brought in by swapoff or swap readahead:
> some will get attached to mms once accessed, others will simply get
> freed when tasks exit or munmap, others will hang around until they
> reach the bottom of the LRU and are reclaimed again by memory pressure.
> 
> But as your code stands, that'll be total memory pressure: in-cgroup
> memory pressure will tend to miss them, since typically they're
> assigned to the wrong cgroup; until then their presence is liable
> to cause other pages to be reclaimed which ideally should not be.
> 

in-cgroup pressure will not affect them, since they are in different
cgroups. If there is pressure in the cgroup to which they are wrongly
assigned, they would get reclaimed first.

>> I've misunderstood swap cache or there are special cases for tmpfs/
>> ramfs).
> 
> ramfs pages are always in RAM, never go out to swap, no need to
> worry about them in this regard.  But tmpfs pages can indeed go
> out to swap, so whatever we come up with needs to make sense
> with them too, yes.  I don't think its swapoff/readahead issues
> are any harder to handle than the anonymous mapped page case,
> but it will need its own code to handle them.
> 
>> Once pages have been swapped back in, they get assigned
>> back to their respective cgroup's in do_swap_page() (where we charge
>> them back to the cgroup).
>>
> 
> That's where it should happen, yes; but my point is that it very
> often does not.  Because the swap cache page (read in as part of
> the readaround cluster of some other cgroup, or in swapoff by some
> other cgroup) is already assigned to that other cgroup (by the
> mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
> page_cgroup exists and the page has already been accounted" route
> when mem_cgroup_charge is called from do_swap_page.  Doesn't it?
> 

You are right. At this point I am beginning to wonder whether I should
account for the swap cache at all. We account for the pages in RSS
and when the page comes back into the page table(s) via do_swap_page.
If we believe that the swap cache is transitional, and the current
behaviour seems wrong or hard to fix, it might be easiest to leave
unuse_pte() and add/remove_from_swap_cache() out of accounting and
control.

The current behaviour of the memory controller is that, as you point
out, several pages get accounted to the cgroup that initiates swapin
readahead or swapoff. When that cgroup (the one that initiated swapin
or swapoff) comes under pressure, it would discard these pages first.
These pages are discarded from the cgroup, but still live on the
global LRU.

When the original cgroup is under pressure, these pages might not
be affected as they belong to a different cgroup, which might not
be under any sort of pressure.
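
A toy model of the charging path discussed above may help. It is only a
sketch: toy_cgroup, toy_page, toy_charge and toy_uncharge are hypothetical
names, not the memory controller's real structures or API. It shows the
"already accounted" route: whichever cgroup first brings a page into the
swap cache keeps the charge, and a later charge from the faulting cgroup
does nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's mem_cgroup structures. */
struct toy_cgroup {
	const char *name;
	unsigned long usage;		/* pages currently charged */
};

struct toy_page {
	struct toy_cgroup *owner;	/* NULL until the page is charged */
};

/*
 * Charge a page to a cgroup. If the page already has an owner, take the
 * "page has already been accounted" route and leave the charge alone.
 * Returns 1 if a new charge was made, 0 if the page was already charged.
 */
static int toy_charge(struct toy_page *page, struct toy_cgroup *cg)
{
	assert(page != NULL && cg != NULL);
	if (page->owner)
		return 0;
	page->owner = cg;
	cg->usage++;
	return 1;
}

/* Drop the charge, e.g. when the owning cgroup reclaims the page. */
static void toy_uncharge(struct toy_page *page)
{
	assert(page != NULL);
	if (page->owner) {
		page->owner->usage--;
		page->owner = NULL;
	}
}
```

In this model, if cgroup A's readahead charges a page and a task in
cgroup B later faults on it, the second toy_charge() returns 0 and B's
usage stays unchanged until A gives the page up.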

> Are we misunderstanding each other, because I'm assuming
> MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
> though I can't see that _MAPPED and _CACHED are actually supported,
> there being no reference to them outside the enum that defines them.
> 

I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
discussion. The accounting is split into mem_cgroup_charge() and
mem_cgroup_cache_charge(). The control_type is checked when
charging the caches.

> Or are you deceived by that ifdef NUMA code in swapin_readahead,
> which propagates the fantasy that swap allocation follows vma layout?
> That nonsense has been around too long, I'll soon be sending a patch
> to remove it.
> 

The swapin readahead code under #ifdef NUMA is very confusing. I also
noticed another confusing thing during my test: swap cache does not
drop to 0, even though I've disabled all swap using swapoff. Maybe
those are tmpfs pages. The other interesting thing I tried was running
swapoff after a cgroup went over its limit; the swapoff succeeded,
but I see strange numbers for free swap. I'll start another thread
after investigating a bit more.

>> The swap 

Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-04 Thread Kyle Moffett

On Oct 04, 2007, at 21:44:02, Eric W. Biederman wrote:
> What we want from the LSM is the ability to say -EPERM when we can
> clearly articulate that we want to disallow something.


This sort of depends on perspective; typically with security  
infrastructure you actually want "... the ability to return success  
when we can clearly articulate that we want to *ALLOW* something".   
File permissions work this way; we don't have a list of forbidden  
users attached to each file, we have an owner, a group, and a mode  
representing positive permissions.  With that said in certain high- 
risk environments you need something even stronger that cannot be  
changed by the "owner" of the file, if we don't entirely trust them,


> SElinux is not all encompassing or it is generally incomprehensible
> I don't know which.  Or someone long ago would have said a better
> way to implement containers was with a selinux ruleset, here is a
> selinux ruleset that does that.  Although it is completely possible
> to implement all of the isolation with the existing LSM hooks as
> Serge showed.


The difference between SELinux and containers is that SELinux (and  
LSM as a whole) returns -EPERM to operations outside the scope of the  
subject, whereas containers return -ENOENT (because it's not even in  
the same namespace).



> We also have in the kernel another parallel security mechanism (for
> what is generally a different class of operations) that has been
> quite successful, and different groups get along quite well, and
> ordinary mortals can understand it.   The linux firewalling code.


Well, I wouldn't go so far as the "ordinary mortals can understand  
it" part; it's still pretty high on the obtuse-o-meter.



> The linux firewalling code has hooks all throughout the networking
> stack, just like the LSM has hooks all throughout the rest of linux
> kernel.  There is a difference however.  The linux firewalling code
> in addition to hooks has tables behind those hooks that it
> consults. There is generic code to walk those tables and consult
> with different kernel modules to decide if we should drop a
> packet.  Each of those kernel modules provides a different
> capability that can be used to generate a firewall.


This is almost *EXACTLY* what SELinux provides as an LSM module.  The  
one difference is that with SELinux some compromises and restrictions  
have been made so that (theoretically) the resulting policy can be  
exhaustively analyzed to *prove* what it allows and disallows.  It  
may be that SELinux should be split into 2 parts, one that provides  
the underlying table-matching and the other that uses it to provide  
the provability guarantees.  Here's a direct comparison:


netfilter:
  (A) Each packet has src, dst, port, etc that can be matched
  (B) Table of rules applied sequentially (MATCH => ACTION)
  (C) Rules may alter the properties of packets as they are routed/ 
bridged/etc


selinux:
  (A) Each object has user, role, and type that can be matched
  (B) Table of rules searched by object parameters (MATCH => allow/ 
auditallow/transition)
  (C) Rules may alter the properties of objects through transition  
rules.
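
The shape shared by the two lists above (attributes to match, a table of
rules, MATCH => ACTION) can be sketched as a small first-match table walk.
This is only an illustration: the types and names are hypothetical and are
neither netfilter's nor SELinux's data structures, and SELinux looks rules
up by object parameters rather than walking them in order. The default is
deny, in line with the "positive permissions" point made earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_action { TOY_DENY = 0, TOY_ALLOW = 1 };

/* Hypothetical rule: match on a subject label and an object label. */
struct toy_rule {
	const char *subject;	/* NULL matches any subject */
	const char *object;	/* NULL matches any object */
	enum toy_action action;
};

static int toy_field_matches(const char *pattern, const char *value)
{
	return pattern == NULL || strcmp(pattern, value) == 0;
}

/*
 * Walk the table in order and return the action of the first matching
 * rule. If nothing matches, deny: access exists only where a rule
 * explicitly grants it.
 */
static enum toy_action toy_decide(const struct toy_rule *rules, size_t n,
				  const char *subject, const char *object)
{
	size_t i;

	assert(subject != NULL && object != NULL);
	for (i = 0; i < n; i++) {
		if (toy_field_matches(rules[i].subject, subject) &&
		    toy_field_matches(rules[i].object, object))
			return rules[i].action;
	}
	return TOY_DENY;
}
```

Because the first match wins, a specific deny placed before a broad allow
takes precedence, which is the usual ordering concern with sequential
rule tables.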


If there are areas where people are confused about SELinux, think it
may be improved, etc, we would be *GLAD* to hear it.  I'm currently
struggling to find the time between a hundred other things to finish
a script I offered to Casey Schaufler a month and a half ago which
generates an SELinux policy based on a SMACK ruleset.



> So I propose that if people want to work towards a one true linux
> solution for additional security checks, then they should look
> towards the linux firewalling code.  It works and it seems to very
> nicely allow cooperations between different groups.  For the people
> who will scream mixing security models causes problems, the answer
> is simple recommend users don't set up their policies that way.


Actually the one thing which really frustrates me about the Linux  
firewalling code is that you cannot selectively apply various  
transformation phases, they are automatically applied for you.  I  
have had a couple very-transparent-routing-firewalling-bridging  
scenarios where I wished I could run the bridging phase, compare-and- 
change the result, and then run the bridging phase again to forward  
the packet elsewhere.  For example if I was to set up a diverted  
ethernet port I would need to apply the bridging code, compare the  
destination port against the selected diverted port and change the  
MAC address, then reapply the bridging code.  To mirror you would  
also need a phase which could create multiple clones of packets and  
conditionalize rules based on which of the copies it was.



> I'm not yet annoyed enough to go implement an iptables like
> interface to the LSM enhancing it with more generic mechanism to
> make the problem simpler, but I'm getting there.  Perhaps next time
> I'm bored.


I think a fair amount of what we need is already 

Re: SLUB performance regression vs SLAB

2007-10-04 Thread Arjan van de Ven
On Thu, 4 Oct 2007 19:43:58 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> So there could still be page struct contention left if multiple
> processors frequently and simultaneously free to the same slab and
> that slab is not the per cpu slab of a cpu. That could be addressed
> by optimizing the object free handling further to not touch the page
> struct even if we miss the per cpu slab.
> 
> That get_partial* is far up indicates contention on the list lock
> that should be addressable by either increasing the slab size or by
> changing the object free handling to batch in some form.
> 
> This is an SMP system right? 2 cores with 4 cpus each? The main loop
> is always hitting on the same slabs? Which slabs would this be? Am I
> right in thinking that one process allocates objects and then lets
> multiple other processors do work and then the allocated object is
> freed from a cpu that did not allocate the object? If neighboring
> objects in one slab are allocated on one cpu and then are almost
> simultaneously freed from a set of different cpus then this may
> explain the situation.

one of the characteristics of the application in use is the following:
all cores submit IO (which means they allocate various scsi and block
structures on all cpus).. but only 1 will free it (the one the IRQ is
bound to). So it's allocate-on-one-free-on-another at a high rate.

That is assuming this is the IO slab; that's a bit of an assumption
obviously (it's one of the slab things that are hot, but it's a complex
workload, there could be others)


Re: [BUG] kernel BUG at arch/i386/mm/highmem.c:15! on 2.6.23-rc8/rc9

2007-10-04 Thread gurudas pai

Hugh Dickins wrote:

On Thu, 4 Oct 2007, gurudas pai wrote:

Nick Piggin wrote:

While running Oracle database test on x86/6GB RAM machine panics with
following messages.

Hmm, seems like something in sys_remap_file_pages might have broken.
It's a bit hard to work out from the backtrace, though.

Is it possible you can strace to find the arguments for the
remap_file_pages that goes wrong?

Ahh, I think it's just underflowing the preempt count somewhere, which
is leading highmem.c:15 to just *think* it is in an interrupt.

But you aren't running a preemptible kernel, which makes it unusual...
it would have to be coming from interrupt code (or just random corruption).
Still, preempt debugging should catch those cases as well.

So, can you disregard my last message, and instead compile a kernel
with CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT, and see what
messages come up?

With CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT set I got following messages on
rc9.

BUG: using smp_processor_id() in preemptible [0001] code: oracle/3631
caller is kunmap_atomic+0xb/0x82
 [] debug_smp_processor_id+0xa1/0xb4
 [] kunmap_atomic+0xb/0x82
 [] __do_fault+0x55/0x35b
 [] handle_mm_fault+0x4d0/0x909
 [] follow_page+0x1d9/0x228
 [] get_user_pages+0x250/0x332
 [] make_pages_present+0x7b/0x90
 [] sys_remap_file_pages+0x2de/0x330
 [] syscall_call+0x7/0xb
 [] ioctl_standard_call+0x209/0x2ce


Very helpful, thanks.  Guru, please try the appended patch, I think
you'll find it fixes it for you (it did for me, once I'd puzzled out
why I was failing to reproduce the problem - tests on ext3 don't work).
Thank you so much for reporting this just in time!


[PATCH] fix sys_remap_file_pages BUG at highmem.c:15!

Gurudas Pai reports kernel BUG at arch/i386/mm/highmem.c:15! below
sys_remap_file_pages, while running Oracle database test on x86 in 6GB RAM:
kunmap thinks we're in_interrupt because the preempt count has wrapped.

That's because __do_fault expected to unmap page_table, but one of its two
callers do_nonlinear_fault already unmapped it: let do_linear_fault unmap
it first too, and then there's no need to pass the page_table arg down.

Why have we been so slow to notice this?  Probably through forgetting
that the mapping_cap_account_dirty test means that sys_remap_file_pages
nowadays only goes the full nonlinear vma route on a few memory-backed
filesystems like ramfs, tmpfs and hugetlbfs.

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>

--- 2.6.23-rc9/mm/memory.c  2007-07-26 19:49:58.0 +0100
+++ linux/mm/memory.c   2007-10-04 15:42:20.0 +0100
@@ -2307,13 +2307,14 @@ oom:
  * do not need to flush old virtual caches or the TLB.
  *
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
+ * but allow concurrent faults), and pte neither mapped nor locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
 {
+	pte_t *page_table;
 	spinlock_t *ptl;
 	struct page *page;
 	pte_t entry;
@@ -2327,7 +2328,6 @@ static int __do_fault(struct mm_struct *
 	vmf.flags = flags;
 	vmf.page = NULL;
 
-	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
 
 	if (likely(vma->vm_ops->fault)) {
@@ -2468,8 +2468,8 @@ static int do_linear_fault(struct mm_str
 			- vma->vm_start) >> PAGE_CACHE_SHIFT) + vma->vm_pgoff;
 	unsigned int flags = (write_access ? FAULT_FLAG_WRITE : 0);
 
-	return __do_fault(mm, vma, address, page_table, pmd, pgoff,
-							flags, orig_pte);
+	pte_unmap(page_table);
+	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
 
@@ -2552,9 +2552,7 @@ static int do_nonlinear_fault(struct mm_
 	}
 
 	pgoff = pte_to_pgoff(orig_pte);
-
-	return __do_fault(mm, vma, address, page_table, pmd, pgoff,
-				flags, orig_pte);
+	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
 /*
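
The wrap described in the patch description can be modelled in a few
lines of userspace C. This is a sketch under stated assumptions, not
kernel code: the toy_* names are made up, and the mask widths are only
illustrative of a counter whose low bits track preempt-disable nesting
and whose higher bits track softirq/hardirq nesting. One unmap too many
takes the counter from 0 to all-ones, and the "in interrupt" test then
fires.

```c
#include <assert.h>

/*
 * Illustrative layout only: low bits count preempt-disable nesting,
 * higher bits count softirq and hardirq nesting. The exact widths are
 * an assumption for this sketch.
 */
#define TOY_SOFTIRQ_MASK 0x0000ff00u
#define TOY_HARDIRQ_MASK 0x0fff0000u

static unsigned int toy_preempt_count;

/* Stand-in for an atomic map: it disables preemption... */
static void toy_map_atomic(void)
{
	toy_preempt_count++;
}

/* ...and the unmap re-enables it. Nothing here checks for imbalance. */
static void toy_unmap_atomic(void)
{
	toy_preempt_count--;
}

/* Nonzero when the counter claims softirq or hardirq context. */
static int toy_in_interrupt(void)
{
	return (toy_preempt_count &
		(TOY_SOFTIRQ_MASK | TOY_HARDIRQ_MASK)) != 0;
}

/* Reset helper so the model can be exercised more than once. */
static void toy_reset(void)
{
	toy_preempt_count = 0;
	assert(!toy_in_interrupt());
}
```

A balanced map/unmap pair leaves the counter at 0; calling
toy_unmap_atomic() a second time, as the doubled pte_unmap effectively
did, wraps the unsigned counter and sets every interrupt bit.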


Yes, indeed this patch worked for me; the test completed successfully
(on a preempt kernel). I will continue testing with a non-preempt kernel
and update you if I hit any issue.


Thank you all for your time and effort.

Regards,
-Guru



Re: [PATCH] isofs: add +w bit for non-RR discs

2007-10-04 Thread Matthew Wilcox
On Tue, Oct 02, 2007 at 08:00:26PM +0200, Jan Engelhardt wrote:
> Add %S_IWUGO bit for files on ISO-9660 filesystems without RockRidge

Looks to me like you've added S_IWUSR, not S_IWUGO.

> - popt->mode = S_IRUGO | S_IXUGO; /*
> + popt->mode = S_IRUGO | S_IWUSR | S_IXUGO;
> - inode->i_mode = S_IRUGO | S_IXUGO | S_IFDIR;
> + inode->i_mode = S_IRUGO | S_IWUSR | S_IXUGO | S_IFDIR;
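
To make the difference concrete, here is a small userspace sketch. The
kernel's S_IRUGO/S_IWUGO/S_IXUGO macros are not available in userspace
headers, so they are spelled out under TOY_ names; the two helper
functions are hypothetical and exist only for the comparison.

```c
#include <assert.h>
#include <sys/stat.h>

/*
 * The kernel defines S_IRUGO, S_IWUGO and S_IXUGO as the
 * user|group|other combinations; userspace headers do not, so they
 * are spelled out here.
 */
#define TOY_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#define TOY_S_IWUGO (S_IWUSR | S_IWGRP | S_IWOTH)
#define TOY_S_IXUGO (S_IXUSR | S_IXGRP | S_IXOTH)

/* Mode as written in the quoted patch: write bit for the owner only. */
static unsigned int mode_with_iwusr(void)
{
	return TOY_S_IRUGO | S_IWUSR | TOY_S_IXUGO;
}

/* Mode the patch description implies: write bit for everyone. */
static unsigned int mode_with_iwugo(void)
{
	unsigned int mode = TOY_S_IRUGO | TOY_S_IWUGO | TOY_S_IXUGO;

	assert((mode & S_IWUSR) != 0);
	return mode;
}
```

With the usual permission bit values the first helper yields 0755 and
the second 0777, so the description and the code disagree about group
and other write access.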

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

2007-10-04 Thread Andrew Morton
On Thu, 04 Oct 2007 18:43:32 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> 
wrote:

> David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add
> ptep_test_and_clear_{dirty,young}" has introduced an SMP race which
> affects the Xen pv-ops backend.

y'know, I think it's been several years since I saw a report of an
honest to goodness, genuine SMP race in core kernel.  We used to be
infested by them, but the term has fallen into disuse.  Interesting, but
OT.

> It seems to me that there are a few ways to fix this:
> 
>1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled.  This
>   will clearly work, but is pretty blunt.
>2. Make test_and_clear_pte_flags a new paravirt-op, which can be
>   implemented in Xen as a hypercall, and as a raw test_and_clear_bit
>   for everyone else.  The downside is adding yet another pv-op.
>3. Restructure the pagetable setup code so that the mm is not added
>   to the prio tree until after arch_dup_mmap has been called (and
>   the converse for exit_mmap).  This is arguably cleaner, but I
>   haven't looked to see how much trouble this would be.
> 
> Thoughts anyone?  Does making the pagetables visible "early" cause
> problems for anyone else?

I expect that 2) has the maximum niceness*suitable-for-2.6.23 product.

That's if you actually care much about kernel.org major releases - do many
people run kernel.org kernels on Xen?  If "not many" then we could perhaps
do something more elaborate for 2.6.23.1.  But adding ever more pvops as
core kernel evolves was always expected.


Re: SLUB performance regression vs SLAB

2007-10-04 Thread Christoph Lameter
I just spent some time looking at the functions that you see high in the 
list. The trouble is that I have to speculate and that I have nothing to 
verify my thoughts. If you could give me the hitlist for each of the 
3 runs then this would help to check my thinking. I could be totally off 
here.

It seems that we miss the per cpu slab frequently on slab_free() which 
leads to the calling of __slab_free() and which in turn needs to take a 
lock on the page (in the page struct). Typically the page lock is 
uncontended which seems to not be the case here otherwise it would not be 
that high up.

The per cpu patch in mm should reduce the contention on the page struct by 
not touching the page struct on alloc and on free. Does not seem to work 
all the way though. slab_free() still has to touch the page struct if the 
free is not to the currently active cpu slab.

So there could still be page struct contention left if multiple processors 
frequently and simultaneously free to the same slab and that slab is not 
the per cpu slab of a cpu. That could be addressed by optimizing the 
object free handling further to not touch the page struct even if we miss 
the per cpu slab.

That get_partial* is far up indicates contention on the list lock that 
should be addressable by either increasing the slab size or by changing 
the object free handling to batch in some form.

This is an SMP system right? 2 cores with 4 cpus each? The main loop is 
always hitting on the same slabs? Which slabs would this be? Am I right in 
thinking that one process allocates objects and then lets multiple other 
processors do work and then the allocated object is freed from a cpu that 
did not allocate the object? If neighboring objects in one slab are 
allocated on one cpu and then are almost simultaneously freed from a set 
of different cpus then this may explain the situation.


Re: [kbuild-devel] A bit of kconfig rewrite (Re: [PATCH] 9p: fix compile error if !CONFIG_SYSCTL)

2007-10-04 Thread Roman Zippel
Hi,

On Mon, 1 Oct 2007, Oleg Verych wrote:

> Today's kconfig was proposed and accepted in a very unpleasant
> circumstances, has very poor design, development and no working
> alternative (for 5+ years now).

If you want to make such statements, you have to offer a little more than 
the hot air you're producing right now...
If you want to improve the design, you're more than welcome. I'm the first 
one to admit that there's still lots of room for improvement, but if you 
want to claim this can only be done via a rewrite, then you have to be 
a lot more specific about what's wrong with the current design and why it's 
unfixable.
Quite some thought has been put into this design and if you were a little 
more specific, I could actually tell you why it is this way and maybe how 
to improve it incrementally instead of trying to reinvent everything.

>   + shell-like[0] (not like CML1, which is just shell) scripting, allowing
> to extend easily (if there is no one available) capabilities,
> config values or actions for particular sub-system or compilation
> unit,

Just to pick this one point as example: I like scripting and maybe I 
should just update the swig wrapper script I already have and merge it, 
which would make it easier to play with the kconfig database in whatever 
language you like.
OTOH due to the necessary build dependencies I don't see this becoming a 
mandatory feature, so unless there is a compelling reason a certain set of 
base functions will remain in C.

bye, Roman


Re: [PATCH 2.6.23-rc4] qconf ("make xconfig") Search Dialog Enhancement (rev8)

2007-10-04 Thread Roman Zippel
Hi,

On Thu, 20 Sep 2007, Shlomi Fish wrote:

> Which specific problems do you see with the coding style of the patch? Can
> you comment on it?

Mostly whitespace around any braces, please keep it close to the other 
source.

> > I would also prefer to move more of the search functionality into the
> > generic code, so it can be used by other front ends as well, otherwise a
> > lot of this had to be duplicated.
> 
> That would be a good idea, but I cannot use Qt there, which makes my job 
> harder.

Where is the problem with implementing it in C? Just try to keep it 
simple at first.

> > I think a filter function makes it maybe a bit too flexible, if a front
> > end wants to do some weird filtering, it can still access the symbol
> > data base directly. 
> 
> A filter function would still be convenient in this context, as the symbol 
> data base API may change, and the filter function has a little logic in it.

This API is not really fixed at the moment, so it's not really a problem.

> > So what I have in mind is something like this: 
> >
> > struct symbol **sym_generic_search(const char *pattern, unsigned int
> > flags);
> >
> > This means the back end provides a basic search facility for the most
> > common search operations. The flags would specify what to search (e.g.
> > symbol name, help text, prompts) and how to do it.
> 
> I suggest we don't call it sym_generic_search, as generic implies it is a 
> generic filter. We can call it "sym_string_search" or whatever. Then, I 
> suggest we have separate arguments for every parameter (i.e: search type, 
> case sensitivity, what to search, etc.).

I don't care much about the name, but please keep it as a simple flag, 
which is a lot easier to extend.
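
As a rough sketch of what a single flags argument buys: the flag names,
the toy_symbol struct and toy_sym_search() below are all hypothetical,
and unlike the proposed prototype this version just counts matches
instead of returning a symbol array, to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits; the real kconfig API may differ. */
#define TOY_SEARCH_NAME   (1u << 0)	/* match against the symbol name */
#define TOY_SEARCH_PROMPT (1u << 1)	/* match against the prompt text */
#define TOY_SEARCH_HELP   (1u << 2)	/* match against the help text */

struct toy_symbol {
	const char *name;
	const char *prompt;
	const char *help;
};

static int toy_contains(const char *text, const char *pattern)
{
	return text != NULL && strstr(text, pattern) != NULL;
}

/*
 * Count the symbols that contain `pattern` in any field selected by
 * `flags`. A later option (say, case-insensitive matching) would be
 * one more bit and would not change this signature.
 */
static size_t toy_sym_search(const struct toy_symbol *syms, size_t n,
			     const char *pattern, unsigned int flags)
{
	size_t i, hits = 0;

	assert(pattern != NULL);
	for (i = 0; i < n; i++) {
		if (((flags & TOY_SEARCH_NAME) &&
		     toy_contains(syms[i].name, pattern)) ||
		    ((flags & TOY_SEARCH_PROMPT) &&
		     toy_contains(syms[i].prompt, pattern)) ||
		    ((flags & TOY_SEARCH_HELP) &&
		     toy_contains(syms[i].help, pattern)))
			hits++;
	}
	return hits;
}
```

A front end would call it as, for example,
toy_sym_search(syms, n, "sysctl", TOY_SEARCH_NAME | TOY_SEARCH_HELP);
adding a new search mode later means adding a bit, not touching every
caller.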

bye, Roman



Re: [RFC][PATCH] New message-logging API (kprint)

2007-10-04 Thread Rob Landley
On Thursday 04 October 2007 3:17:03 pm Randy Dunlap wrote:
> On Thu, 04 Oct 2007 22:04:07 +0200 Vegard Nossum wrote:
> > Description: This patch largely implements the kprint API as previously
> > posted to the LKML and described in Documentation/kprint.txt (see patch).
> >
> > The main purpose of this change is provide a unified logging API to the
> > kernel and at the same time make it easy to add extensions, now and
> > later.
> >
> > My changes and additions are as follows:
>
> $ diffstat -p1 -w70 kprint.patch
...
>  40 files changed, 1660 insertions(+), 72 deletions(-)

I started this thread by posting an idea I had for shrinking the kernel by 
allowing more code to be configured out.  The API change was exactly one new 
parameter, with a direct 1->1 mapping from the old API to the new one, which 
was trivial to convert and which the compiler would catch if you missed one.
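
For readers who missed the start of the thread, the kind of change meant
here can be sketched roughly as follows. This is one possible reading,
not the posted patch: TOY_MAX_LOGLEVEL, toy_emit() and toy_printk() are
made-up names, and the example runs in userspace.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical build-time threshold; messages above it compile away. */
#ifndef TOY_MAX_LOGLEVEL
#define TOY_MAX_LOGLEVEL 4
#endif

static int toy_emitted;	/* number of messages actually printed */

static void toy_emit(int level, const char *msg)
{
	assert(msg != NULL);
	toy_emitted++;
	printf("<%d>%s\n", level, msg);
}

/*
 * The level is a separate, constant first argument. When it exceeds
 * the threshold the condition is constant false at compile time, so
 * an optimizing compiler can drop the call and its string.
 */
#define toy_printk(level, msg)				\
	do {						\
		if ((level) <= TOY_MAX_LOGLEVEL)	\
			toy_emit((level), (msg));	\
	} while (0)
```

With the default threshold of 4, toy_printk(3, "...") prints and
toy_printk(7, "...") does nothing; a caller that forgets the level
argument fails to compile, which is the "compiler would catch it"
property mentioned above.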

The result of the discussion is a patch adding 1600 lines to the kernel, 
without removing anything.

Last I checked, the current printk() worked just fine.  Why is this _not_ the 
dreaded "infrastructure in search of a use"?  What exactly can we _not_ do 
with the current code?  What does this allow us to remove and simplify?

I'm confused about what people are trying to accomplish here...

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.


Re: Vague maybe ppp-related panic report for 2.6.23-rc9

2007-10-04 Thread Roland Dreier
Just as a quick update -- I seem to only be able to reproduce this
crash when my ppp session drops, which seems associated with marginal
signal.  And unfortunately I have great coverage at home so I haven't
been able to reproduce this again today.  Maybe on the train tomorrow
I can crash my laptop...

 - R.


Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

2007-10-04 Thread Rik van Riel
On Thu, 04 Oct 2007 18:43:32 -0700
Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> It seems to me that there are a few ways to fix this:
> 
>    1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled.  This
>       will clearly work, but is pretty blunt.
>    2. Make test_and_clear_pte_flags a new paravirt-op, which can be
>       implemented in Xen as a hypercall, and as a raw
>       test_and_clear_bit for everyone else.  The downside is adding yet
>       another pv-op.

Either of these two would work.  Another alternative could be to
let test_and_clear_pte_flags have an exception table entry, where
we jump right to the next instruction if the instruction clearing
the flag fails.

That is essentially the variant you need for Xen, except the fast
path is still exactly the same as it is when running on native
hardware.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-04 Thread Eric W. Biederman
Linus Torvalds <[EMAIL PROTECTED]> writes:

> To get back to security: I didn't want pluggable security because I 
> thought that was a technically good solution. No, the reason Linux has LSM 
> (and yes, I was the one who pushed hard for the whole thing, even if I 
> didn't actually write any of it) was because the problem wasn't technical 
> to begin with.
>
> It was social/political and administrative.
>
> See? Another fundamental difference between schedulers and security 
> modules. 

>
> But no, that's not really why we have LSM. I'd have *much* preferred to 
> have one unified security module setup that we could all agree on, and no 
> pluggable security modules. It was not to be - and the reason we have LSM 
> is not because "it makes more sense than a CPU scheduler", but simply 
> because "people didn't actually get anything done at all, because they 
> just argued about what to do".
>
> In the CPU schedulers, Ingo still gets work done, even though people argue 
> about it. So we haven't needed to go to the extreme of an "LSM for CPU 
> schedulers", because the arguments don't actually hold up the work.
>
> And THAT is what matters in the end.

Sounds good.

I want to inject some fresh ideas into this discussion from a completely
different viewpoint; who knows, I might get lucky and make things
better.

All you can do with the LSM is return -EPERM when the normal unix
permissions would not have allowed an operation.  I don't see where
there is any magic or mystery in that, or any need for deep
understanding.


What we want from the LSM is the ability to say -EPERM when we can
clearly articulate that we want to disallow something.


SELinux is either not all-encompassing or generally incomprehensible, I
don't know which.  Otherwise someone long ago would have said that a
better way to implement containers is with an SELinux ruleset, and here
is a ruleset that does that.  Yet it is completely possible to
implement all of the isolation with the existing LSM hooks, as Serge
showed.

It is a legitimate criticism of the LSM that we are not improving our
in-kernel abstractions to allow better concepts on which to base decisions
about when to return -EPERM.  My first dealing with selinux and the lsm
was when I fixed a security issue in /proc: I fixed the abstractions we
were using, and the default selinux security policy had a fit.  If we
don't have good concepts in /proc/pid/xxx, which is heavily used, it
would not surprise me at all if there are lots of other places in the
kernel where our abstractions have holes that have not yet been shored up.

We also have in the kernel another parallel security mechanism (for
what is generally a different class of operations) that has been quite
successful, and different groups get along quite well, and ordinary
mortals can understand it.   The linux firewalling code.


The linux firewalling code has hooks all throughout the networking
stack, just like the LSM has hooks all throughout the rest of linux
kernel.  There is a difference however.  The linux firewalling code in
addition to hooks has tables behind those hooks that it consults.
There is generic code to walk those tables and consult with different
kernel modules to decide if we should drop a packet.  Each of those
kernel modules provides a different capability that can be used to
generate a firewall.

Meanwhile composition of a policy using code from different clients
of the LSM hooks is impossible, and thus cooperation or wider use of
the LSM hooks is difficult.
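
To make the table idea concrete, here is a minimal userspace sketch in C.
It is not kernel code and not an existing LSM interface: every name in it
(struct rule, match_uid, match_path_prefix, walk_table, check_access) is
hypothetical.  It only assumes the shape described above: generic code
walks a table of rules, each rule delegates the match to a small module,
and the hook turns a deny verdict into -EPERM.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical verdicts, loosely modelled on firewall targets. */
enum verdict { VERDICT_CONTINUE, VERDICT_ALLOW, VERDICT_DENY };

/* Hypothetical description of the operation a hook is asked about. */
struct access_req {
    unsigned int uid;
    const char *path;
};

/* One "match module": a predicate over the request plus its argument. */
typedef bool (*match_fn)(const struct access_req *req, const void *arg);

struct rule {
    match_fn match;      /* which module decides whether the rule applies */
    const void *arg;     /* module-specific argument */
    enum verdict target; /* what to do when the rule matches */
};

static bool match_uid(const struct access_req *req, const void *arg)
{
    return req->uid == *(const unsigned int *)arg;
}

static bool match_path_prefix(const struct access_req *req, const void *arg)
{
    const char *prefix = arg;
    return strncmp(req->path, prefix, strlen(prefix)) == 0;
}

/* Generic walker: the first matching rule with a final verdict wins,
 * otherwise the table's default policy applies. */
static enum verdict walk_table(const struct rule *rules, size_t n,
                               const struct access_req *req,
                               enum verdict policy)
{
    for (size_t i = 0; i < n; i++) {
        if (rules[i].match(req, rules[i].arg) &&
            rules[i].target != VERDICT_CONTINUE)
            return rules[i].target;
    }
    return policy;
}

/* What a hook would return: 0 to let the operation go on, -EPERM to refuse. */
static int check_access(const struct rule *rules, size_t n,
                        const struct access_req *req)
{
    return walk_table(rules, n, req, VERDICT_ALLOW) == VERDICT_DENY
               ? -EPERM : 0;
}
```

With a two-rule table that first allows uid 0 and then denies anything
under /proc/, check_access() returns 0 for root, -EPERM for uid 1000 on
/proc/1/mem, and 0 for uid 1000 on an unrelated path, because the default
policy in this sketch is allow.  Rules from different modules compose
simply by sitting in the same table.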

So I propose that if people want to work towards a one true linux
solution for additional security checks, then they should look towards
the linux firewalling code.  It works and it seems to very nicely
allow cooperation between different groups.  For the people who will
scream that mixing security models causes problems, the answer is simple:
recommend that users don't set up their policies that way.


I know we can't solve human problems with technical measures but
perhaps a technical suggestion can open the way to the solution to
some human problems.

I'm not yet annoyed enough to go implement an iptables-like interface
to the LSM, enhancing it with a more generic mechanism to make the
problem simpler, but I'm getting there.  Perhaps next time I'm bored.

Eric


race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

2007-10-04 Thread Jeremy Fitzhardinge
David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add
ptep_test_and_clear_{dirty,young}" has introduced an SMP race which
affects the Xen pv-ops backend.

In Xen, pagetables are normally kept RO so that the hypervisor can
mediate all updates to them.  If Xen sees a write to an active
(currently pointed to by cr3) or pinned (currently inactive but
registered) pagetable, it will trap the write fault and emulate the
instruction making the update; this means that most pagetable-modifying
code doesn't need to know or care that pagetables are RO.

When a pagetable is first created (either in execve or fork), the
Xen paravirt backend pins the pagetable, and conversely, on exit it is
unpinned; this is done via the arch_dup_mmap() and activate_mm() hooks. 
Pinning is done in two phases: first the pagetable pages are marked RO,
and then the pagetable is registered with Xen; unpinning is the
opposite.  This works assuming that the pagetable is not in use, and not
yet visible to other cpus.

The race on pagetable creation is this: in kernel/fork.c:dup_mmap(), it
copies the old pagetable into the new one, and registers each vma with
the rmap prio tree.  Once everything is copied, it calls
arch_dup_mmap(), which ends up doing the Xen pagetable pin.  However,
because the pagetable is visible to other cpus via the prio tree,
pagetable modifications (specifically, clearing the access bit) can race
with pinning.  If the modification hits after the pagetable pages have
been made RO but before they're registered with Xen, the write to the
flags will fault, and Xen won't know to do the fixup.

The converse is also true in exit_mmap(): arch_exit_mmap is called
before removing the vmas from the prio tree, so it can race with unpinning.
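
The window can be written down as a tiny state model.  This is only an
illustrative C sketch of the description above, not Xen or kernel code;
the enum and function names are hypothetical.

```c
/* Hypothetical states of a pagetable during the two-phase pin. */
enum pt_state {
    PT_UNPINNED,        /* pages writable, Xen not involved */
    PT_RO_UNREGISTERED, /* pages already RO, not yet registered: the window */
    PT_PINNED,          /* pages RO and registered with Xen */
};

enum write_result {
    WRITE_OK,       /* the store simply succeeds */
    WRITE_EMULATED, /* the store faults and Xen emulates it */
    WRITE_FAULT,    /* the store faults and nobody fixes it up */
};

/* Next step of pinning: mark RO first, then register with Xen. */
static enum pt_state pin_next(enum pt_state s)
{
    return s == PT_UNPINNED ? PT_RO_UNREGISTERED : PT_PINNED;
}

/* A direct store to a pte, e.g. clearing the accessed bit in place. */
static enum write_result direct_pte_write(enum pt_state s)
{
    switch (s) {
    case PT_UNPINNED:
        return WRITE_OK;
    case PT_PINNED:
        return WRITE_EMULATED;
    default:
        return WRITE_FAULT; /* RO but unknown to Xen */
    }
}

/* An update made through a hypercall does not depend on the page state. */
static enum write_result hypercall_pte_write(enum pt_state s)
{
    (void)s;
    return WRITE_OK;
}
```

Only the middle state is a problem for direct writes, and it is reachable
by another cpu exactly because the pagetable is already in the prio tree
while pin_next() is being stepped through.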

The specific oops I'm seeing is this:

BUG: unable to handle kernel paging request at virtual address c5b023e8
 printing eip:
c016d3f2
*pdpt = 4bc1a001
Oops: 0003 [#1]
PREEMPT SMP 
Modules linked in:
CPU:1
EIP:0061:[]Not tainted VLI
EFLAGS: 00010202   (2.6.23-rc9-paravirt #1656)
EIP is at page_referenced_one+0xb8/0x12a
eax: c0401b17   ebx: c5b023e8   ecx: c2398000   edx: c044ceca
esi: 0087d000   edi: c5660688   ebp: c2399af4   esp: c2399acc
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0069
Process cc1 (pid: 31474, ti=c2398000 task=c2dc9000 task.ti=c2398000)
Stack: c04014a7 c040f47a 011e c03697fe c2399b1c c5eb4500 c113e87c c116b1c8 
   c5660688 c13aa890 c2399b2c c016d4d8  c7917340 0008  
    c13aa8c4   0005 c116b1c8 0001 c0473940 
Call Trace:
 [] show_trace_log_lvl+0x1a/0x2f
 [] show_stack_log_lvl+0x9d/0xa5
 [] show_registers+0x1f7/0x336
 [] die+0x11b/0x23b
 [] do_page_fault+0x758/0x838
 [] error_code+0x72/0x78
 [] page_referenced_file+0x74/0xa0
 [] page_referenced+0xbd/0xd0
 [] shrink_active_list+0x170/0x3a3
 [] shrink_zone+0xb9/0xf8
 [] try_to_free_pages+0x13c/0x208
 [] __alloc_pages+0x197/0x290
 [] __do_page_cache_readahead+0xd4/0x1d7
 [] do_page_cache_readahead+0x4b/0x56
 [] filemap_fault+0x1b7/0x3de
 [] __do_fault+0x79/0x407
 [] handle_mm_fault+0x27e/0xca0
 [] do_page_fault+0x391/0x838
 [] error_code+0x72/0x78
 ===
Code: 0c fe 97 36 c0 c7 44 24 08 1e 01 00 00 c7 44 24 04 7a f4 40 c0 c7 04 24 
a7 14 40 c0 e8 d4 e5 fb ff e8 29 c9 f9 ff f6 03 20 74 27  0f ba 33 05 19 c0 
85 c0 74 1c 8b 07 89 f2 89 d9 8d b6 00 00 
EIP: [] page_referenced_one+0xb8/0x12a SS:ESP 0069:c2399acc


It all worked OK before David's change, because asm-generic/pgtable.h
uses set_pte_at(), which ends up making a hypercall to update the
pagetable, which always works regardless of the state of the pagetable
pages.


It seems to me that there are a few ways to fix this:

   1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled.  This
      will clearly work, but is pretty blunt.
   2. Make test_and_clear_pte_flags a new paravirt-op, which can be
      implemented in Xen as a hypercall, and as a raw test_and_clear_bit
      for everyone else.  The downside is adding yet another pv-op.
   3. Restructure the pagetable setup code so that the mm is not added
      to the prio tree until after arch_dup_mmap has been called (and
      the converse for exit_mmap).  This is arguably cleaner, but I
      haven't looked to see how much trouble this would be.

Thoughts anyone?  Does making the pagetables visible "early" cause
problems for anyone else?

Thanks,
J


Re: [NFS] What's slated for inclusion in 2.6.24-rc1 from the NFS client git tree...

2007-10-04 Thread Trond Myklebust
On Thu, 2007-10-04 at 12:59 -0700, Andrew Morton wrote:
> On Thu, 04 Oct 2007 15:16:03 -0400
> Trond Myklebust <[EMAIL PROTECTED]> wrote:
> 
> > > > 
> > > > That would be perfect. It can even be in non-legacy mode by default,
> > > > just as long as you can go back to the old behaviour when/if you run
> > > > into a non-LFS application.
> > > > 
> > > 
> > > Wouldn't a mount option be better?
> > 
> > I suppose that might be OK if you know that the 32-bit legacy
> > applications will only touch one or two servers, but that sounds like a
> > niche thing.
> > 
> > On the downside, forcing all those people who have portable 64-bit aware
> > applications to upgrade their version of mount just in order to have
> > stat64() work correctly seems unnecessarily complicated. I'd prefer not
> > to have to do that unless someone comes up with a good reason why we
> > must.
> 
> Confused.  You don't need to modify mount(8) when adding a new mount option?

Prior to 2.6.22, the 'mount' program used a binary blob for passing the
NFS mount options to the kernel.
It is only very recently that we have started doing in-kernel parsing of
text strings, and in order to make use of that, people will need to
upgrade to the latest version of nfs-utils.

Trond



Re: [PATCH 1/3] signal(i386): alternative signal stack wraparound occurs

2007-10-04 Thread Shi Weihua

Mikael Pettersson wrote:
> On Thu, 4 Oct 2007 21:47:30 +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 04 Oct 2007 21:33:12 +0900
> > Shi Weihua <[EMAIL PROTECTED]> wrote:
> >
> > > KAMEZAWA Hiroyuki wrote:
> > > > On Thu, 04 Oct 2007 20:56:14 +0900
> > > > Shi Weihua <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > stack.ss_sp = addr + pagesize;
> > > > > stack.ss_flags = 0;
> > > > > stack.ss_size = pagesize;
> > > > This is bad.
> > > > stack.ss_sp = addr;
> > > > stack.ss_flags = 0;
> > > > stack.ss_size = pagesize * 2;
> > >
> > > [What the test code wants to do]
> > > addr+pagesize*2 - addr+pagesize  -> sigaltstack
> > > addr+pagesize   - addr           -> protected region
> > > The code wants to catch the overflow when esp enters the protected region.
> >
> > You have to protect the top of the *registered* sigaltstack.
> > The reason for the wraparound is that %esp will be set to the bottom of the
> > sigaltstack if it is not within the sigaltstack area when signaled.
> > What you have to do is protect the top of the registered sigaltstack.
> > If %esp is in the range of the registered sigaltstack at SEGV, the
> > wraparound will stop.
>
> Exactly right. You mprotect or munmap the end of the altstack,
> not the area beyond it.

So we tell users "Even if you protect only half of the mmap'd space, you
must still register all of it with the kernel"?


A picture of my test code's results:

               No patch         Patched
┌───────────┐
│           │← 1 ┌ ← 3          ← 1
│A          │    │(wraparound)
│           │    │
│           │← 2 │              ← 2
│           │    │
├───────────┤    │
│▒▒▒▒▒▒▒▒▒▒▒│← 3 ┘              ← 3
│B▒▒▒▒▒▒▒▒▒▒│                   (caught)
│▒protected▒│
│▒▒▒▒▒▒▒▒▒▒▒│
│▒▒▒▒▒▒▒▒▒▒▒│
└───────────┘
A+B  mmap's space
A    sigaltstack
B    protected
I agree that if we register A+B with the kernel, the wraparound will stop.
But if we register only A with the kernel, why shouldn't the kernel do something?
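
For reference, the "register A+B" layout can be sketched in C as follows.
This is a minimal illustration under stated assumptions, not the test
program from this thread: the helper name register_guarded_altstack and
its usable_pages parameter are hypothetical, and the region is one guard
page (B) plus a caller-chosen number of usable pages (A).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map (1 + usable_pages) pages, make the lowest page inaccessible, and
 * register the WHOLE mapping, guard page included, as the alternate
 * signal stack.  Returns 0 on success and -1 on failure; on success
 * *out holds exactly what was passed to sigaltstack().
 */
static int register_guarded_altstack(size_t usable_pages, stack_t *out)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    if (pagesize <= 0 || usable_pages == 0)
        return -1;

    size_t total = (usable_pages + 1) * (size_t)pagesize;
    void *addr = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return -1;

    /* The stack grows down, so the overflow end is the lowest page. */
    if (mprotect(addr, (size_t)pagesize, PROT_NONE) != 0) {
        munmap(addr, total);
        return -1;
    }

    stack_t ss;
    ss.ss_sp = addr;    /* start of B, not start of A */
    ss.ss_flags = 0;
    ss.ss_size = total; /* A + B */
    if (sigaltstack(&ss, NULL) != 0) {
        munmap(addr, total);
        return -1;
    }

    *out = ss;
    return 0;
}
```

Because the protected page lies inside the registered range, a fault
taken there still has %esp within the registered sigaltstack, which is
the condition KAMEZAWA describes for the wraparound to stop.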

Thanks
Shi Weihua


/Mikael







Re: [PATCH] remove throttle_vm_writeout()

2007-10-04 Thread Andrew Morton
On Fri, 05 Oct 2007 02:12:30 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > 
> > I don't think I understand that.  Sure, it _shouldn't_ be a problem.  But it
> > _is_.  That's what we're trying to fix, isn't it?
> 
> > The problem, I believe, is in the memory allocation code, not in fuse.

fuse is trying to do something which page reclaim was not designed for. 
Stuff broke.

> In the example, memory allocation may be blocking indefinitely,
> because we have 4MB under writeback, even though 28MB can still be
> made available.  And that _should_ be fixable.

Well yes.  But we need to work out how, without re-breaking the thing which
throttle_vm_writeout() fixed.

> > > So the only thing the kernel should be careful about, is not to block
> > > on an allocation if not strictly necessary.
> > > 
> > > Actually a trivial fix for this problem could be to just tweak the
> > > thresholds, so as to make the above scenario impossible.  Although I'm
> > > still not convinced this patch is perfect, because the dirty
> > > threshold can actually change in time...
> > > 
> > > Index: linux/mm/page-writeback.c
> > > ===
> > > --- linux.orig/mm/page-writeback.c  2007-10-05 00:31:01.0 +0200
> > > +++ linux/mm/page-writeback.c   2007-10-05 00:50:11.0 +0200
> > > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask
> > >  for ( ; ; ) {
> > > get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> > > 
> > > +   /*
> > > +* Make sure the threshold is over the hard limit of
> > > +* dirty_thresh + ratelimit_pages * nr_cpus
> > > +*/
> > > +   dirty_thresh += ratelimit_pages * num_online_cpus();
> > > +
> > >  /*
> > >   * Boost the allowable dirty threshold a bit for page
> > >   * allocators so they don't get DoS'ed by heavy writers
> > 
> > I can probably kind of guess what you're trying to do here.  But if
> > ratelimit_pages * num_online_cpus() exceeds the size of the offending zone
> > then things might go bad.
> 
> I think the admin can do quite a bit of other damage, by setting
> dirty_ratio too high.
> 
> Maybe this writeback throttling should just have a fixed limit of 80%
> ZONE_NORMAL, and limit dirty_ratio to something like 50%.

Bear in mind that the same problem will occur for the 16MB ZONE_DMA, and
we cannot limit the system-wide dirty-memory threshold to 12MB.

iow, throttle_vm_writeout() needs to become zone-aware.  Then it only
throttles when, say, 80% of ZONE_FOO is under writeback.

Except I don't think that'll fix the problem 100%: if your fuse kernel
component somehow manages to put 80% of ZONE_FOO under writeback (and
remember this might be only 12MB on a 16GB machine) then we get stuck again
- the fuse server process (is that the correct terminology, btw?) ends up
waiting upon itself.
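
A zone-aware check of that kind could look roughly like the C sketch
below.  It is an illustration only: struct zone_counts and
zone_over_writeback_limit() are hypothetical names and do not correspond
to the kernel's struct zone or to any existing function.

```c
#include <stdbool.h>

/* Hypothetical per-zone counters, in pages. */
struct zone_counts {
    unsigned long present_pages;   /* size of the zone */
    unsigned long writeback_pages; /* pages currently under writeback */
};

/*
 * Throttle only when more than `percent` of this particular zone is
 * under writeback.  The comparison is done in unsigned long long so the
 * multiplication cannot overflow for very large zones.
 */
static bool zone_over_writeback_limit(const struct zone_counts *z,
                                      unsigned int percent)
{
    unsigned long long used = (unsigned long long)z->writeback_pages * 100;
    unsigned long long limit = (unsigned long long)z->present_pages * percent;

    return used > limit;
}
```

For a 16MB ZONE_DMA with 4KB pages (4096 pages) and an 80% limit, this
starts throttling at 3277 pages under writeback, a little under 13MB,
which is the small absolute number the paragraph above is worried about.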

I'll think about it a bit.


Re: Vague maybe ppp-related panic report for 2.6.23-rc9

2007-10-04 Thread Herbert Xu
On Thu, Oct 04, 2007 at 01:51:13PM -0700, David Miller wrote:
> 
> I don't want to jump the gun on the analysis but it just might
> be the packet sharing fixes Herbert put in a short time ago.

I think the only change of mine that could affect ppp over a
serial line is this one.  I couldn't see anything obvious in
it but maybe someone else can.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
2a38b775b77f99308a4e571c13d908df78ac5e57
diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 7e21342..4b49d0e 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1525,7 +1525,7 @@ ppp_input_error(struct ppp_channel *chan, int code)
 static void
 ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
 {
-   if (skb->len >= 2) {
+   if (pskb_may_pull(skb, 2)) {
 #ifdef CONFIG_PPP_MULTILINK
/* XXX do channel-level decompression here */
if (PPP_PROTO(skb) == PPP_MP)
@@ -1577,7 +1577,7 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
if (ppp->vj == 0 || (ppp->flags & SC_REJ_COMP_TCP))
goto err;
 
-   if (skb_tailroom(skb) < 124) {
+   if (skb_tailroom(skb) < 124 || skb_cloned(skb)) {
/* copy to a new sk_buff with more tailroom */
ns = dev_alloc_skb(skb->len + 128);
if (ns == 0) {
@@ -1648,23 +1648,29 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
/* check if the packet passes the pass and active filters */
/* the filter instructions are constructed assuming
   a four-byte PPP header on each packet */
-   *skb_push(skb, 2) = 0;
-   if (ppp->pass_filter
-   && sk_run_filter(skb, ppp->pass_filter,
-ppp->pass_len) == 0) {
-   if (ppp->debug & 1)
-   printk(KERN_DEBUG "PPP: inbound frame not passed\n");
-   kfree_skb(skb);
-   return;
-   }
-   if (!(ppp->active_filter
- && sk_run_filter(skb, ppp->active_filter,
-  ppp->active_len) == 0))
-   ppp->last_recv = jiffies;
-   skb_pull(skb, 2);
-#else
-   ppp->last_recv = jiffies;
+   if (ppp->pass_filter || ppp->active_filter) {
+   if (skb_cloned(skb) &&
+   pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
+   goto err;
+
+   *skb_push(skb, 2) = 0;
+   if (ppp->pass_filter
+   && sk_run_filter(skb, ppp->pass_filter,
+ppp->pass_len) == 0) {
+   if (ppp->debug & 1)
+   printk(KERN_DEBUG "PPP: inbound frame "
+  "not passed\n");
+   kfree_skb(skb);
+   return;
+   }
+   if (!(ppp->active_filter
+ && sk_run_filter(skb, ppp->active_filter,
+  ppp->active_len) == 0))
+   ppp->last_recv = jiffies;
+   __skb_pull(skb, 2);
+   } else
 #endif /* CONFIG_PPP_FILTER */
+   ppp->last_recv = jiffies;
 
if ((ppp->dev->flags & IFF_UP) == 0
|| ppp->npmode[npi] != NPMODE_PASS) {
@@ -1762,7 +1768,7 @@ ppp_receive_mp_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
struct channel *ch;
int mphdrlen = (ppp->flags & SC_MP_SHORTSEQ)? MPHDRLEN_SSN: MPHDRLEN;
 
-   if (!pskb_may_pull(skb, mphdrlen) || ppp->mrru == 0)
+   if (!pskb_may_pull(skb, mphdrlen + 1) || ppp->mrru == 0)
goto err;   /* no good, throw it away */
 
/* Decode sequence number and begin/end bits */


Re: 2.6.23-rc9-git2: Known regressions from 2.6.22

2007-10-04 Thread Rafael J. Wysocki
On Friday, 5 October 2007 02:11, H. Peter Anvin wrote:
> Rafael J. Wysocki wrote:
> > 
> > Subject:        vga text console not working on 2.6.23-rc8
> > Submitter:      Santiago Garcia Mantinan <[EMAIL PROTECTED]>
> > References:     http://lkml.org/lkml/2007/9/28/342
> >                 http://bugzilla.kernel.org/show_bug.cgi?id=9099
> > Handled-By:     H. Peter Anvin <[EMAIL PROTECTED]>
> >                 Antonino A. Daplas <[EMAIL PROTECTED]>
> > 
> 
> This one was user error.  Not a regression.

OK, will drop.

Thanks,
Rafael


Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES

2007-10-04 Thread Linus Torvalds


On Fri, 5 Oct 2007, Paul Mackerras wrote:
> Linus Torvalds writes:
> > 
> > Well, since others definitely don't see this, including me, and I can do 
> > things like 62MB exec arrays:
> > 
> > [EMAIL PROTECTED] linux]$ echo $(find /home/torvalds/) | wc
> >   1  883304 63000962
> 
> That wouldn't actually do an exec, assuming you're using bash, since
> echo is a shell builtin in bash.  You'd need to do /bin/echo.

Right you are, silly me. But yes, it works for me even with that (and 
since I downloaded the gcc source tree, it now has six more megs of 
arguments).

I also tested that "ulimit -s" seems to do the right thing for me.
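
As a rough way to check the "ulimit -s" coupling from a program, here is
a minimal C sketch.  It assumes the rule from the new variable-length
argument code that the argv+envp strings may use at most a quarter of
the stack rlimit; that factor of 4 is an assumption stated here, not
something this mail spells out, and the helper names are hypothetical.

```c
#include <sys/resource.h>

/* Assumed rule: argument strings may use at most 1/4 of the stack limit. */
static unsigned long long arg_budget(unsigned long long stack_limit)
{
    return stack_limit / 4;
}

/*
 * Query the current soft stack limit and derive the budget from it.
 * Returns 0 on success, -1 if getrlimit() fails.  When the stack limit
 * is unlimited, *unlimited is set to 1 and *budget is left at 0.
 */
static int current_arg_budget(unsigned long long *budget, int *unlimited)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;

    *unlimited = (rl.rlim_cur == RLIM_INFINITY);
    *budget = *unlimited ? 0 : arg_budget((unsigned long long)rl.rlim_cur);
    return 0;
}
```

Under that assumption an 8MB stack limit gives a 2MB budget, so whether
a very large /bin/echo invocation succeeds depends on how far
"ulimit -s" has been raised.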

I'm also assuming Mathieu is running x86 (or x86-64): HP-PA has a stack 
that grows upwards, and that has traditionally been exciting.

IA64 also has some strange things for the register backing store.

Linus


Re: [PATCH] remove throttle_vm_writeout()

2007-10-04 Thread Miklos Szeredi
> > > This is a somewhat general problem: a userspace process is in the IO path.
> > > Userspace block drivers, for example - pretty much anything which involves
> > > kernel->userspace upcalls for storage applications.
> > > 
> > > I solved it once in the past by marking the userspace process as
> > > PF_MEMALLOC and I believe that others have implemented the same hack.
> > > 
> > > I suspect that what we need is a general solution, and that the solution
> > > will involve explicitly telling the kernel that this process is one which
> > > actually cleans memory and needs special treatment.
> > > 
> > > Because I bet there will be other corner-cases where such a process needs
> > > kernel help, and there might be optimisation opportunities as well.
> > > 
> > > Problem is, any such mark-me-as-special syscall would need to be
> > > privileged, and FUSE servers presently don't require special perms (do
> > > they?)
> > 
> > No, and that's a rather important feature, that I'd rather not give
> > up.
> 
> Can fuse do it?  Perhaps the fs can diddle the server's task_struct at
> registration time?

No, it's futile.  What if another process is involved (ssh in case of
sshfs), etc.

> >  But with the dirty limiting, the memory cleaning really shouldn't
> > be a problem, as there is plenty of memory _not_ used for dirty file
> > data, that the filesystem can use during the writeback.
> 
> I don't think I understand that.  Sure, it _shouldn't_ be a problem.  But it
> _is_.  That's what we're trying to fix, isn't it?

The problem, I believe, is in the memory allocation code, not in fuse.

In the example, memory allocation may be blocking indefinitely,
because we have 4MB under writeback, even though 28MB can still be
made available.  And that _should_ be fixable.

> > So the only thing the kernel should be careful about, is not to block
> > on an allocation if not strictly necessary.
> > 
> > Actually a trivial fix for this problem could be to just tweak the
> > thresholds, so as to make the above scenario impossible.  Although I'm
> > still not convinced this patch is perfect, because the dirty
> > threshold can actually change in time...
> > 
> > Index: linux/mm/page-writeback.c
> > ===
> > --- linux.orig/mm/page-writeback.c  2007-10-05 00:31:01.0 +0200
> > +++ linux/mm/page-writeback.c   2007-10-05 00:50:11.0 +0200
> > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask
> >  for ( ; ; ) {
> > get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> > 
> > +   /*
> > +* Make sure the threshold is over the hard limit of
> > +* dirty_thresh + ratelimit_pages * nr_cpus
> > +*/
> > +   dirty_thresh += ratelimit_pages * num_online_cpus();
> > +
> >  /*
> >   * Boost the allowable dirty threshold a bit for page
> >   * allocators so they don't get DoS'ed by heavy writers
> 
> I can probably kind of guess what you're trying to do here.  But if
> ratelimit_pages * num_online_cpus() exceeds the size of the offending zone
> then things might go bad.

I think the admin can do quite a bit of other damage, by setting
dirty_ratio too high.

Maybe this writeback throttling should just have a fixed limit of 80%
ZONE_NORMAL, and limit dirty_ratio to something like 50%.

Miklos


Re: 2.6.23-rc9-git2: Known regressions from 2.6.22

2007-10-04 Thread H. Peter Anvin

Rafael J. Wysocki wrote:
> 
> Subject:        vga text console not working on 2.6.23-rc8
> Submitter:      Santiago Garcia Mantinan <[EMAIL PROTECTED]>
> References:     http://lkml.org/lkml/2007/9/28/342
>                 http://bugzilla.kernel.org/show_bug.cgi?id=9099
> Handled-By:     H. Peter Anvin <[EMAIL PROTECTED]>
>                 Antonino A. Daplas <[EMAIL PROTECTED]>
> 

This one was user error.  Not a regression.

-hpa


Re: vm86.c audit_syscall_exit() call trashes registers

2007-10-04 Thread Chuck Ebbert
On 10/04/2007 07:58 PM, William Cattey wrote:
> 
> Sadly, the effect of the patch is the same as the most recent candidate
> patch from Jeremy Fitzhardinge:  The EDID transfer still comes up all
> zeros.
> 

I think maybe a better question is: why does read_edid still work?
The X server might be making some invalid assumption about system
state. Comparing the code the two programs use could provide some clues.



Re: 2.6.23-rc7-mm1 -- powerpc rtas panic

2007-10-04 Thread Nish Aravamudan
On 10/2/07, Tony Breeds <[EMAIL PROTECTED]> wrote:
> On Wed, Oct 03, 2007 at 10:30:16AM +1000, Michael Ellerman wrote:
>
> > I realise it'll make the patch bigger, but this doesn't seem like a
> > particularly good name for the variable anymore.
>
> Sure, what about?
>
> Clarify when RTAS logging is enabled.
>
> Signed-off-by: Tony Breeds <[EMAIL PROTECTED]>

For what it's worth, on a different ppc64 box, this resolves a similar
panic for me.

Tested-by: Nishanth Aravamudan <[EMAIL PROTECTED]>

Thanks,
Nish


Re: vm86.c audit_syscall_exit() call trashes registers

2007-10-04 Thread William Cattey
Thanks very much for thinking about this and providing a revised
candidate patch.

Sadly, the effect of the patch is the same as the most recent
candidate patch from Jeremy Fitzhardinge:  The EDID transfer still
comes up all zeros.

This is very perplexing to me.  If I take the code that appears in
2.6.18's vm86.c, and simply put #if 0 around the call to
audit_syscall_exit I get good data.

If this is indeed a correct minimal correction to the
audit_syscall_exit code, then perhaps there's some other condition
being exercised.  I guess my next step is to take the whole pt_regs
patch (commit 49d26b6eaa8e970c8cf6e299e6ccba2474191bf5) from
kernel.org and see if that has a beneficial effect.


-Bill



William Cattey
Linux Platform Coordinator
MIT Information Services & Technology

N42-040M, 617-253-0140, [EMAIL PROTECTED]
http://web.mit.edu/wdc/www/


On Oct 2, 2007, at 12:44 PM, Chuck Ebbert wrote:

> On 09/25/2007 07:38 PM, William Cattey wrote:
> >
> > I'd feel a lot more confident we were on the right track if I could just
> > correctly patch Fitzhardinge's cleanup into the test setup I have now.
> >
>
> I think you need to zero both registers if you're using 2.6.16, and force
> %eax as the source so it doesn't choose %ebp?
>
> --- a/arch/i386/kernel/vm86.c
> +++ b/arch/i386/kernel/vm86.c
> @@ -306,19 +334,19 @@ static void do_sys_vm86(struct kernel_vm86_struct *info, struct task_struct *tsk
>          tsk->thread.screen_bitmap = info->screen_bitmap;
>          if (info->flags & VM86_SCREEN_BITMAP)
>                  mark_screen_rdonly(tsk->mm);
> -        __asm__ __volatile__("xorl %eax,%eax; movl %eax,%fs; movl %eax,%gs\n\t");
> -        __asm__ __volatile__("movl %%eax, %0\n" :"=r"(eax));
>
>          /*call audit_syscall_exit since we do not exit via the normal paths */
>          if (unlikely(current->audit_context))
> -                audit_syscall_exit(AUDITSC_RESULT(eax), eax);
> +                audit_syscall_exit(AUDITSC_RESULT(0), 0);
>
>          __asm__ __volatile__(
>                  "movl %0,%%esp\n\t"
>                  "movl %1,%%ebp\n\t"
> +                "mov  %2, %%fs\n\t"
> +                "mov  %2, %%gs\n\t"
>                  "jmp resume_userspace"
>                  : /* no outputs */
> -                :"r" (&info->regs), "r" (task_thread_info(tsk)));
> +                :"r" (&info->regs), "r" (task_thread_info(tsk)), "a" (0));
>          /* we never return here */
>  }




Re: SLUB performance regression vs SLAB

2007-10-04 Thread Chuck Ebbert
On 10/04/2007 07:39 PM, David Schwartz wrote:
> But this is just a preposterous position to put him in. If there's no
> reproducible test case, then why should he care that one program he can't
> even see works badly? If you care, you fix it.
> 

People have been trying for years to make reproducible test cases
for huge and complex workloads. It doesn't work. The tests that do
work take weeks to run and need to be carefully validated before
they can be officially released. The open source community can and
should be working on similar tests, but they will never be simple.


Re: [PATCH] remove throttle_vm_writeout()

2007-10-04 Thread Andrew Morton
On Fri, 05 Oct 2007 01:26:12 +0200
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > This is a somewhat general problem: a userspace process is in the IO path. 
> > Userspace block drivers, for example - pretty much anything which involves
> > kernel->userspace upcalls for storage applications.
> > 
> > I solved it once in the past by marking the userspace process as
> > PF_MEMALLOC and I believe that others have implemented the same hack.
> > 
> > I suspect that what we need is a general solution, and that the solution
> > will involve explicitly telling the kernel that this process is one which
> > actually cleans memory and needs special treatment.
> > 
> > Because I bet there will be other corner-cases where such a process needs
> > kernel help, and there might be optimisation opportunities as well.
> > 
> > Problem is, any such mark-me-as-special syscall would need to be
> > privileged, and FUSE servers presently don't require special perms (do
> > they?)
> 
> No, and that's a rather important feature, that I'd rather not give
> up.

Can fuse do it?  Perhaps the fs can diddle the server's task_struct at
registration time?

>  But with the dirty limiting, the memory cleaning really shouldn't
> be a problem, as there is plenty of memory _not_ used for dirty file
> data, that the filesystem can use during the writeback.

I don't think I understand that.  Sure, it _shouldn't_ be a problem.  But it
_is_.  That's what we're trying to fix, isn't it?

> So the only thing the kernel should be careful about, is not to block
> on an allocation if not strictly necessary.
> 
> Actually a trivial fix for this problem could be to just tweak the
> thresholds, so as to make the above scenario impossible.  Although I'm
> still not convinced this patch is perfect, because the dirty
> threshold can actually change in time...
> 
> Index: linux/mm/page-writeback.c
> ===
> --- linux.orig/mm/page-writeback.c  2007-10-05 00:31:01.0 +0200
> +++ linux/mm/page-writeback.c   2007-10-05 00:50:11.0 +0200
> @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask
>  for ( ; ; ) {
> get_dirty_limits(_thresh, _thresh, NULL, 
> NULL);
> 
> +   /*
> +* Make sure the threshold is over the hard limit of
> +* dirty_thresh + ratelimit_pages * nr_cpus
> +*/
> +   dirty_thresh += ratelimit_pages * num_online_cpus();
> +
>  /*
>   * Boost the allowable dirty threshold a bit for page
>   * allocators so they don't get DoS'ed by heavy writers

I can probably kind of guess what you're trying to do here.  But if
ratelimit_pages * num_online_cpus() exceeds the size of the offending zone
then things might go bad.
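
To put rough numbers on that concern, here is a minimal userspace sketch of the arithmetic. The function names and the example values (8 CPUs, 1024 ratelimit pages, a 4096-page zone) are hypothetical and chosen only for illustration; this is not the kernel code.

```c
#include <assert.h>

/* Threshold after the proposed boost, in pages. */
unsigned long boosted_dirty_thresh(unsigned long dirty_thresh,
				   unsigned long ratelimit_pages,
				   unsigned int nr_cpus)
{
	return dirty_thresh + ratelimit_pages * nr_cpus;
}

/*
 * Nonzero when the boost alone is larger than the zone.
 * Hypothetical check, only to illustrate the worry about small zones.
 */
int boost_exceeds_zone(unsigned long ratelimit_pages,
		       unsigned int nr_cpus,
		       unsigned long zone_pages)
{
	return ratelimit_pages * nr_cpus > zone_pages;
}
```

With those example values the boost is 8 * 1024 = 8192 pages, which is twice the size of a 4096-page zone, so the boosted threshold could never be reached by that zone alone.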



Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-04 Thread Derek Fawcus
On Thu, Oct 04, 2007 at 07:18:47PM -0400, Chuck Ebbert wrote:
> > I ran firefox setuid to a different (not my main user),  uid+gid,  gave
> > my main account that gid as a supplemental group,  and gave that uid
> > access to the X magic cookie.
> 
> You need to use runxas to get any kind of real security.

Interesting script - sad how everyone reinvents equivalent things.

I had been experimenting with running the whole lot under Xnest,
with two extra users - one for the Xnest which had the main X
cookie, and another for the browser.  But found that it was just
too awkward (since I use multiple browser windows as well as tabs).

So I ended up trading a small security gain for usability.

The other thing I started playing with was the NX version of Xnest,
since it allows for a rootless server...

DF


RE: SLUB performance regression vs SLAB

2007-10-04 Thread David Schwartz

David Miller wrote:

> Using an unpublishable benchmark, whose results even cannot be
> published, really stretches the limits of "reasonable" don't you
> think?
>
> This "SLUB isn't ready yet" bullshit is just a shaman's dance which
> distracts attention away from the real problem, which is that a
> reproducible, publishable test case is not being provided to the
> developer so he can work on fixing the problem.
>
> I can tell you this thing would be fixed overnight if a proper test
> case had been provided by now.

I would just like to echo what you said just a bit angrier. This is the same
as someone asking him to fix a bug that they can only see with a binary-only
kernel module. I think he's perfectly justified in simply responding "the
bug is as likely to be in your code as mine".

Now, just because he's justified in doing that doesn't mean he should. I
presume he has an honest desire to improve his own code and if they've found
a real problem, I'm sure he'd love to fix it.

But this is just a preposterous position to put him in. If there's no
reproducible test case, then why should he care that one program he can't
even see works badly? If you care, you fix it.

Matthew Wilcox wrote:

> Yet here we stand.  Christoph is aggressively trying to get slab removed
> from the tree.  There is a testcase which shows slub performing worse
> than slab.  It's not my fault I can't publish it.  And just because I
> can't publish it doesn't mean it doesn't exist.

It means it may or may not exist. All we have is your word that slub is the
problem. If I said I found a bug in the Linux kernel that caused it to panic
but I could only reproduce it with the nVidia driver, I'd be laughed at.

It may even be that slub is better and your benchmark simply interprets this as
worse. Without the details of your benchmark, we can't know. For example,
I've seen benchmarks that (usually unintentionally) actually do a *variable*
amount of work and details of the implementation may result in the benchmark
actually doing *more* work, so it taking longer does not mean it ran slower.

DS




Re: [patch 1/2] getattr - fill the size of pipes

2007-10-04 Thread Alan Cox
> Cute feature, but it is (I assume) a Linux-specific extension and is
> something which we'll need to maintain for ever and it invites

Actually it used to work on the old old Linux pipe code.

> unportability to older Linuxes and other OSes and it introduces some risk
> of breakage of existing applications.  And it slows down fstat on a pipe.

Most Sys5 based boxes happen to put the right value there, but not
everyone does, and it's not guaranteed in the slightest.
> 
> Given that the info can already be obtained via ioctl(FIONREAD) anyway, I
> don't think that (gain > pain)?

Nor me - any application trying to reduce the syscall count would just do
a very large read and get the data and size in one go.
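
For reference, the existing ioctl(FIONREAD) route mentioned above looks roughly like this from userspace. This is a minimal sketch for Linux; the helper names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Return the number of bytes queued behind fd, or -1 on error. */
int pipe_bytes_queued(int fd)
{
	int nbytes = 0;

	if (ioctl(fd, FIONREAD, &nbytes) < 0)
		return -1;
	return nbytes;
}

/* Write len bytes into a fresh pipe and report what FIONREAD sees. */
int demo_queued_after_write(const char *buf, size_t len)
{
	int fds[2];
	int queued;

	assert(buf != NULL);
	if (pipe(fds) < 0)
		return -1;
	if (write(fds[1], buf, len) != (ssize_t)len) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	queued = pipe_bytes_queued(fds[0]);
	close(fds[0]);
	close(fds[1]);
	return queued;
}
```

This costs one extra syscall per query, which is the cost the proposed fstat() change was trying to avoid.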


Re: [PATCH 2/3] Prepare pid_nr() etc functions to work with not-NULL pids

2007-10-04 Thread Matt Mackall
On Thu, Oct 04, 2007 at 12:54:17PM +0400, Pavel Emelyanov wrote:
> Matt Mackall wrote:
> > On Wed, Oct 03, 2007 at 06:20:43PM +0400, Pavel Emelyanov wrote:
> >> Just make the __pid_nr() etc functions that expect the argument
> >> to always be not NULL.
> >>
> >> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>
> > 
> >>  static inline pid_t pid_nr(struct pid *pid)
> >>  {
> >>pid_t nr = 0;
> >>if (pid)
> >> -  nr = pid->numbers[0].nr;
> >> +  nr = __pid_nr(pid);
> >>return nr;
> >>  }
> > 
> > Is there a patch that removes these inlines? Otherwise this looks good
> > to me.
> 
> Not yet. Some of them are uninlined already, but others are not. I'd like
> to do some testing before uninlining them.

I was asking about the whole function, actually, not the keyword. Is
this function not equivalent to __pid_nr now?
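
As far as the quoted hunk shows, the only remaining difference is the NULL check. A small userspace mock of the two helpers, assuming a cut-down struct pid with a single numbers[] entry (the real structure has more fields):

```c
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures, for illustration only. */
struct upid {
	int nr;
};

struct pid {
	struct upid numbers[1];
};

/* Expects a non-NULL pid, as in the patch description. */
static inline int __pid_nr(struct pid *pid)
{
	return pid->numbers[0].nr;
}

/* Tolerates NULL and reports 0 for it, as in the quoted hunk. */
static inline int pid_nr(struct pid *pid)
{
	int nr = 0;

	if (pid)
		nr = __pid_nr(pid);
	return nr;
}
```

So the wrapper is only redundant at call sites that can never pass a NULL pid.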

-- 
Mathematics is the supreme nostalgia of our time.


Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-04 Thread Matt Mackall
On Thu, Oct 04, 2007 at 07:32:52AM +0200, Torsten Kaiser wrote:
> On 10/3/07, Matt Mackall <[EMAIL PROTECTED]> wrote:
> > Well I can see no reason why the vma we just got to by the mm->mmap
> > would have a vm_mm != mm, but I've certainly been wrong before.
> >
> > Try changing it to:
> >
> > for (vma = mm->mmap; vma; vma = vma->vm_next)
> > if (!is_vm_hugetlb_page(vma)) {
> > if (vma->vm_mm != mm)
> > printk("WTF: vma->vm_mm %p mm %p\n",
> > vma->vm_mm, mm);
> > walk_page_range(vma->vm_mm, vma->vm_start, 
> > vma->vm_end,
> > _refs_walk, vma);
> > }
> 
> You were right.
> I was able to trigger the error with above printk added, but nothing
> was written to the syslog.
> 
> So now I'm rather out of ideas what to test... :(

I'd give your previous bisect step another try.

Looking back at the thread a bit, anything that requires the machine
to be off for more than a couple seconds to manifest stops looking
like software and firmware and starts looking like a heat-related
electrical or mechanical issue. Make sure your backups are current.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH] remove throttle_vm_writeout()

2007-10-04 Thread Miklos Szeredi
> This is a somewhat general problem: a userspace process is in the IO path. 
> Userspace block drivers, for example - pretty much anything which involves
> kernel->userspace upcalls for storage applications.
> 
> I solved it once in the past by marking the userspace process as
> PF_MEMALLOC and I believe that others have implemented the same hack.
> 
> I suspect that what we need is a general solution, and that the solution
> will involve explicitly telling the kernel that this process is one which
> actually cleans memory and needs special treatment.
> 
> Because I bet there will be other corner-cases where such a process needs
> kernel help, and there might be optimisation opportunities as well.
> 
> Problem is, any such mark-me-as-special syscall would need to be
> privileged, and FUSE servers presently don't require special perms (do
> they?)

No, and that's a rather important feature, that I'd rather not give
up.  But with the dirty limiting, the memory cleaning really shouldn't
be a problem, as there is plenty of memory _not_ used for dirty file
data, that the filesystem can use during the writeback.

So the only thing the kernel should be careful about, is not to block
on an allocation if not strictly necessary.

Actually a trivial fix for this problem could be to just tweak the
thresholds, so as to make the above scenario impossible.  Although I'm
still not convinced this patch is perfect, because the dirty
threshold can actually change in time...

Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c  2007-10-05 00:31:01.0 +0200
+++ linux/mm/page-writeback.c   2007-10-05 00:50:11.0 +0200
@@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask
 for ( ; ; ) {
get_dirty_limits(_thresh, _thresh, NULL, NULL);

+   /*
+* Make sure the threshold is over the hard limit of
+* dirty_thresh + ratelimit_pages * nr_cpus
+*/
+   dirty_thresh += ratelimit_pages * num_online_cpus();
+
 /*
  * Boost the allowable dirty threshold a bit for page
  * allocators so they don't get DoS'ed by heavy writers




Re: [patch 1/2] getattr - fill the size of pipes

2007-10-04 Thread Andrew Morton
On Tue, 2 Oct 2007 19:54:53 +0200 (CEST)
Jan Engelhardt <[EMAIL PROTECTED]> wrote:

> [PATCH]: Fill the size of pipes
> 
> Instead of reporting 0 in size when stating() a pipe, we give the number of
> queued bytes. This might avoid using ioctl(FIONREAD) to get this information.
> 
> References and derived from: http://lkml.org/lkml/2007/4/2/138
> Cc: Eric Dumazet <[EMAIL PROTECTED]>
> Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]>


Cute feature, but it is (I assume) a Linux-specific extension and is
something which we'll need to maintain for ever and it invites
unportability to older Linuxes and other OSes and it introduces some risk
of breakage of existing applications.  And it slows down fstat on a pipe.

Given that the info can already be obtained via ioctl(FIONREAD) anyway, I
don't think that (gain > pain)?



[patch] reiser4: do not allocate struct file on stack

2007-10-04 Thread Edward Shishkin

Edward Shishkin wrote:
> Dave Hansen wrote:
> ...
> > I think that stack allocation is a pretty nasty trick for a structure
> > that's supposed to be pretty persistent and dynamically allocated, and
> > is certainly something that needs to get fixed up in a proper way.
>
> agreed.
>
> > This works around the problem for now, but this could potentially cause
> > more bugs any time we add some member to 'struct file' and depend on its
> > state being sane anywhere in the VFS. If there's a list anywhere of
> > merge-stopper reiser4 bugs around, this should probably go in there.
>
> will be fixed.

The promised fixup is attached.
Andrew, please apply.

Thanks,
Edward.
Do not allocate struct file on stack, pass the persistent one instead.

Signed-off-by: Edward Shishkin <[EMAIL PROTECTED]>
---
 linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.c|   35 --
 linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.h|2 
 linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/tail_conversion.c |   23 ++
 3 files changed, 26 insertions(+), 34 deletions(-)

--- linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.c.orig
+++ linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.c
@@ -566,23 +566,18 @@
  * items or add them to represent a hole at the end of file. The caller has to
  * obtain exclusive access to the file.
  */
-static int truncate_file_body(struct inode *inode, loff_t new_size)
+static int truncate_file_body(struct inode *inode, struct iattr *attr)
 {
 	int result;
+	loff_t new_size = attr->ia_size;
 
 	if (inode->i_size < new_size) {
 		/* expanding truncate */
-		struct dentry dentry;
-		struct file file;
-		struct unix_file_info *uf_info;
+		struct file * file = attr->ia_file;
+		struct unix_file_info *uf_info = unix_file_inode_data(inode);
+
+		assert("edward-1532", attr->ia_valid & ATTR_FILE);
 
-		dentry.d_inode = inode;
-		file.f_dentry = 
-		file.private_data = NULL;
-		file.f_pos = new_size;
-		file.private_data = NULL;
-		file.f_vfsmnt = NULL;
-		uf_info = unix_file_inode_data(inode);
 		result = find_file_state(inode, uf_info);
 		if (result)
 			return result;
@@ -615,19 +610,19 @@
 		return result;
 }
 			}
-			result = reiser4_write_extent(, NULL, 0,
+			result = reiser4_write_extent(file, NULL, 0,
 		  _size);
 			if (result)
 return result;
 			uf_info->container = UF_CONTAINER_EXTENTS;
 		} else {
 			if (uf_info->container ==  UF_CONTAINER_EXTENTS) {
-result = reiser4_write_extent(, NULL, 0,
+result = reiser4_write_extent(file, NULL, 0,
 			  _size);
 if (result)
 	return result;
 			} else {
-result = reiser4_write_tail(, NULL, 0,
+result = reiser4_write_tail(file, NULL, 0,
 			_size);
 if (result)
 	return result;
@@ -636,10 +631,10 @@
 		}
 		BUG_ON(result > 0);
 		INODE_SET_FIELD(inode, i_size, new_size);
-		file_update_time();
+		file_update_time(file);
 		result = reiser4_update_sd(inode);
 		BUG_ON(result != 0);
-		reiser4_free_file_fsdata();
+		reiser4_free_file_fsdata(file);
 	} else
 		result = shorten_file(inode, new_size);
 	return result;
@@ -2092,7 +2087,7 @@
 		 * first item is formatting item, therefore there was
 		 * incomplete extent2tail conversion. Complete it
 		 */
-		result = extent2tail(unix_file_inode_data(inode));
+		result = extent2tail(file, unix_file_inode_data(inode));
 	else
 		result = -EIO;
 
@@ -2372,7 +2367,7 @@
 		uf_info->container == UF_CONTAINER_EXTENTS &&
 		!should_have_notail(uf_info, inode->i_size) &&
 		!rofs_inode(inode)) {
-			result = extent2tail(uf_info);
+			result = extent2tail(file, uf_info);
 			if (result != 0) {
 warning("nikita-3233",
 	"Failed (%d) to convert in %s (%llu)",
@@ -2638,7 +2633,7 @@
 	if (result == 0)
 		result = safe_link_add(inode, SAFE_TRUNCATE);
 	if (result == 0)
-		result = truncate_file_body(inode, attr->ia_size);
+		result = truncate_file_body(inode, attr);
 	if (result)
 		warning("vs-1588", "truncate_file failed: oid %lli, "
 			"old size %lld, new size %lld, retval %d",
@@ -2724,7 +2719,7 @@
 	/* truncate file bogy first */
 	uf_info = unix_file_inode_data(inode);
 	get_exclusive_access(uf_info);
-	result = truncate_file_body(inode, 0 /* size */ );
+	result = shorten_file(inode, 0 /* size */ );
 	drop_exclusive_access(uf_info);
 
 	if (result)
--- linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.h.orig
+++ linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/file.h
@@ -237,7 +237,7 @@
 #define WRITE_GRANULARITY 32
 
 int tail2extent(struct unix_file_info *);
-int extent2tail(struct unix_file_info *);
+int extent2tail(struct file *, struct unix_file_info *);
 
 int goto_right_neighbor(coord_t *, lock_handle *);
 int find_or_create_extent(struct page *);
--- linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/tail_conversion.c.orig
+++ linux-2.6.23-rc8-mm2/fs/reiser4/plugin/file/tail_conversion.c
@@ -546,7 +546,7 @@
 
 /* for every page of file: read page, cut part of extent pointing to this page,
put data of page tree by tail item */
-int 

[patch] reiserfs: do not repair wrong journal params

2007-10-04 Thread Edward Shishkin

Jan Engelhardt wrote:
> On Aug 23 2007 15:59, Martin Vogt wrote:
> ...
> > Even if knoppix should not be used as a rescue/live CD, then
> > the reiserfs module should not try to correct something,
> > this should be done by another tool (fsck.reiserfs or a module option...)

The attached patch fixes this badness.

Thanks,
Edward.
When mounting a file system with wrong journal params 
do not try to repair them, suggest fsck instead.

Signed-off-by: Edward Shishkin <[EMAIL PROTECTED]>
---
 linux-2.6.23-rc8-mm2/fs/reiserfs/journal.c |  100 -
 1 files changed, 57 insertions(+), 43 deletions(-)

--- linux-2.6.23-rc8-mm2/fs/reiserfs/journal.c.orig
+++ linux-2.6.23-rc8-mm2/fs/reiserfs/journal.c
@@ -2649,6 +2649,61 @@
 	return result;
 }
 
+/**
+ * When creating/tuning a file system user can assign some
+ * journal params within boundaries which depend on the ratio
+ * blocksize/standard_blocksize.
+ *
+ * For blocks >= standard_blocksize transaction size should
+ * be not less than JOURNAL_TRANS_MIN_DEFAULT, and not more
+ * than JOURNAL_TRANS_MAX_DEFAULT.
+ *
+ * For blocks < standard_blocksize these boundaries should be
+ * decreased proportionally.
+ */
+#define REISERFS_STANDARD_BLKSIZE (4096)
+
+static int check_advise_trans_params(struct super_block *p_s_sb,
+ struct reiserfs_journal *journal)
+{
+if (journal->j_trans_max) {
+	/* Non-default journal params.
+		   Do sanity check for them. */
+	int ratio = 1;
+		if (p_s_sb->s_blocksize < REISERFS_STANDARD_BLKSIZE)
+		ratio = REISERFS_STANDARD_BLKSIZE / p_s_sb->s_blocksize;
+
+		if (journal->j_trans_max > JOURNAL_TRANS_MAX_DEFAULT / ratio ||
+		journal->j_trans_max < JOURNAL_TRANS_MIN_DEFAULT / ratio ||
+		SB_ONDISK_JOURNAL_SIZE(p_s_sb) / journal->j_trans_max <
+		JOURNAL_MIN_RATIO) {
+		reiserfs_warning(p_s_sb,
+ "sh-462: bad transaction max size (%u). FSCK?",
+ journal->j_trans_max);
+			return 1;
+		}
+		if (journal->j_max_batch != (journal->j_trans_max) *
+		JOURNAL_MAX_BATCH_DEFAULT/JOURNAL_TRANS_MAX_DEFAULT) {
+		reiserfs_warning(p_s_sb,
+"sh-463: bad transaction max batch (%u). FSCK?",
+journal->j_max_batch);
+			return 1;
+		}
+	} else {
+		/* Default journal params.
+   The file system was created by old version
+		   of mkreiserfs, so some fields contain zeros,
+		   and we need to advise proper values for them */
+	if (p_s_sb->s_blocksize != REISERFS_STANDARD_BLKSIZE)
+	reiserfs_panic(p_s_sb, "sh-464: bad blocksize (%u)",
+   p_s_sb->s_blocksize);
+		journal->j_trans_max = JOURNAL_TRANS_MAX_DEFAULT;
+		journal->j_max_batch = JOURNAL_MAX_BATCH_DEFAULT;
+		journal->j_max_commit_age = JOURNAL_MAX_COMMIT_AGE;
+	}
+	return 0;
+}
+
 /*
 ** must be called once on fs mount.  calls journal_read for you
 */
@@ -2744,49 +2799,8 @@
 	le32_to_cpu(jh->jh_journal.jp_journal_max_commit_age);
 	journal->j_max_trans_age = JOURNAL_MAX_TRANS_AGE;
 
-	if (journal->j_trans_max) {
-		/* make sure these parameters are available, assign it if they are not */
-		__u32 initial = journal->j_trans_max;
-		__u32 ratio = 1;
-
-		if (p_s_sb->s_blocksize < 4096)
-			ratio = 4096 / p_s_sb->s_blocksize;
-
-		if (SB_ONDISK_JOURNAL_SIZE(p_s_sb) / journal->j_trans_max <
-		JOURNAL_MIN_RATIO)
-			journal->j_trans_max =
-			SB_ONDISK_JOURNAL_SIZE(p_s_sb) / JOURNAL_MIN_RATIO;
-		if (journal->j_trans_max > JOURNAL_TRANS_MAX_DEFAULT / ratio)
-			journal->j_trans_max =
-			JOURNAL_TRANS_MAX_DEFAULT / ratio;
-		if (journal->j_trans_max < JOURNAL_TRANS_MIN_DEFAULT / ratio)
-			journal->j_trans_max =
-			JOURNAL_TRANS_MIN_DEFAULT / ratio;
-
-		if (journal->j_trans_max != initial)
-			reiserfs_warning(p_s_sb,
-	 "sh-461: journal_init: wrong transaction max size (%u). Changed to %u",
-	 initial, journal->j_trans_max);
-
-		journal->j_max_batch = journal->j_trans_max *
-		JOURNAL_MAX_BATCH_DEFAULT / JOURNAL_TRANS_MAX_DEFAULT;
-	}
-
-	if (!journal->j_trans_max) {
-		/*we have the file system was created by old version of mkreiserfs 
-		   so this field contains zero value */
-		journal->j_trans_max = JOURNAL_TRANS_MAX_DEFAULT;
-		journal->j_max_batch = JOURNAL_MAX_BATCH_DEFAULT;
-		journal->j_max_commit_age = JOURNAL_MAX_COMMIT_AGE;
-
-		/* for blocksize >= 4096 - max transaction size is 1024. For block size < 4096
-		   trans max size is decreased proportionally */
-		if (p_s_sb->s_blocksize < 4096) {
-			journal->j_trans_max /= (4096 / p_s_sb->s_blocksize);
-			journal->j_max_batch = (journal->j_trans_max) * 9 / 10;
-		}
-	}
-
+	if (check_advise_trans_params(p_s_sb, journal) != 0)
+	goto free_and_return;
 	journal->j_default_max_commit_age = journal->j_max_commit_age;
 
 	if (commit_max_age != 0) {
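
To see what the new check accepts, here is a userspace sketch of the non-default branch only. The 4096 and 1024 values appear in the patch and in the comment it removes; the other three constants are assumptions for the example and should be taken from the reiserfs headers. The function name is made up.

```c
#include <assert.h>

#define STANDARD_BLKSIZE   4096
#define TRANS_MAX_DEFAULT  1024
#define TRANS_MIN_DEFAULT  256	/* assumed value */
#define MAX_BATCH_DEFAULT  900	/* assumed value */
#define MIN_RATIO          2	/* assumed value */

/* Return 0 if the on-disk values pass, 1 if fsck should be suggested. */
int trans_params_bad(unsigned int blocksize, unsigned int journal_size,
		     unsigned int trans_max, unsigned int max_batch)
{
	unsigned int ratio = 1;

	assert(trans_max != 0);	/* zero means "use defaults", not covered here */

	/* Bounds shrink proportionally for blocks below the standard size. */
	if (blocksize < STANDARD_BLKSIZE)
		ratio = STANDARD_BLKSIZE / blocksize;

	if (trans_max > TRANS_MAX_DEFAULT / ratio ||
	    trans_max < TRANS_MIN_DEFAULT / ratio ||
	    journal_size / trans_max < MIN_RATIO)
		return 1;

	if (max_batch != trans_max * MAX_BATCH_DEFAULT / TRANS_MAX_DEFAULT)
		return 1;
	return 0;
}
```

Under those assumed constants, a 1024-byte block size gives ratio 4, so trans_max has to lie between 64 and 256, and a value of 1024 that is fine on 4096-byte blocks is rejected rather than silently rewritten.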


Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-04 Thread Chuck Ebbert
On 10/04/2007 06:56 PM, Derek Fawcus wrote:
> 
> I ran firefox setuid to a different (not my main user),  uid+gid,  gave
> my main account that gid as a supplemental group,  and gave that uid
> access to the X magic cookie.

You need to use runxas to get any kind of real security.


[IRQ map] VIA C7 CN700 2.6.23-rc9-git USB IRQs disabled

2007-10-04 Thread Guennadi Liakhovetski
Booting git snapshot of about 6 hours ago, getting the following:

USB Universal Host Controller Interface driver v3.0
ACPI: PCI Interrupt Link [ALKB] enabled at IRQ 21
ACPI: PCI Interrupt :00:10.0[A] -> Link [ALKB] -> GSI 21 (level, low) -> 
IRQ 18
ACPI: PCI interrupt for device :00:10.0 disabled
uhci_hcd :00:10.0: init :00:10.0 fail, -16
uhci_hcd: probe of :00:10.0 failed with error -16
ACPI: PCI Interrupt :00:10.1[A] -> Link [ALKB] -> GSI 21 (level, low) -> 
IRQ 18
ACPI: PCI interrupt for device :00:10.1 disabled
uhci_hcd :00:10.1: init :00:10.1 fail, -16
uhci_hcd: probe of :00:10.1 failed with error -16
ACPI: PCI Interrupt :00:10.2[B] -> Link [ALKB] -> GSI 21 (level, low) -> 
IRQ 18
ACPI: PCI interrupt for device :00:10.2 disabled
uhci_hcd :00:10.2: init :00:10.2 fail, -16
uhci_hcd: probe of :00:10.2 failed with error -16
ACPI: PCI Interrupt :00:10.3[B] -> Link [ALKB] -> GSI 21 (level, low) -> 
IRQ 18
ACPI: PCI interrupt for device :00:10.3 disabled
uhci_hcd :00:10.3: init :00:10.3 fail, -16
uhci_hcd: probe of :00:10.3 failed with error -16
ACPI: PCI Interrupt :00:10.4[C] -> Link [ALKB] -> GSI 21 (level, low) -> 
IRQ 18
ACPI: PCI interrupt for device :00:10.4 disabled
ehci_hcd :00:10.4: init :00:10.4 fail, -16
ehci_hcd: probe of :00:10.4 failed with error -16

With "pci=routeirq" it is the same, but then it's "IRQ 17" instead of 18, 
and the line

ACPI: PCI Interrupt Link [ALKB] enabled at IRQ 21

is missing. Works with Debian etch default 2.6.18. /proc/interrupts under 
.23-rc9-...:

$ cat /proc/interrupts
   CPU0
  0:  31756   IO-APIC-edge  timer
  1:  2   IO-APIC-edge  i8042
  8:  1   IO-APIC-edge  rtc
  9:  0   IO-APIC-fasteoi   acpi
 12:  4   IO-APIC-edge  i8042
 16:   2627   IO-APIC-fasteoi   sata_via
 19:472   IO-APIC-fasteoi   eth0

Under 2.6.18:

ACPI: PCI Interrupt Link [ALKB] enabled at IRQ 21
ACPI: PCI Interrupt :00:10.0[A] -> Link [ALKB] -> GSI 21 (level, low) -> 
IRQ 177
PCI: VIA IRQ fixup for :00:10.0, from 10 to 1
uhci_hcd :00:10.0: UHCI Host Controller
uhci_hcd :00:10.0: new USB bus registered, assigned bus number 1
uhci_hcd :00:10.0: irq 177, io base 0xf900

Thanks
Guennadi
---
Guennadi Liakhovetski


Re: [patch 09/12] fuse: add list of writable files to fuse_inode

2007-10-04 Thread Miklos Szeredi
> hm.  At no point in this patch series does anything actually get added to
> these lists, so this patch is presently a no-op.
> 
> I'll assume that it will get used later.  But it is a bit odd to add
> infrastructure in a patch series, then not use it.  Why not hold the patch
> back and include it in the patch series which actually uses these lists for
> something?

My stupidity.  I somehow thought the patch does actually do something
interesting when including it in this series, instead of holding it
back for the writable-mmap series.

Miklos


Re: [patch 12/12] fuse: add blksize field to fuse_attr

2007-10-04 Thread Miklos Szeredi
> > From: Miklos Szeredi <[EMAIL PROTECTED]>
> > 
> > Allow the userspace filesystem to supply a blksize value to be
> > returned by stat() and friends.  If the field is zero, it defaults to
> > the old PAGE_CACHE_SIZE value.
> > 
> 
> Why does fuse need this feature?

There are cases when the filesystem will be passed the buffer from a
single read or write call, namely:

 1) in 'direct-io' mode (not O_DIRECT), read/write requests don't go
through the page cache, but go directly to the userspace fs

 2) currently buffered writes are done with single page requests, but
if Nick's ->perform_write() patch goes in, it will be possible to
do larger write requests.  But only if the original write() was
also bigger than a page.


In these cases the filesystem might want to give a hint to the app
about the optimal I/O size.
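
From the application side, picking up that hint could look like the following sketch. It only uses fstat() and st_blksize; the helper names and the 4096-byte fallback are choices made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the preferred I/O size for fd, or fallback if fstat() fails. */
size_t preferred_io_size(int fd, size_t fallback)
{
	struct stat st;

	if (fstat(fd, &st) < 0 || st.st_blksize <= 0)
		return fallback;
	return (size_t)st.st_blksize;
}

/* Copy everything from in to out in chunks of the source's hinted size. */
ssize_t copy_fd(int in, int out)
{
	size_t bufsz = preferred_io_size(in, 4096);
	char *buf = malloc(bufsz);
	ssize_t n, total = 0;

	if (!buf)
		return -1;
	while ((n = read(in, buf, bufsz)) > 0) {
		if (write(out, buf, (size_t)n) != n) {
			total = -1;
			break;
		}
		total += n;
	}
	free(buf);
	return n < 0 ? -1 : total;
}
```

An application written this way would issue larger reads against a filesystem that reports a larger blksize, without knowing anything about fuse.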

Miklos


Re: [patch 1/3] Trace code and documentation

2007-10-04 Thread David Wilder

Andi Kleen wrote:

On Thu, Oct 04, 2007 at 12:19:35PM -0700, David Wilder wrote:

Andi Kleen wrote:

"David J. Wilder" <[EMAIL PROTECTED]> writes:

@@ -0,0 +1,160 @@
+Trace Setup and Control
+===
+In the kernel, the trace interface provides a simple mechanism for
+starting and managing data channels (traces) to user space.

Wasn't relayfs supposed to do that already? Why do you need another
wrapper around it? 
The code in trace is exactly what all the current users of relay do. 
Therefor trace reduces the duplication of code.


If everybody does this then the code should be just put into
relayfs?


I disagree, I keeping the code separate (layering if you will) makes it 
easer to use and maintain.







Is this also really still faster than a printk below log level
(without console driver overhead). If not then why not just
use printk?
Are you arguing against relayfs or trace?  Trace just makes relayfs 
easer to use.  I think relayfs can stand up for it's self.


I'm arguing against complicated trace mechanisms that are not fast.


What makes trace complicated?  It is just, open ,start/stop, close.  I 
can't see how an trace API could be any simpler.




At some point when I looked at relayfs it seemed to be reasonably
fast (per cpu buffers; not much locking,

 over head per call roughtly like putchar()),
but that might have regressed. 


No regression has occurred.  According the relay documentation if you 
use global bufferers you must use locking.  If you don't want to use 
locking use per-cpu bufferers.




Your example module with its lock definitely looks very slow and I don't approve
of it.



If you don't approve of the locking then use per-cpu bufferers.  The 
example will do ether.




The example shows a way to create an ASCII data layer.


ASCII layers don't make much sense imho -- these should just use printk.



So the only way I should pass ASCII to user space is using printk?  I 
don't understand that.  Again nothing in trace limits you to ASCII data.



Fast dedicated binary log channels make sense though; but you don't
seem really to be very concentrated on that.


I impose no restriction on what type of data you can pass over trace's 
fast dedicated channels.




True, to make trace "fast" you need a data layer that can handle the 
requirements of per-cpu buffers.  However there are still advantages of 
trace over printk even when using global bufferers: selectable bufferer 
sizes,


printk has selectable buffer sizes too.


   "Long term we probably want more complex tracing based on lttng,
but I'm a big fan of starting out simple and doing incremental
changes."


It's just that relayfs + another not simple layer are definitely not simple.

For a simple logger I'm thinking more like something like SGI's old
ktrace module (which undoubtedly many other people have recreated many
times for specific debugging scenarios)

But that all only makes sense if the overhead is really kept low
and i don't see that in your approach.


Is your complaint with the overhead of setting up a trace channel or the 
overhead of writing to a trace channel?   For the later, trace adds 
almost no overhead on top of relay.





One advantage of the trace approach is separating control and data 
layers, therefor trace can support multiple data layers to fit multiple 
requirements.


I have my ideas on how to develop data layer, others may have their own 
ideas and I welcome the input.


relayfs was supposed to be that data layer.


I am using the layer definitions described in trace.txt.  In this 
definition relay is a buffering layer.




PS: Systemtap has been criticized for introducing out-of-tree kernel 
code.  A clear direction from the community is to move re-usable code 
in-tree where it can be maintained.  Trace is a move in that direction.


I'm all for that. I believe a simple fast efficient no frills logger
would serve systemtap just fine too. But the approach here seems
to be more to add all kinds of knobs and whizzles until you end
up with something as slow as printk. And since we already have
printk another one just doesn't seem to make much sense.


If by knobs you mean the trace controls, the only one that has any 
effect on the "speed" of tracing is the control to start and stop 
tracing.  And that has been designed to impose the minimal impact 
possible (one "if" in the tracing path).
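
The "one if" cost can be pictured with a small sketch.  This is not code
from the trace patch: struct trace_chan, trace_write() and the flat
buffer are hypothetical names standing in for the real control and relay
layers, and the only point is where the enable test sits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical channel: a flat buffer plus an on/off control. */
struct trace_chan {
	int enabled;		/* toggled by the start/stop control */
	size_t used;		/* bytes already stored in buf */
	char buf[256];		/* stand-in for the relay buffer */
};

/* Returns the number of bytes stored; 0 when tracing is stopped or full. */
static size_t trace_write(struct trace_chan *ch, const void *data, size_t len)
{
	assert(ch && data);
	if (!ch->enabled)	/* the single test added to the write path */
		return 0;
	if (len > sizeof(ch->buf) - ch->used)
		return 0;	/* sketch only: drop records that do not fit */
	memcpy(ch->buf + ch->used, data, len);
	ch->used += len;
	return len;
}
```

With tracing stopped the call returns right after the first test, so a
disabled trace point costs one load and one branch in this model.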




-Andi





Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

2007-10-04 Thread Andrew Morton
On Thu, 4 Oct 2007 16:40:44 -0600
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> On Oct 04, 2007  13:12 -0700, Andrew Morton wrote:
> > On Mon, 01 Oct 2007 17:35:46 -0700
> > > ext2: Avoid rec_len overflow with 64KB block size
> > > 
> > > into 16 bits we have for entry length. So we store 0x instead and
> > > convert value when read from / written to disk.
> > 
> > This patch clashes in non-trivial ways with
> > ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> > already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> > patch, please.
> 
> If the rec_len overflow patch isn't going to make it, then we also need
> to revert the EXT*_MAX_BLOCK_SIZE change to 65536.  It would be possible
> to allow this to be up to 32768 w/o the rec_len overflow fix however.
> 

Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and
ext2-support-large-blocksize-up-to-pagesize.patch.



Re: [PATCH] remove throttle_vm_writeout()

2007-10-04 Thread Andrew Morton
On Fri, 05 Oct 2007 00:39:16 +0200
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > throttle_vm_writeout() should be a per-zone thing, I guess.  Perhaps fixing
> > that would fix your deadlock.  That's doubtful, but I don't know anything
> > about your deadlock so I cannot say.
> 
> No, doing the throttling per-zone won't in itself fix the deadlock.
> 
> Here's a deadlock example:
> 
> Total memory = 32M
> /proc/sys/vm/dirty_ratio = 10
> dirty_threshold = 3M
> ratelimit_pages = 1M
> 
> Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap on
> a fuse fs.  Page balancing is called which turns all these into
> writeback pages.
> 
> Then userspace filesystem gets a write request, and tries to allocate
> memory needed to complete the writeout.
> 
> That will possibly trigger direct reclaim, and throttle_vm_writeout()
> will be called.  That will block until nr_writeback goes below 3.3M
> (dirty_threshold + 10%).  But since all 4M of writeback is from the
> fuse fs, that will never happen.
> 
> Does that explain it better?
> 

yup, thanks.

This is a somewhat general problem: a userspace process is in the IO path. 
Userspace block drivers, for example - pretty much anything which involves
kernel->userspace upcalls for storage applications.

I solved it once in the past by marking the userspace process as
PF_MEMALLOC and I believe that others have implemented the same hack.

I suspect that what we need is a general solution, and that the solution
will involve explicitly telling the kernel that this process is one which
actually cleans memory and needs special treatment.

Because I bet there will be other corner-cases where such a process needs
kernel help, and there might be optimisation opportunities as well.

Problem is, any such mark-me-as-special syscall would need to be
privileged, and FUSE servers presently don't require special perms (do
they?)



Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-04 Thread Derek Fawcus
On Wed, Oct 03, 2007 at 01:12:46AM +0100, Alan Cox wrote:
> 
> The value of SELinux (or indeed any system compartmentalising access and
> limiting damage) comes into play when you get breakage - eg via a web
> browser exploit.

well,  being sick of the number of times one has to upgrade the browser
for exploits,  I addressed it in a different way.

I ran firefox setuid to a different (not my main user),  uid+gid,  gave
my main account that gid as a supplemental group,  and gave that uid
access to the X magic cookie.

...  which only changes the nature of any exploit that might occur - any
injected code would have to go via X to attack my main account.

DF


Re: [patch 12/12] fuse: add blksize field to fuse_attr

2007-10-04 Thread Andrew Morton
On Tue, 02 Oct 2007 17:50:38 +0200
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Allow the userspace filesystem to supply a blksize value to be
> returned by stat() and friends.  If the field is zero, it defaults to
> the old PAGE_CACHE_SIZE value.
> 

Why does fuse need this feature?



Re: [patch 02/12] fuse: fix race between getattr and write

2007-10-04 Thread Miklos Szeredi
> > @@ -228,6 +243,7 @@ static struct dentry *fuse_lookup(struct
> > struct fuse_conn *fc = get_fuse_conn(dir);
> > struct fuse_req *req;
> > struct fuse_req *forget_req;
> > +   u64 attr_version;
> >  
> > if (entry->d_name.len > FUSE_NAME_MAX)
> > return ERR_PTR(-ENAMETOOLONG);
> > @@ -242,6 +258,10 @@ static struct dentry *fuse_lookup(struct
> > return ERR_PTR(PTR_ERR(forget_req));
> > }
> >  
> > +   spin_lock(&fc->lock);
> > +   attr_version = fc->attr_version;
> > +   spin_unlock(&fc->lock);
> 
> You might want to do this (oft-repeated) operation in a little helper
> function.
> 
> Because I suspect that the lock isn't needed if CONFIG_64BIT=y.

You're perfectly right, although fuse is not yet at the stage, where
I'd bother too much with scalability optimizations like that ;)

But it's a good cleanup, and I'll do an incremental patch on top of
this if that's OK.

Miklos


Re: [patch 09/12] fuse: add list of writable files to fuse_inode

2007-10-04 Thread Andrew Morton
On Tue, 02 Oct 2007 17:50:35 +0200
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Each WRITE request must carry a valid file descriptor.  When a page is
> written back from a memory mapping, the file through which the page
> was dirtied is not available, so a new mechanism is needed to find a
> suitable file in ->writepage(s).
> 
> A list of fuse_files is added to fuse_inode.  The file is removed from
> the list in fuse_release().
> 
> This patch is in preparation for writable mmap support.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> ---
> 
> Index: linux/fs/fuse/file.c
> ===
> --- linux.orig/fs/fuse/file.c 2007-10-01 22:42:26.0 +0200
> +++ linux/fs/fuse/file.c  2007-10-01 22:42:27.0 +0200
> @@ -56,6 +56,7 @@ struct fuse_file *fuse_file_alloc(void)
>   kfree(ff);
>   ff = NULL;
>   }
> + INIT_LIST_HEAD(&ff->write_entry);
>   atomic_set(&ff->count, 0);
>   }
>   return ff;
> @@ -150,12 +151,18 @@ int fuse_release_common(struct inode *in
>  {
>   struct fuse_file *ff = file->private_data;
>   if (ff) {
> + struct fuse_conn *fc = get_fuse_conn(inode);
> +
>   fuse_release_fill(ff, get_node_id(inode), file->f_flags,
> isdir ? FUSE_RELEASEDIR : FUSE_RELEASE);
>  
>   /* Hold vfsmount and dentry until release is finished */
>   ff->reserved_req->vfsmount = mntget(file->f_path.mnt);
>   ff->reserved_req->dentry = dget(file->f_path.dentry);
> +
> + spin_lock(&fc->lock);
> + list_del(&ff->write_entry);
> + spin_unlock(&fc->lock);
>   /*
>* Normally this will send the RELEASE request,
>* however if some asynchronous READ or WRITE requests
> Index: linux/fs/fuse/fuse_i.h
> ===
> --- linux.orig/fs/fuse/fuse_i.h   2007-10-01 22:42:24.0 +0200
> +++ linux/fs/fuse/fuse_i.h2007-10-01 22:43:15.0 +0200
> @@ -70,6 +70,9 @@ struct fuse_inode {
>  
>   /** Version of last attribute change */
>   u64 attr_version;
> +
> + /** Files usable in writepage.  Protected by fc->lock */
> + struct list_head write_files;
>  };
>  
>  /** FUSE specific file data */
> @@ -82,6 +85,9 @@ struct fuse_file {
>  
>   /** Refcount */
>   atomic_t count;
> +
> + /** Entry on inode's write_files list */
> + struct list_head write_entry;
>  };
>  
>  /** One input argument of a request */
> Index: linux/fs/fuse/inode.c
> ===
> --- linux.orig/fs/fuse/inode.c2007-10-01 22:42:24.0 +0200
> +++ linux/fs/fuse/inode.c 2007-10-01 22:42:27.0 +0200
> @@ -56,6 +56,7 @@ static struct inode *fuse_alloc_inode(st
>   fi->i_time = 0;
>   fi->nodeid = 0;
>   fi->nlookup = 0;
> + INIT_LIST_HEAD(&fi->write_files);
>   fi->forget_req = fuse_request_alloc();
>   if (!fi->forget_req) {
>   kmem_cache_free(fuse_inode_cachep, inode);
> @@ -68,6 +69,7 @@ static struct inode *fuse_alloc_inode(st
>  static void fuse_destroy_inode(struct inode *inode)
>  {
>   struct fuse_inode *fi = get_fuse_inode(inode);
> + BUG_ON(!list_empty(&fi->write_files));
>   if (fi->forget_req)
>   fuse_request_free(fi->forget_req);
>   kmem_cache_free(fuse_inode_cachep, inode);

hm.  At no point in this patch series does anything actually get added to
these lists, so this patch is presently a no-op.

I'll assume that it will get used later.  But it is a bit odd to add
infrastructure in a patch series, then not use it.  Why not hold the patch
back and include it in the patch series which actually uses these lists for
something?




[OOPS] AXIS 700 Lite (VIA C7 CPU) BUG with 2.6.23-rc9-git (i2c)

2007-10-04 Thread Guennadi Liakhovetski
Hi

Got an AXIS 700 Lite thin client with a C7 CPU and CN700 chipset in it, 
compiled today's git snapshot, and it Oopses in i2c_viapro:

BUG: unable to handle kernel paging request at virtual address 016c0555
 printing eip:
c01a60ed
*pde = 
Oops:  [#1]
PREEMPT 
Modules linked in: i2c_viapro i2c_dev i2c_core loop
CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00010282   (2.6.23-rc9-g804b3f9a #5)
EIP is at sysfs_create_group+0x1d/0xe0
eax: f889b828   ebx:    ecx: f7d764b0   edx: 016c0555
esi: 016c0555   edi:    ebp: f7d71dc4   esp: f7d71da8
ds: 007b   es: 007b   fs:   gs: 0033  ss: 0068
Process modprobe (pid: 1214, ti=f7d7 task=c18f3560 task.ti=f7d7)
Stack: c03ceba0 f7d7647c f7d71dd8 c01a392b f7d764b0  f889f3a0 f7d71ddc 
   c0248b38 f889b828 f889b7c0   f7d71df4 c0248c0c  
   f889b7c0  f889b864 f7d71e20 c02491ee f889b828 c0367b56 f889b864 
Call Trace:
 [] show_trace_log_lvl+0x1c/0x40
 [] show_stack_log_lvl+0x9a/0xc0
 [] show_registers+0x1dc/0x340
 [] die+0x102/0x210
 [] do_page_fault+0x266/0x600
 [] error_code+0x6a/0x70
 [] device_add_groups+0x28/0x60
 [] device_add_attrs+0x5c/0xb0
 [] device_add+0xfe/0x330
 [] device_register+0x12/0x20
 [] i2c_register_adapter+0xbd/0x170 [i2c_core]
 [] i2c_add_adapter+0x7a/0x80 [i2c_core]
 [] vt596_probe+0x145/0x370 [i2c_viapro]
 [] pci_call_probe+0xd/0x10
 [] __pci_device_probe+0x4f/0x60
 [] pci_device_probe+0x29/0x50
 [] really_probe+0x94/0x140
 [] driver_probe_device+0x40/0x60
 [] __driver_attach+0x7a/0x80
 [] bus_for_each_dev+0x54/0x70
 [] driver_attach+0x19/0x20
 [] bus_add_driver+0x77/0x130
 [] driver_register+0x75/0x80
 [] __pci_register_driver+0x4a/0x80
 [] i2c_vt596_init+0x17/0x19 [i2c_viapro]
 [] sys_init_module+0xe2/0x140
 [] sysenter_past_esp+0x5f/0x85
 ===
Code: e8 8d b6 00 00 00 00 8d bc 27 00 00 00 00 55 89 e5 56 89 d6 53 83 ec 
14 85 c0 0f 84 b2 00 00 00 8b 48 30 85 c9 0f 84 a7 00 00 00 <8b> 12 85 d2 
0f 85 89 00 00 00 89 4d f4 8b 5d f4 85 db 74 0b 8b 
EIP: [] sysfs_create_group+0x1d/0xe0 SS:ESP 0068:f7d71da8

Thanks
Guennadi
---
Guennadi Liakhovetski


Re: JBD-DEBUG /proc/sys entry (again)

2007-10-04 Thread Jose R. Santos
On Thu, 04 Oct 2007 16:28:07 +0400
Rusev <[EMAIL PROTECTED]> wrote:

> All that should be moved to DEBUGFS under /sys/kernel/debug  and so on
> -  that's right, a bit other issue
> is of interest for me.
> 
> My suggestion is that a few other problems with PROCFS exist:
> 
> From my observation there are two major issues are involved:
> 
> 1.  /proc/sys entry has very specific readdir operation (vs. other
> entries such as /proc/drivers and others)
> entries to /proc/sys is likely is to be performed by means of other API.
> (quick search found that API explanation for 2.4.xx
> http://www.opentech.at/papers/embedded_proc/node33.html,
>  yet it looks to be a little change in 2.6)
> 
> 2.  function xlate_proc_name behaves not the way it specified in it's
> header comment:
>   [1]  minor issue is that proc_match would likely return "equals"
> as result of comparison of "sys" and "sysvipc"
>   [2] more significant issue is that it can't properly walk long
> paths  such as /proc/sys/jbd/jbd-debug,
>but only paths likes /proc/sys/jbd-debug  (just one step
> down, path walking is broken).
> 
> This way we can't add not only /proc/sys/jbd/jbd-debug but any path
> likes /proc/aaa/bbb/xxx-debug at one step.
> The entry /proc/sys is still specific, because even if fixing
> xlate_proc_name we can't see /proc/sys/jbd/jbd-debug
> in userspace and successfully see /proc/aaa/bbb/jbd-debug.

This patch is wrong.  xlate_proc_name() is meant to check if a given
path is valid; creating new directory entries is something that needs to
be handled by the code that's creating the entries.

Also note that xlate_proc_name() is also called by remove_proc_entry(),
so if I call it with a bogus path, this patch will end up creating new
directory entries, which is not the intended result.

> That's because /proc/sys specific operator readdir blocks such PROCFS
> entries that are NOT properly registered
>  with CTL_TABLE.
> 
> Yet I think that we have a general problem with
> adding-long-paths-in-one-step, which is addressed by the following patch:

This should not be done in one step, and for good reason.  If you
blindly create multiple directory entries in /proc, how are you going to
keep track of all the created entries when it's time to remove them
(module unloading for example)?

If you enter an invalid path the original code is doing the right thing
by returning -ENOENT.

> 
> 
> 
> diff -uprN linux-2.6.21.orig/fs/proc/generic.c
> linux-2.6.21/fs/proc/generic.c
> --- linux-2.6.21.orig/fs/proc/generic.c 2007-09-13 15:36:07.0 +0400
> +++ linux-2.6.21/fs/proc/generic.c  2007-10-03 22:12:57.0 +0400
> @@ -298,6 +298,7 @@ static int xlate_proc_name(const char *n
> int len;
> int rtn = 0;
> 
> +

White space damage.

> spin_lock(&proc_subdir_lock);
> de = &proc_root;
> while (1) {
> @@ -305,24 +306,52 @@ static int xlate_proc_name(const char *n
> if (!next)
> break;
> 
> -   len = next - cp;
> -   for (de = de->subdir; de ; de = de->next) {
> -   if (proc_match(len, cp, de))
> -   break;
> -   }
> -   if (!de) {
> -   rtn = -ENOENT;
> -   goto out;
> -   }
> -   cp += len + 1;
> -   }
> +++next;
> +
> +
> +len = next - cp;
> +
> +if(de->subdir == NULL){
> +  /* directory "de" is empty, add myself to it now */
> +  char* my_name = kzalloc( (len - 1)  + 1, GFP_KERNEL);

You did not check whether kzalloc succeeded.  If the allocation
fails, bad things will happen here.  You need to check the value returned
in my_name and return -ENOMEM if the allocation fails.  This would of
course mean an API change, and you would need to make sure that all the
callers of xlate_proc_name handle the new return code correctly.
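
A userspace sketch of the check being asked for, with calloc() standing
in for kzalloc().  copy_component() and make_dir_component() are
hypothetical helpers made up for illustration; they are not functions
from fs/proc/generic.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Copy the first len bytes of cp into a new NUL-terminated string. */
static char *copy_component(const char *cp, size_t len)
{
	char *name = calloc(len + 1, 1);	/* zeroed, like kzalloc() */

	if (!name)
		return NULL;			/* caller must handle this */
	memcpy(name, cp, len);
	return name;
}

/* Hypothetical caller: 0 on success, -ENOMEM if the copy failed. */
static int make_dir_component(const char *cp, size_t len, char **out)
{
	char *name;

	assert(cp && out);
	name = copy_component(cp, len);
	if (!name)
		return -ENOMEM;		/* propagate instead of using NULL */
	*out = name;
	return 0;
}
```

The allocation result is tested before memcpy() touches it, and the
failure travels up as an error code that every caller has to handle.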

> +  memcpy(my_name, cp, len - 1);
> +  proc_mkdir(my_name,de);
> +  kfree(my_name);
> +}
> +
> +
> +struct proc_dir_entry   *parent_de = de;
> +for (de = parent_de->subdir; de ; de = de->next) {
> +  if (proc_match(len - 1, cp, de))
> +break;
> +
> +}
> +
> +if(de == NULL){
> +  /* we found no appropriate subdirectory, well create
> it now */

1. Email client cut the line.  Disable line wrapping.
2. Line too long - Documentation/CodingStyle

> +  char* my_name = kzalloc( (len - 1)  + 1, GFP_KERNEL);

Again, check for kzalloc return status.

> +  memcpy(my_name, cp, len - 1);
> +  de = proc_mkdir(my_name,parent_de);
> +  kfree(my_name);
> +}
> +
> +
> +

White space damage.  

Re: [patch 02/12] fuse: fix race between getattr and write

2007-10-04 Thread Andrew Morton
On Tue, 02 Oct 2007 17:50:28 +0200
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> @@ -228,6 +243,7 @@ static struct dentry *fuse_lookup(struct
>   struct fuse_conn *fc = get_fuse_conn(dir);
>   struct fuse_req *req;
>   struct fuse_req *forget_req;
> + u64 attr_version;
>  
>   if (entry->d_name.len > FUSE_NAME_MAX)
>   return ERR_PTR(-ENAMETOOLONG);
> @@ -242,6 +258,10 @@ static struct dentry *fuse_lookup(struct
>   return ERR_PTR(PTR_ERR(forget_req));
>   }
>  
> + spin_lock(&fc->lock);
> + attr_version = fc->attr_version;
> + spin_unlock(&fc->lock);

You might want to do this (oft-repeated) operation in a little helper
function.

Because I suspect that the lock isn't needed if CONFIG_64BIT=y.
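
A rough userspace sketch of such a helper, under stated assumptions:
struct conn, conn_lock() and conn_get_attr_version() are hypothetical
stand-ins for the fuse structures, the lock is a toy spinlock built on
C11 atomic_flag, and the 64-bit branch assumes that an aligned 64-bit
load is a single access on the target.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct fuse_conn. */
struct conn {
	atomic_flag lock;	/* toy spinlock guarding attr_version */
	uint64_t attr_version;
};

static void conn_lock(struct conn *c)
{
	while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
		;		/* spin until the flag is released */
}

static void conn_unlock(struct conn *c)
{
	atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

/* Writers always take the lock. */
static void conn_bump_attr_version(struct conn *c)
{
	assert(c);
	conn_lock(c);
	c->attr_version++;
	conn_unlock(c);
}

/* The helper: one place that decides whether a read needs the lock. */
static uint64_t conn_get_attr_version(struct conn *c)
{
	uint64_t v;

	assert(c);
#if UINTPTR_MAX > 0xffffffffu
	/* Assumption: an aligned 64-bit load is a single access here. */
	v = *(volatile uint64_t *)&c->attr_version;
#else
	conn_lock(c);
	v = c->attr_version;
	conn_unlock(c);
#endif
	return v;
}
```

Callers only ever see conn_get_attr_version(), so the locked and
lockless variants can be swapped without touching the call sites.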


Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

2007-10-04 Thread Andreas Dilger
On Oct 04, 2007  13:12 -0700, Andrew Morton wrote:
> On Mon, 01 Oct 2007 17:35:46 -0700
> > ext2: Avoid rec_len overflow with 64KB block size
> > 
> > into 16 bits we have for entry length. So we store 0x instead and
> > convert value when read from / written to disk.
> 
> This patch clashes in non-trivial ways with
> ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> patch, please.

If the rec_len overflow patch isn't going to make it, then we also need
to revert the EXT*_MAX_BLOCK_SIZE change to 65536.  It would be possible
to allow this to be up to 32768 w/o the rec_len overflow fix however.
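
To illustrate why 65536 does not fit while 32768 does, here is a sketch
of the conversion idea from the quoted changelog.  The marker value is
an assumption (the constant is cut off in the quote above), and the
helper names are hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed marker: not a multiple of 4, so never a real entry length. */
#define REC_LEN_64K_MARKER 0xffffu

/* In-memory length (up to 65536) to the 16-bit on-disk field. */
static uint16_t rec_len_to_disk(uint32_t len)
{
	assert(len <= 65536);
	if (len == 65536)
		return REC_LEN_64K_MARKER;	/* 65536 itself would wrap to 0 */
	return (uint16_t)len;
}

/* 16-bit on-disk field back to the in-memory length. */
static uint32_t rec_len_from_disk(uint16_t dlen)
{
	if (dlen == REC_LEN_64K_MARKER)
		return 65536;
	return dlen;
}
```

Every length up to 32768 passes through unchanged, which is why a block
size limit of 32768 would not need the conversion at all.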

Yes, this does imply that those patches were in the wrong order in the
patch series, and I apologize for that, even if it isn't my fault.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [PATCH] remove throttle_vm_writeout()

2007-10-04 Thread Miklos Szeredi
> None of the above.
> 
> [PATCH] vm: pageout throttling
> 
> With silly pageout testcases it is possible to place huge amounts of
> memory under I/O.  With a large request queue (CFQ uses 8192 requests) it
> is possible to place _all_ memory under I/O at the same time.
> 
> This means that all memory is pinned and unreclaimable and the VM gets
> upset and goes oom.
> 
> The patch limits the amount of memory which is under pageout writeout to
> be a little more than the amount of memory at which balance_dirty_pages()
> callers will synchronously throttle.
> 
> This means that heavy pageout activity can starve heavy writeback activity
> completely, but heavy writeback activity will not cause starvation of
> pageout.  Because we don't want a simple `dd' to be causing excessive
> latencies in page reclaim.
> 
> afaict that problem is still there.  It is possible to get all of
> ZONE_NORMAL dirty on a highmem machine.  With a large queue (or lots of
> queues), vmscan can then place all of ZONE_NORMAL under IO.
> 
> It could be that we've fixed this problem via other means in the interim,
> but from a quick peek it seems to me that the scanner will still do a 100%
> CPU burn when all of a zone's pages are under writeback.

Ah, OK.

I did read the changelog, but you added quite a bit of translation ;)

> throttle_vm_writeout() should be a per-zone thing, I guess.  Perhaps fixing
> that would fix your deadlock.  That's doubtful, but I don't know anything
> about your deadlock so I cannot say.

No, doing the throttling per-zone won't in itself fix the deadlock.

Here's a deadlock example:

Total memory = 32M
/proc/sys/vm/dirty_ratio = 10
dirty_threshold = 3M
ratelimit_pages = 1M

Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap on
a fuse fs.  Page balancing is called which turns all these into
writeback pages.

Then userspace filesystem gets a write request, and tries to allocate
memory needed to complete the writeout.

That will possibly trigger direct reclaim, and throttle_vm_writeout()
will be called.  That will block until nr_writeback goes below 3.3M
(dirty_threshold + 10%).  But since all 4M of writeback is from the
fuse fs, that will never happen.
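
The arithmetic can be checked with a small model.  This is a sketch of
the example's numbers only, not kernel code: struct wb_state and both
functions are hypothetical, sizes are in KiB, and the model assumes that
fuse writeback cannot complete while the fuse server itself is blocked.

```c
#include <assert.h>

/* All sizes in KiB, using the figures from the example above. */
struct wb_state {
	unsigned long dirty_threshold;	/* 3M in the example */
	unsigned long nr_writeback;	/* total under writeback */
	unsigned long fuse_writeback;	/* the part owned by the fuse fs */
};

/* Limit the throttle waits for: dirty_threshold plus 10%. */
static unsigned long writeout_limit(const struct wb_state *s)
{
	assert(s);
	return s->dirty_threshold + s->dirty_threshold / 10;
}

/*
 * Nonzero if a throttled fuse server can never be released: writeback is
 * over the limit and stays over it even once all non-fuse writeback ends.
 */
static int throttle_deadlocks(const struct wb_state *s)
{
	unsigned long limit = writeout_limit(s);

	return s->nr_writeback > limit && s->fuse_writeback > limit;
}
```

With dirty_threshold = 3072 the limit comes out at 3379 KiB (about
3.3M), and 4096 KiB of fuse-owned writeback stays above it for good.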

Does that explain it better?

Miklos


[PATCH 0/1] ia64: Convert cpu_sibling_map to a per_cpu data array FIX

2007-10-04 Thread travis

The previous version of this patch missed a code path in
inserting the boot cpu into the cpu sibling and core maps.

This fix corrects that omission.
--

-- 


[PATCH 1/1] ia64: Convert cpu_sibling_map to a per_cpu data array FIX

2007-10-04 Thread travis
There are two versions of per_cpu_init() for ia64.  This patch corrects
the problem that one of the versions did not insert the boot cpu
into the cpu sibling and core maps.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/ia64/kernel/setup.c |8 
 arch/ia64/mm/contig.c|6 --
 2 files changed, 8 insertions(+), 6 deletions(-)

--- linux.orig/arch/ia64/kernel/setup.c 2007-10-04 14:38:53.0 -0700
+++ linux/arch/ia64/kernel/setup.c  2007-10-04 14:51:46.289055433 -0700
@@ -873,6 +873,14 @@ cpu_init (void)
void *cpu_data;
 
cpu_data = per_cpu_init();
+   /*
+* insert boot cpu into sibling and core mapes
+* (must be done after per_cpu area is setup)
+*/
+   if (smp_processor_id() == 0) {
+   cpu_set(0, per_cpu(cpu_sibling_map, 0));
+   cpu_set(0, cpu_core_map[0]);
+   }
 
/*
 * We set ar.k3 so that assembly code in MCA handler can compute
--- linux.orig/arch/ia64/mm/contig.c2007-10-04 14:38:53.0 -0700
+++ linux/arch/ia64/mm/contig.c 2007-10-04 14:50:12.699513748 -0700
@@ -212,12 +212,6 @@ per_cpu_init (void)
cpu_data += PERCPU_PAGE_SIZE;
per_cpu(local_per_cpu_offset, cpu) = 
__per_cpu_offset[cpu];
}
-   /*
-* cpu_sibling_map is now a per_cpu variable - it needs to
-* be accessed after per_cpu_init() sets up the per_cpu area.
-*/
-   cpu_set(0, per_cpu(cpu_sibling_map, 0));
-   cpu_set(0, cpu_core_map[0]);
}
return __per_cpu_start + __per_cpu_offset[smp_processor_id()];
 }

-- 


Re: [PATCH for testing] Re: Decreasing stime running confuses top

2007-10-04 Thread Christian Borntraeger
Am Freitag, 5. Oktober 2007 schrieb Chuck Ebbert:
> On 10/04/2007 05:10 PM, Christian Borntraeger wrote:
> 
> > 
> 
> Alternative patch:
> 
> procfs: Don't read runtime twice when computing task's stime
> 
> Current code reads p->se.sum_exec_runtime twice and goes through
> multiple type conversions to calculate stime. Read it once and
> skip some of the conversions.
> 
> Signed-off-by: Chuck Ebbert <[EMAIL PROTECTED]>

Looks better and makes the code nicer. s390 and power should work as well as 
CONFIG_VIRT_CPU_ACCOUNTING is unaffected. 

If Frans successfully tests this patch, feel free to add

Acked-by: Christian Borntraeger <[EMAIL PROTECTED]>

Christian


Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES

2007-10-04 Thread Paul Mackerras
Linus Torvalds writes:

> Well, since others definitely don't see this, including me, and I can do 
> things like 62MB exec arrays:
> 
>   [EMAIL PROTECTED] linux]$ echo $(find /home/torvalds/) | wc
> 1  883304 63000962

That wouldn't actually do an exec, assuming you're using bash, since
echo is a shell builtin in bash.  You'd need to do /bin/echo.

Paul.


Re: SLUB performance regression vs SLAB

2007-10-04 Thread David Chinner
On Thu, Oct 04, 2007 at 03:07:18PM -0700, David Miller wrote:
> From: Chuck Ebbert <[EMAIL PROTECTED]>
> Date: Thu, 04 Oct 2007 17:47:48 -0400
> 
> > On 10/04/2007 05:11 PM, David Miller wrote:
> > > From: Chuck Ebbert <[EMAIL PROTECTED]>
> > > Date: Thu, 04 Oct 2007 17:02:17 -0400
> > > 
> > >> How do you simulate reading 100TB of data spread across 3000 disks,
> > >> selecting 10% of it using some criterion, then sorting and summarizing
> > >> the result?
> > > 
> > > You repeatedly read zeros from a smaller disk into the same amount of
> > > memory, and sort that as if it were real data instead.
> > 
> > You've just replaced 3000 concurrent streams of data with a single stream.
> > That won't test the memory allocator's ability to allocate memory to many
> > concurrent users very well.
> 
> You've kindly removed my "thinking outside of the box" comment.
> 
> The point was not that my specific suggestion would be perfect, but that
> if you used your creativity and thought in similar directions you might find
> a way to do it.
> 
> People are too narrow minded when it comes to these things, and that's the
> problem I want to address.

And it's a good point, too, because often problems to one person are a
no-brainer to someone else.

Creating lots of "fake" disks is trivial to do, IMO.  Use loopback on sparse
files containing sparse files, use ramdisks containing sparse files or write a
sparse dm target for sparse block device mapping, etc. I'm sure there's more
than the few I just threw out...
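
As a minimal sketch of the sparse-file half of that idea: the function
below creates a file whose apparent size is large while almost no blocks
are allocated.  make_sparse_file() and the path used with it are
hypothetical, and whether the file really ends up sparse depends on the
filesystem.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or truncate) path and extend it to size bytes without writing
 * any data.  Fills *st with the resulting file status.  Returns 0 on
 * success, -1 on error.
 */
static int make_sparse_file(const char *path, off_t size, struct stat *st)
{
	int fd, ret;

	assert(path && st && size >= 0);
	fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
	if (fd < 0)
		return -1;
	ret = ftruncate(fd, size);	/* sets the size, allocates no data */
	if (ret == 0)
		ret = fstat(fd, st);	/* st_blocks shows real allocation */
	close(fd);
	return ret;
}
```

Comparing st_size with st_blocks * 512 shows how little is allocated;
such a file could then be attached to a loop device to act as one of the
"fake" disks.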

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


2.6.23-rc9-rt2

2007-10-04 Thread Steven Rostedt
We are pleased to announce the 2.6.23-rc9-rt2 tree, which can be
downloaded from the new location:

 http://www.kernel.org/pub/linux/kernel/projects/rt/

Changes since 2.6.23-rc9-rt1

  - x86_64 disable IST for debug (Andi Kleen)

  - Better handling of dynticks going bad in RCU (Steven Rostedt)

  - Preempt RCU boosting (Steven Rostedt based on Paul E. McKenney's stuff)

Again, this still holds experimental code. But I've been running it on a
few boxes already (and even the box I'm writing this on). 

To build a 2.6.23-rc9-rt2 tree, the following patches should be applied:

  http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.22.tar.bz2
  http://www.kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.23-rc9.bz2
  http://www.kernel.org/pub/linux/kernel/projects/rt/patch-2.6.23-rc9-rt2.bz2

The broken out patches are also available.

-- Steve








Re: SLUB performance regression vs SLAB

2007-10-04 Thread David Miller
From: Chuck Ebbert <[EMAIL PROTECTED]>
Date: Thu, 04 Oct 2007 17:47:48 -0400

> On 10/04/2007 05:11 PM, David Miller wrote:
> > From: Chuck Ebbert <[EMAIL PROTECTED]>
> > Date: Thu, 04 Oct 2007 17:02:17 -0400
> > 
> >> How do you simulate reading 100TB of data spread across 3000 disks,
> >> selecting 10% of it using some criterion, then sorting and
> >> summarizing the result?
> > 
> > You repeatedly read zeros from a smaller disk into the same amount of
> > memory, and sort that as if it were real data instead.
> 
> You've just replaced 3000 concurrent streams of data with a single
> stream.  That won't test the memory allocator's ability to allocate
> memory to many concurrent users very well.

You've kindly removed my "thinking outside of the box" comment.

The point was not that my specific suggestion would be
perfect, but that if you used your creativity and thought
in similar directions you might find a way to do it.

People are too narrow minded when it comes to these things, and
that's the problem I want to address.


Re: [PATCH for testing] Re: Decreasing stime running confuses top

2007-10-04 Thread Chuck Ebbert
On 10/04/2007 05:10 PM, Christian Borntraeger wrote:

> 

Alternative patch:

procfs: Don't read runtime twice when computing task's stime

Current code reads p->se.sum_exec_runtime twice and goes through
multiple type conversions to calculate stime. Read it once and
skip some of the conversions.

Signed-off-by: Chuck Ebbert <[EMAIL PROTECTED]>

--- linux-2.6.23-rc6-dell.orig/fs/proc/array.c
+++ linux-2.6.23-rc6-dell/fs/proc/array.c
@@ -334,39 +334,38 @@ static cputime_t task_stime(struct task_
    return p->stime;
 }
 #else
-static cputime_t task_utime(struct task_struct *p)
+static clock_t __task_utime(struct task_struct *p, u64 runtime)
 {
    clock_t utime = cputime_to_clock_t(p->utime),
            total = utime + cputime_to_clock_t(p->stime);
-   u64 temp;
 
    /*
     * Use CFS's precise accounting:
     */
-   temp = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
-
    if (total) {
-       temp *= utime;
-       do_div(temp, total);
+       runtime *= utime;
+       do_div(runtime, total);
    }
-   utime = (clock_t)temp;
+   return (clock_t)runtime;
+}
 
-   return clock_t_to_cputime(utime);
+static cputime_t task_utime(struct task_struct *p)
+{
+   u64 runtime = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
+
+   return clock_t_to_cputime(__task_utime(p, runtime));
 }
 
 static cputime_t task_stime(struct task_struct *p)
 {
-   clock_t stime;
+   u64 runtime = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
 
    /*
     * Use CFS's precise accounting. (we subtract utime from
     * the total, to make sure the total observed by userspace
     * grows monotonically - apps rely on that):
     */
-   stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
-           cputime_to_clock_t(task_utime(p));
-
-   return clock_t_to_cputime(stime);
+   return clock_t_to_cputime(runtime - __task_utime(p, runtime));
 }
 #endif
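
Editorial sketch: in the old code, task_stime() sampled
p->se.sum_exec_runtime and then called task_utime(), which sampled it
again, so the subtraction could mix two different totals. The userspace
model below shows the same scale-then-subtract arithmetic on a single
sample. It is not kernel code; scaled_utime() and scaled_stime() and the
plain integer types are hypothetical stand-ins chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the patch's __task_utime():
 * scale one runtime sample by the utime / (utime + stime) tick ratio.
 * When no ticks were recorded, the sample is returned unscaled. */
uint64_t scaled_utime(uint64_t runtime, uint64_t utime_ticks,
                      uint64_t stime_ticks)
{
    uint64_t total = utime_ticks + stime_ticks;

    if (total)
        runtime = runtime * utime_ticks / total;
    return runtime;
}

/* stime is derived from the same runtime sample, so utime + stime
 * always adds up to exactly that sample. */
uint64_t scaled_stime(uint64_t runtime, uint64_t utime_ticks,
                      uint64_t stime_ticks)
{
    return runtime - scaled_utime(runtime, utime_ticks, stime_ticks);
}
```

With one sample of 1000 and a 30:10 tick ratio, the split is 750 for
utime and 250 for stime, and the two always sum to the sample.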
 



Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES

2007-10-04 Thread Mathieu Chouquet-Stringer
On Thu, Oct 04, 2007 at 07:17:50PM +0200, Peter Zijlstra wrote:
> what happens if you up the stack limit to say 128M ?
> 
> Also, do you happen to have execve syscall audit stuff enabled?

Actually, you were right: not only is it enabled, it's also the
culprit.  If I stop it, all is well...

Sorry for the noise.

-- 
Mathieu Chouquet-Stringer   [EMAIL PROTECTED]
The sun itself sees not till heaven clears.
 -- William Shakespeare --


Re: [PATCH] remove throttle_vm_writeout()

2007-10-04 Thread Andrew Morton
On Thu, 04 Oct 2007 14:25:22 +0200
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> By relying on the global dirty limits, this can cause a deadlock when
> devices are stacked.
> 
> If the stacking is done through a fuse filesystem, the __GFP_FS,
> __GFP_IO tests won't help: the process doing the allocation doesn't
> have any special flag.

This description of the bug-which-is-being-fixed is nowhere near adequate
enough for a reviewer to understand the problem.  This makes it hard to
suggest alternative fixes.

> So why exactly does this function exist?

That's described in the changelog for the patch which added
throttle_vm_writeout().  Unsurprisingly ;)

> Direct reclaim does not _increase_ the number of dirty pages in the
> system, so rate limiting it seems somewhat pointless.
> 
> There are two cases:
> 
> 1) File backed pages -> file
> 
>   dirty + writeback count remains constant
> 
> 2) Anonymous pages -> swap
> 
>   writeback count increases, dirty balancing will hold back file
>   writeback in favor of swap
> 
> So the real question is: does case 2 need rate limiting, or is it OK
> to let the device queue fill with swap pages as fast as possible?

None of the above.

[PATCH] vm: pageout throttling

With silly pageout testcases it is possible to place huge amounts of memory
under I/O.  With a large request queue (CFQ uses 8192 requests) it is
possible to place _all_ memory under I/O at the same time.

This means that all memory is pinned and unreclaimable and the VM gets
upset and goes oom.

The patch limits the amount of memory which is under pageout writeout to be
a little more than the amount of memory at which balance_dirty_pages()
callers will synchronously throttle.

This means that heavy pageout activity can starve heavy writeback activity
completely, but heavy writeback activity will not cause starvation of
pageout.  Because we don't want a simple `dd' to be causing excessive
latencies in page reclaim.
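
Editorial sketch: the rule in that changelog can be modelled as a simple
predicate. This is an illustration only, not the kernel function:
pageout_may_proceed() and its parameters are invented names, and the 10%
margin is an assumption standing in for "a little more".

```c
#include <stdbool.h>

/* Hypothetical model: pageout may keep submitting writes while the
 * number of pages under writeback stays at or below the dirty
 * threshold plus a margin. The 10% margin is assumed for illustration. */
bool pageout_may_proceed(unsigned long writeback_pages,
                         unsigned long dirty_thresh)
{
    unsigned long limit = dirty_thresh + dirty_thresh / 10;

    return writeback_pages <= limit;
}
```

Because the pageout limit sits above the point where ordinary writers
throttle, a heavy writer stalls before pageout does, which is the
asymmetry the changelog describes.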

afaict that problem is still there.  It is possible to get all of
ZONE_NORMAL dirty on a highmem machine.  With a large queue (or lots of
queues), vmscan can then place all of ZONE_NORMAL under IO.

It could be that we've fixed this problem via other means in the interim,
but from a quick peek it seems to me that the scanner will still do a 100%
CPU burn when all of a zone's pages are under writeback.

throttle_vm_writeout() should be a per-zone thing, I guess.  Perhaps fixing
that would fix your deadlock.  That's doubtful, but I don't know anything
about your deadlock so I cannot say.



2.6.23-rc9-git2: Known regressions from 2.6.22

2007-10-04 Thread Rafael J. Wysocki
Hi,

This message contains a list of some known regressions from 2.6.22 for which
there are no fixes in the mainline that I know of.  If any of them have been
fixed already, please let me know.

If you know of any other unresolved regressions from 2.6.22, please let me know
as well and I'll add them to the list.


Subject:    zd1211 device is no longer configured
Submitter:  Oliver Neukum <[EMAIL PROTECTED]>
References: http://marc.info/?l=linux-usb-devel=118854967709322=2
            http://bugzilla.kernel.org/show_bug.cgi?id=8972
Caused-By:  Daniel Drake <[EMAIL PROTECTED]>
            commit 74553aedd46b3a2cae986f909cf2a3f99369decc


Subject:    Oops while modprobing phy fixed module
Submitter:  Gabriel C <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/7/14/63
            http://bugzilla.kernel.org/show_bug.cgi?id=9060
Handled-By: Satyam Sharma <[EMAIL PROTECTED]>
            Vitaly Bordug <[EMAIL PROTECTED]>
            Tejun Heo <[EMAIL PROTECTED]>
Patch:      http://lkml.org/lkml/2007/7/18/506


Subject:    ACPI problems: 2.6.22-git17 working, 2.6.23-rc1* is not
Submitter:  Danny ter Haar <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/7/27/298
            http://lkml.org/lkml/2007/7/29/371
            http://bugzilla.kernel.org/show_bug.cgi?id=9061
Handled-By: Len Brown <[EMAIL PROTECTED]>


Subject:    empty suspend stopped working around 2.6.23-rc4
Submitter:  Pavel Machek <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/9/11/326
            http://bugzilla.kernel.org/show_bug.cgi?id=9075


Subject:    umount triggers a warning in jfs and takes almost a minute
Submitter:  Oliver Neukum <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/9/4/73
            http://bugzilla.kernel.org/show_bug.cgi?id=9076
Handled-By: Dave Kleikamp <[EMAIL PROTECTED]>
Patch:      http://bugzilla.kernel.org/attachment.cgi?id=13023=view


Subject:    build #301 failed for 2.6.23-rc6-g0d4cbb5 in linux/drivers/net/wireless/libertas/
Submitter:  Toralf Förster <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/9/11/150
            http://bugzilla.kernel.org/show_bug.cgi?id=9077
Handled-By: Randy Dunlap <[EMAIL PROTECTED]>
Patch:      http://bugzilla.kernel.org/attachment.cgi?id=12963=view


Subject:    NETDEV WATCHDOG: eth0: transmit timed out
Submitter:  Karl Meyer <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/8/13/737
            http://bugzilla.kernel.org/show_bug.cgi?id=9079
Handled-By: Francois Romieu <[EMAIL PROTECTED]>


Subject:    Weird network problems with 2.6.23-rc2
Submitter:  Shish <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/8/11/40
            http://bugzilla.kernel.org/show_bug.cgi?id=9080


Subject:    powersaving degradation, (time spend in C0 goes up after a while)
Submitter:  Christian Leber <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/9/2/142
            http://lkml.org/lkml/2007/9/2/207
            http://bugzilla.kernel.org/show_bug.cgi?id=9081


Subject:    vga text console not working on 2.6.23-rc8
Submitter:  Santiago Garcia Mantinan <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/9/28/342
            http://bugzilla.kernel.org/show_bug.cgi?id=9099
Handled-By: H. Peter Anvin <[EMAIL PROTECTED]>
            Antonino A. Daplas <[EMAIL PROTECTED]>


Subject:    kernel oops when unplugging usb mouse, sometimes hardlock when moving mouse
Submitter:  o. meijer <[EMAIL PROTECTED]>
References: http://bugzilla.kernel.org/show_bug.cgi?id=9111
Handled-By: Dmitry Torokhov <[EMAIL PROTECTED]>


Subject:    2.6.23-rc9 boot failure (megaraid?)
Submitter:  Burton Windle <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/10/2/243
            http://bugzilla.kernel.org/show_bug.cgi?id=9113
Handled-By: Adrian Bunk <[EMAIL PROTECTED]>
            FUJITA Tomonori <[EMAIL PROTECTED]>
Caused-By:  FUJITA Tomonori <[EMAIL PROTECTED]>
            commit 3f6270ef76f2ce5c134615a470685d6c2a66c07e
            [SCSI] megaraid_old: convert to use the data buffer accessors
Patch:      http://lkml.org/lkml/2007/10/4/294


Subject:    kernel BUG at arch/i386/mm/highmem.c:15!  on 2.6.23-rc8/rc9
Submitter:  gurudas pai <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/10/4/61
            http://bugzilla.kernel.org/show_bug.cgi?id=9122
Handled-By: Nick Piggin <[EMAIL PROTECTED]>
            Hugh Dickins <[EMAIL PROTECTED]>
Patch:      http://lkml.org/lkml/2007/10/4/256


Subject:    2.6.23-rcX SG_GET_SCSI_ID regression?
Submitter:  Joerg Platte <[EMAIL PROTECTED]>
References: http://lkml.org/lkml/2007/10/3/101
            http://bugzilla.kernel.org/show_bug.cgi?id=9123

For details, please follow the links 

Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES

2007-10-04 Thread Mathieu Chouquet-Stringer
On Thu, Oct 04, 2007 at 05:50:00PM -0400, Chuck Ebbert wrote:
> On 10/04/2007 01:05 PM, Mathieu Chouquet-Stringer wrote:
> > In the kernel source tree, if I run a stupid find | xargs ls, I now get
> > this:
> > xargs: ls: Argument list too long
> > 
> 
> Can you strace it to see what syscall is failing?

Sure:
25789 <... execve resumed> )= -1 E2BIG (Argument list too long)

I'm going to reboot to a kernel that has Linus' printks...

-- 
Mathieu Chouquet-Stringer   [EMAIL PROTECTED]
The sun itself sees not till heaven clears.
 -- William Shakespeare --


pm qos infrastructure and interface

2007-10-04 Thread Mark Gross
The following patch is a generalization of the latency.c implementation
done by Arjan last year.  It provides infrastructure for more than one
parameter, and exposes a user mode interface for processes to register
pm_qos expectations of processes.


This interface provides a kernel and user mode interface for registering
performance expectations by drivers, subsystems and user space
applications on one of the parameters.

Currently we have {cpu_dma_latency, network_latency, network_throughput}
as the initial set of pm_qos parameters.

The infrastructure exposes multiple misc device nodes, one per
implemented parameter.  The set of parameters implemented is defined by
pm_qos_power_init() and pm_qos_params.h.  This is done because having
the available parameters being runtime configurable or changeable from a
driver was seen as too easy to abuse.

For each parameter a list of performance requirements is maintained
along with an aggregated target value.  The aggregated target value is
updated with changes to the requirement list or elements of the list.
Typically the aggregated target value is simply the max or min of the
requirement values held in the parameter list elements.
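
Editorial sketch: as a rough illustration of the aggregation described
above, the userspace model below keeps a small requirement list and
recomputes a min-based target. All names here (struct requirement,
aggregate_min) are hypothetical and are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical requirement entry: a name plus a requested value. */
struct requirement {
    const char *name;
    int32_t value;
};

/* Aggregate for a latency-style parameter: the smallest requested
 * value wins. default_value acts as an upper bound and is returned
 * when no requirement is lower, including for an empty list. */
int32_t aggregate_min(const struct requirement *reqs, size_t n,
                      int32_t default_value)
{
    int32_t target = default_value;
    size_t i;

    for (i = 0; i < n; i++)
        if (reqs[i].value < target)
            target = reqs[i].value;
    return target;
}
```

A throughput-style parameter would use the mirror-image max rule; a
notifier would only need to fire when the returned target differs from
the previous one.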

From kernel mode the use of this interface is simple:
pm_qos_add_requirement(param_id, name, target_value):
Will insert a named element in the list for that identified PM_QOS
parameter with the target value.  Upon change to this list the new
target is recomputed and any registered notifiers are called only if the
target value is now different.

pm_qos_update_requirement(param_id, name, new_target_value):
Will search the list identified by the param_id for the named list
element and then update its target value, calling the notification tree
if the aggregated target is changed.  with that name is already
registered.

pm_qos_remove_requirement(param_id, name):
Will search the identified list for the named element and remove it,
after removal it will update the aggregate target and call the
notification tree if the target was changed as a result of removing the
named requirement.


From user mode:
Only processes can register a pm_qos requirement.  To provide for
automatic cleanup of a process, the interface requires the process to
register its parameter requirements in the following way:

To register the default pm_qos target for the specific parameter, the
process must open one of /dev/[cpu_dma_latency, network_latency,
network_throughput]

As long as the device node is held open that process has a registered
requirement on the parameter.  The name of the requirement is
"process_<pid>", derived from the current->pid from within the open
system call.

To change the requested target value the process needs to write a s32
value to the open device node.  This translates to a
pm_qos_update_requirement call.

To remove the user mode request for a target value simply close the
device node.
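
Editorial sketch: a user-space client following the steps above might
look like the code below. It assumes the device nodes named in this mail
exist and accept a raw s32 write; pm_qos_request() is a hypothetical
helper, not an API from the patch.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: open a pm_qos device node and write the
 * requested s32 target to it. Returns the open file descriptor (the
 * requirement lasts until it is closed) or -1 on failure. */
int pm_qos_request(const char *node, int32_t target)
{
    int fd = open(node, O_WRONLY);

    if (fd < 0)
        return -1;
    if (write(fd, &target, sizeof(target)) != (ssize_t)sizeof(target)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would keep the descriptor from, say,
pm_qos_request("/dev/cpu_dma_latency", 50) for as long as the requirement
should hold, and close() it to drop the request. The unit of the value is
whatever the parameter defines; the mail does not state it.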

--mgross


Signed-off-by: mark gross <[EMAIL PROTECTED]>

---

diff -urN -X linux-2.6.23-rc8/Documentation/dontdiff linux-2.6.23-rc8/Documentation/pm_qos_interface.txt linux-2.6.23-rc8-qos/Documentation/pm_qos_interface.txt
--- linux-2.6.23-rc8/Documentation/pm_qos_interface.txt 1969-12-31 16:00:00.0 -0800
+++ linux-2.6.23-rc8-qos/Documentation/pm_qos_interface.txt 2007-10-04 14:26:58.0 -0700
@@ -0,0 +1,59 @@
+PM quality of Service interface.
+
+This interface provides a kernel and user mode interface for registering
+performance expectations by drivers, subsystems and user space applications on
+one of the parameters.
+
+Currently we have {cpu_dma_latency, network_latency, network_throughput} as the
+initial set of pm_qos parameters.
+
+The infrastructure exposes multiple misc device nodes one per implemented
+parameter.  The set of parameters implement is defined by pm_qos_power_init()
+and pm_qos_params.h.  This is done because having the available parameters
+being runtime configurable or changeable from a driver was seen as too easy to
+abuse.
+
+For each parameter a list of performance requirements is maintained along with
+an aggregated target value.  The aggregated target value is updated with
+changes to the requirement list or elements of the list.  Typically the
+aggregated target value is simply the max or min of the requirement values held
+in the parameter list elements.
+
+From kernel mode the use of this interface is simple:
+pm_qos_add_requirement(param_id, name, target_value):
+Will insert a named element in the list for that identified PM_QOS parameter
+with the target value.  Upon change to this list the new target is recomputed
+and any registered notifiers are called only if the target value is now
+different.
+
+pm_qos_update_requirement(param_id, name, new_target_value):
+Will search the list identified by the param_id for the named list element and
+then update its target value, calling the notification tree if the aggregated
+target is changed.  with that name is already registered.
+

Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES

2007-10-04 Thread Chuck Ebbert
On 10/04/2007 01:05 PM, Mathieu Chouquet-Stringer wrote:
> In the kernel source tree, if I run a stupid find | xargs ls, I now get
> this:
> xargs: ls: Argument list too long
> 

Can you strace it to see what syscall is failing?


Re: SLUB performance regression vs SLAB

2007-10-04 Thread Chuck Ebbert
On 10/04/2007 05:11 PM, David Miller wrote:
> From: Chuck Ebbert <[EMAIL PROTECTED]>
> Date: Thu, 04 Oct 2007 17:02:17 -0400
> 
>> How do you simulate reading 100TB of data spread across 3000 disks,
>> selecting 10% of it using some criterion, then sorting and
>> summarizing the result?
> 
> You repeatedly read zeros from a smaller disk into the same amount of
> memory, and sort that as if it were real data instead.

You've just replaced 3000 concurrent streams of data with a single
stream.  That won't test the memory allocator's ability to allocate
memory to many concurrent users very well.


Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES

2007-10-04 Thread Mathieu Chouquet-Stringer
On Thu, Oct 04, 2007 at 07:17:50PM +0200, Peter Zijlstra wrote:
> /me tries
> 
> yep works like a charm, and that is a tree with a full git repo and
> several build dirs in it.

Well, what can I say? ;-)

> what happens if you up the stack limit to say 128M ?

It's unlimited.

> Also, do you happen to have execve syscall audit stuff enabled?

Nope.

-- 
Mathieu Chouquet-Stringer   [EMAIL PROTECTED]
The sun itself sees not till heaven clears.
 -- William Shakespeare --


Re: 2.6.23-rc9: Oops in cache_alloc_refill() mm/slab.c

2007-10-04 Thread Badari Pulavarty
On Thu, 2007-10-04 at 18:13 +0200, Valerie Clement wrote:
> While running ffsb tests on my ext4 filesystem, I got an Oops in 
> cache_alloc_refill().
> I turned on SLAB debugging and here is the message I got:
> 
> slab: Internal list corruption detected in cache 'buffer_head'(30), 
> slabp 81007e100100(1515870810). Hexdump:

slabp->inuse = 1515870810 looks bogus. Is this easily reproducible ?
What tests are you running through ffsb ?
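
Editorial sketch: one way to read that value is that 1515870810 equals
0x5a5a5a5a, the same 0x5a byte that fills most of the hexdump below. The
helper only checks that arithmetic; treating 0x5a as a slab debug poison
byte is an assumption drawn from the dump, not something stated in the
report, and repeat_byte() is a hypothetical name.

```c
#include <stdint.h>

/* Build a 32-bit word by repeating one byte four times. */
uint32_t repeat_byte(uint8_t b)
{
    return (uint32_t)b * 0x01010101u;
}
```

If that reading is right, the slab descriptor itself was overwritten
with fill bytes rather than holding a merely wrong count.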

> 000: 5a 5a 5a 5a 5a 5a 5a 5a b8 23 34 7e 00 81 ff ff
> 010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 020: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 030: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 040: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 050: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 060: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a a5
> 070: c0 88 56 63 c5 56 41 d8 f1 37 4a 80 ff ff ff ff
> 080: c0 88 56 63 c5 56 41 d8 80 33 53 7d 00 81 ff ff
> 090: e8 25 60 7d 00 81 ff ff 68 cb 3b 01 00 81 ff ff
> 0a0: 18 68 50 7d 00 81 ff ff
> [ cut here ]
> kernel BUG at /home/clementv/src/linux-2.6.23-rc9/mm/slab.c:2923!
> invalid opcode:  [1] SMP
> CPU 2
> Modules linked in: qla2xxx
> Pid: 4041, comm: ffsb Not tainted 2.6.23-rc9 #2
> RIP: 0010:[]  [] check_slabp+0xb5/0xc1
> RSP: 0018:8100774bb958  EFLAGS: 00010096
> RAX: 0001 RBX: 81007e100100 RCX: 6d20
> RDX:  RSI: 0046 RDI: 81007e347280
> RBP: 00a8 R08: 0005 R09: 8060bb10
> R10: 000ae468 R11: 00050002 R12: 00a8
> R13: 81007e347280 R14: 81007e347280 R15: 0002
> FS:  41802950(0063) GS:81007e0c4728() knlGS:
> CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: 5f83d00c CR3: 78149000 CR4: 06e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process ffsb (pid: 4041, threadinfo 8100774ba000, task 81007dbdc7a0)
> Stack:  000d 000e 81007e100100 81007e342398
>   81007e078488 80277069 8050 81007e347280
>   8050 0246 80299539 f000
> Call Trace:
>   [] cache_alloc_refill+0xc8/0x23f
>   [] alloc_buffer_head+0x14/0x45
>   [] kmem_cache_alloc+0x94/0xe9
>   [] alloc_buffer_head+0x14/0x45
>   [] alloc_page_buffers+0x38/0xd5
>   [] create_empty_buffers+0x14/0x9b
>   [] __block_prepare_write+0x7c/0x45b
>   [] ext4_get_block+0x0/0x139
>   [] block_prepare_write+0x1a/0x25
>   [] ext4_prepare_write+0xaf/0x175
>   [] generic_file_buffered_write+0x288/0x631
>   [] __generic_file_aio_write_nolock+0x33f/0x3a9
>   [] enqueue_entity+0x17c/0x1a3
>   [] generic_file_aio_write+0x61/0xc1
>   [] __check_preempt_curr_fair+0x56/0x76
>   [] ext4_file_write+0x16/0x91
>   [] do_sync_write+0xc9/0x10c
>   [] file_move+0x1d/0x4c
>   [] autoremove_wake_function+0x0/0x2e
>   [] do_filp_open+0x2a/0x38
>   [] poison_obj+0x26/0x30
>   [] vfs_write+0xad/0x136
>   [] sys_write+0x45/0x6e
>   [] system_call+0x7e/0x83
> 
> 
> Valérie




Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES

2007-10-04 Thread Mathieu Chouquet-Stringer
Thank you for getting back to me.

On Thu, Oct 04, 2007 at 10:27:52AM -0700, Linus Torvalds wrote:
> What does your "ulimit -s" say?

That's actually the first thing I checked.

mchouque - /usr/src/kernel/linux %ulimit -s
unlimited

And for the record, ulimit -a yields:
-t: cpu time (seconds)          unlimited
-f: file size (blocks)          unlimited
-d: data seg size (kbytes)      unlimited
-s: stack size (kbytes)         unlimited
-c: core file size (blocks)     0
-m: resident set size (kbytes)  unlimited
-u: processes                   16375
-n: file descriptors            1024
-l: locked-in-memory size (kb)  32
-v: address space (kb)          unlimited
-x: file locks                  unlimited
-i: pending signals             16375
-q: bytes in POSIX msg queues   819200
-N 13:                          0
-N 14:                          0


> I suspect that you might hit the code that limits execve() arguments to 
> one quarter of the maximum stack size.
> 
> We could change that from 25% to something else (half? three quarters?), 
> but if you really are hitting that limit, it sounds like you may have a 
> really small stack size to begin with (ie if 25% is smaller than the old 
> argument size limit of 128kB, you're running with a stack limit of less 
> than half a meg, which sounds pretty dang small).
> 
> So I'd like to verify that the stack limit really is the issue, and not 
> something else.
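
Editorial sketch: the arithmetic behind that observation can be checked
with a small model. arg_limit_bytes() is a hypothetical helper that
captures only the "one quarter of the stack limit" rule quoted above; it
is not the kernel's code.

```c
#include <stdint.h>

/* The old fixed argument limit mentioned in the mail: 128kB. */
#define OLD_ARG_LIMIT_BYTES (128u * 1024u)

/* Hypothetical model of the quoted rule: execve() arguments may use
 * at most one quarter of the stack size limit. */
uint64_t arg_limit_bytes(uint64_t stack_limit_bytes)
{
    return stack_limit_bytes / 4;
}
```

Under this model a 512kB stack limit gives exactly the old 128kB, so
only a stack limit below half a megabyte would shrink the argument
space, and an unlimited stack should not run into the 25% rule at all.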

Anything else you'd like me to try?

-- 
Mathieu Chouquet-Stringer   [EMAIL PROTECTED]
The sun itself sees not till heaven clears.
 -- William Shakespeare --


Re: [RFC/PATCH -v2] Add sysfs control to modify a user's cpu share

2007-10-04 Thread Valdis . Kletnieks
On Thu, 04 Oct 2007 10:54:51 +0200, Heiko Carstens said:
> >   echo 2048 > /sys/kernel/uids/500/cpu_share
> > 
> > this should just work too, regardless of there not being any UID 500 
> > tasks yet. Likewise, once configured, the /sys/kernel/uids/* directories 
> > (with the settings in them) should probably not go away either.
> 
> Shouldn't that be done via uevents? E.g. UID x gets added to the sysfs tree,
> generates a uevent and a script then figures out the cpu_share and sets it.

That would tend to be a tad racy - a site may want to set limits in the
hypothetical /sys/kernel/uids/NNN before the program has a chance to fork-bomb
or otherwise make it difficult to set a limit from within another userspace
process.  It's similar to why you want a process to be launched with all its
ulimits set, rather than setting them after the fork/exec happens...





Re: [PATCH 4/5] writeback: remove pages_skipped accounting in __block_write_full_page()

2007-10-04 Thread Andrew Morton
On Tue, 02 Oct 2007 16:41:47 +0800
Fengguang Wu <[EMAIL PROTECTED]> wrote:

> This patch fixes this bug. Though I'm not sure why __block_write_full_page()
> is called only to do nothing and who actually issued the writeback for us.

kjournald wrote the page's buffers back (ext3 in ordered-data mode).  The VM
didn't know about that, so we have a PageDirty page which has clean
buffers.

We rely upon the VFS writeback code to "discover" that this dirty page has
clean buffers: the VFS will attempt to write the dirty page and will end up
marking the page clean without performing any IO.



Re: [BUG] Linux 2.6.23-rc9 and MAX_ARG_PAGES

2007-10-04 Thread Linus Torvalds


On Thu, 4 Oct 2007, Mathieu Chouquet-Stringer wrote:
> 
> Anything else you'd like me to try?

Well, since others definitely don't see this, including me, and I can do 
things like 62MB exec arrays:

[EMAIL PROTECTED] linux]$ echo $(find /home/torvalds/) | wc
  1  883304 63000962

without getting any overflows (much less just on the kernel sources, which 
is less than a megabyte of pathnames), I think it would be good if you 
were to just instrument the kernel and make it do a "printk()" when it 
returns E2BIG in fs/exec.c (or the NULL returns from get_arg_page()).

Just to figure out *which* test fails for you but apparently nobody else.

Linus


Re: [patch 1/3] Trace code and documentation

2007-10-04 Thread Andi Kleen
On Thu, Oct 04, 2007 at 12:19:35PM -0700, David Wilder wrote:
> Andi Kleen wrote:
> >"David J. Wilder" <[EMAIL PROTECTED]> writes:
> >>@@ -0,0 +1,160 @@
> >>+Trace Setup and Control
> >>+===
> >>+In the kernel, the trace interface provides a simple mechanism for
> >>+starting and managing data channels (traces) to user space.
> >
> >Wasn't relayfs supposed to do that already? Why do you need another
> >wrapper around it? 
> 
> The code in trace is exactly what all the current users of relay do. 
> Therefore trace reduces the duplication of code.

If everybody does this then the code should be just put into
relayfs?

> 
> 
> >
> >Is this also really still faster than a printk below log level
> >(without console driver overhead). If not then why not just
> >use printk?
> 
> Are you arguing against relayfs or trace?  Trace just makes relayfs 
> easier to use.  I think relayfs can stand up for itself.

I'm arguing against complicated trace mechanisms that are not fast.

At some point when I looked at relayfs it seemed to be reasonably
fast (per-cpu buffers; not much locking; overhead per call roughly like
putchar()), but that might have regressed.

Your example module with its lock definitely looks very slow and I don't approve
of it.

> 
> 
> The example shows a way to create an ASCII data layer.

ASCII layers don't make much sense imho -- these should just use printk.

Fast dedicated binary log channels make sense though; but you don't
seem to be really focused on that.

> True, to make trace "fast" you need a data layer that can handle the 
> requirements of per-cpu buffers.  However there are still advantages of 
> trace over printk even when using global buffers: selectable buffer 
> sizes,

printk has selectable buffer sizes too.

>"Long term we probably want more complex tracing based on lttng,
> but I'm a big fan of starting out simple and doing incremental
> changes."

It's just that relayfs + another not simple layer are definitely not simple.

For a simple logger I'm thinking more like something like SGI's old
ktrace module (which undoubtedly many other people have recreated many
times for specific debugging scenarios)

But that all only makes sense if the overhead is really kept low
and I don't see that in your approach.


> One advantage of the trace approach is separating control and data 
> layers, therefore trace can support multiple data layers to fit multiple 
> requirements.
> 
> I have my ideas on how to develop data layer, others may have their own 
> ideas and I welcome the input.

relayfs was supposed to be that data layer. 

> PS: Systemtap has been criticized for introducing out-of-tree kernel 
> code.  A clear direction from the community is to move re-usable code 
> in-tree where it can be maintained.  Trace is a move in that direction.

I'm all for that. I believe a simple fast efficient no frills logger
would serve systemtap just fine too. But the approach here seems
to be more to add all kinds of knobs and whizzles until you end
up with something as slow as printk. And since we already have
printk another one just doesn't seem to make much sense.

-Andi

