Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump

2007-12-10 Thread Huang, Ying
On Mon, 2007-12-10 at 19:25 -0700, Eric W. Biederman wrote:
> "Huang, Ying" <[EMAIL PROTECTED]> writes:
[...]
> >  /*
> >   * Do not allocate memory (or fail in any way) in machine_kexec().
> >   * We are past the point of no return, committed to rebooting now.
> >   */
> > -NORET_TYPE void machine_kexec(struct kimage *image)
> > +int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
> > +unsigned int argc, va_list args)
> >  {
> 
> Why do we need var arg support?
> Can't we do that with a shim we load from user space?

If all parameters are provided from user space, the usage model might be as
follows:

- sys_kexec_load() /* with executable/data/parameters(A) loaded */
- sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with parameters(A) */
- /* jump back */
- sys_kexec_load() /* with executable/data/parameters(B) loaded */
- sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with parameters(B) */
- /* jump back */

That is, the kexec image has to be reloaded whenever the parameters
change, and no state can be preserved in the kexec image. This is OK for
the original kexec implementation, because there is no jumping back.
But for kexec with jumping back, another usage model may be useful too.

- sys_kexec_load() /* with executable/data loaded */
- sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,parameters(A)) /* execute physical mode code with parameters(A) */
- sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,parameters(B)) /* execute physical mode code with parameters(B) */

This way the kexec image need not be reloaded, and the state of the
kexec image can be preserved across several invocations.
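The difference between the two usage models comes down to where per-image state lives. Below is a minimal user-space sketch, not kernel code: `sketch_load()` stands in for sys_kexec_load(), `sketch_invoke()` stands in for the sys_reboot(..., LINUX_REBOOT_CMD_KEXEC, ...) call plus the jump back, and every name in it is hypothetical.

```c
/* Hypothetical stand-in for a loaded kexec image: an entry point plus
 * private state kept in the image's own memory. */
struct sketch_image {
    unsigned long (*entry)(struct sketch_image *img, unsigned long param);
    unsigned long state; /* here: how many times the image has run */
};

/* Example "physical mode" code that remembers how often it was called. */
static unsigned long count_calls(struct sketch_image *img, unsigned long param)
{
    img->state++;
    return param + img->state;
}

/* Stand-in for sys_kexec_load(): (re)initialises the image, wiping its state. */
static void sketch_load(struct sketch_image *img)
{
    img->entry = count_calls;
    img->state = 0;
}

/* Stand-in for sys_reboot(..., LINUX_REBOOT_CMD_KEXEC, param) + jump back. */
static unsigned long sketch_invoke(struct sketch_image *img, unsigned long param)
{
    return img->entry(img, param);
}
```

Under the first model the image is reloaded before each call, so its counter restarts from zero every time; under the second it is loaded once, and the counter survives both invocations.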


Another usage model that may be useful is invoking the kexec image (such
as firmware) from kernel space.

- kmalloc the needed memory and load the firmware image (if needed)
- sys_kexec_load() with a fake image (one segment of size 0) whose
entry point is the entry point of the firmware image.
- kexec_call(fake_image, ...) /* maybe change entry point if needed */

This way, some kernel code can invoke the firmware in physical mode just
like invoking an ordinary function.
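The variadic interface Eric asks about (machine_kexec_vcall() takes argc plus a va_list) can be illustrated with a small user-space sketch: a variadic front end collects the arguments and forwards a va_list to a worker shaped like machine_kexec_vcall(). The names `sketch_kexec_call` and `sketch_vcall`, the argument limit, and the array-based entry point are assumptions for illustration only, not the patch's code.

```c
#include <stdarg.h>

#define SKETCH_MAX_ARGS 8 /* hypothetical limit */

/* Hypothetical firmware-style entry point: receives its arguments as an array. */
typedef unsigned long (*sketch_entry_t)(unsigned int argc, const unsigned long *argv);

struct sketch_fake_image {
    sketch_entry_t start; /* entry point of the real code (e.g. firmware) */
};

/* Shaped like machine_kexec_vcall(): copies argc arguments out of the
 * va_list into an array the entry point can read, then calls it. */
static int sketch_vcall(struct sketch_fake_image *image, unsigned long *ret,
                        unsigned int argc, va_list args)
{
    unsigned long argv[SKETCH_MAX_ARGS];
    unsigned int i;

    if (argc > SKETCH_MAX_ARGS)
        return -1;
    for (i = 0; i < argc; i++)
        argv[i] = va_arg(args, unsigned long);
    *ret = image->start(argc, argv);
    return 0;
}

/* Variadic front end in the spirit of kexec_call(fake_image, ...). */
static int sketch_kexec_call(struct sketch_fake_image *image, unsigned long *ret,
                             unsigned int argc, ...)
{
    va_list args;
    int err;

    va_start(args, argc);
    err = sketch_vcall(image, ret, argc, args);
    va_end(args);
    return err;
}

/* Example entry point: sums its arguments. */
static unsigned long sum_args(unsigned int argc, const unsigned long *argv)
{
    unsigned long s = 0;
    unsigned int i;

    for (i = 0; i < argc; i++)
        s += argv[i];
    return s;
}
```

Because the arguments travel through va_arg(args, unsigned long), callers must pass unsigned long values (for example `1UL`); passing plain int literals would be undefined behaviour on 64-bit targets.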

[...]
> > -   /* The segment registers are funny things, they have both a
> > -* visible and an invisible part.  Whenever the visible part is
> > -* set to a specific selector, the invisible part is loaded
> > -* with from a table in memory.  At no other time is the
> > -* descriptor table in memory accessed.
> > -*
> > -* I take advantage of this here by force loading the
> > -* segments, before I zap the gdt with an invalid value.
> > -*/
> > -   load_segments();
> > -   /* The gdt & idt are now invalid.
> > -* If you want to load them you must set up your own idt & gdt.
> > -*/
> > -   set_gdt(phys_to_virt(0),0);
> > -   set_idt(phys_to_virt(0),0);
> > +   if (image->preserve_cpu_ext) {
> > +   /* The segment registers are funny things, they have
> > +* both a visible and an invisible part.  Whenever the
> > +* visible part is set to a specific selector, the
> > +* invisible part is loaded with from a table in
> > +* memory.  At no other time is the descriptor table
> > +* in memory accessed.
> > +*
> > +* I take advantage of this here by force loading the
> > +* segments, before I zap the gdt with an invalid
> > +* value.
> > +*/
> > +   load_segments();
> > +   /* The gdt & idt are now invalid.  If you want to load
> > +* them you must set up your own idt & gdt.
> > +*/
> > +   set_gdt(phys_to_virt(0), 0);
> > +   set_idt(phys_to_virt(0), 0);
> > +   }
> 
> We can't keep the same idt and gdt as the pages they are on will be
> overwritten/reused.  So explicitly stomping on them sounds better
> so they never work.  We can restore them on kernel reentry.

The original idea about this code is:

If the kexec image declares that it does not need "extensive CPU state"
(such as FPU/MMX/GDT/LDT/IDT/CS/DS/ES/FS/GS/SS etc.) to be preserved, then
IDT/GDT/CS/DS/ES/FS/GS/SS are not touched by the kexec image code, so the
segment registers need not be set.

But this is not clear. At the very least, more description should be
provided for each preserve flag.

> > /* now call it */
> > -   relocate_kernel((unsigned long)image->head, (unsigned long)page_list,
> > -   image->start, cpu_has_pae);
> > +   relocate_kernel_ptr((unsigned long)image->head,
> > +   (unsigned long)page_list,
> > +   image->start, cpu_has_pae);
> 
> Why rename relocate_kernel?
> Ah.  I see.  You need to make it into a pointer again.  The crazy don't
> stop the pgd support strikes again.  It used to be named rnk.

Do you mean I should rename the function pointer to rnk for consistency?
I found rnk in the IA64 implementation.

Best Regards,
Huang Ying
--
To unsubscribe 

Re: RFC: outb 0x80 in inb_p, outb_p harmful on some modern AMD64 with MCP51 laptops

2007-12-10 Thread Paul Rolland
Hi,

On Tue, 11 Dec 2007 12:12:59 +1030
David Newall <[EMAIL PROTECTED]> wrote:

> H. Peter Anvin wrote:
> > David Newall wrote:
> >
> > I think a single ISA bus transaction is 1 µs, so two of them back to 
> > back should be 2 µs, not 8 µs...
> 
> Exactly.  You think it's 2us, but the documentation doesn't say.  The _p 
> functions are generic inasmuch as they provide an unspecified delay.  

Well, if the delay is that unspecified, what about _reading_ port 0x80?
Will the delay be shorter? And if so, what about reading port 0x80 and
writing the value back?
inb  $0x80, %al
outb %al, $0x80

I've been wondering since the beginning of this thread whether the problem
is just the value we write to port 0x80, rather than writing to the port at all...
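To make the two ideas concrete, here is a user-space model of the port traffic with real I/O replaced by a logging mock (all names hypothetical). My reading of the i386 headers of the time is that outb_p() does the real write and then `outb %al, $0x80`, so the dummy write to 0x80 reuses the value just sent; treat that as an assumption. The mock also assumes that reading 0x80 returns the last value written, which real hardware does not guarantee.

```c
#include <stddef.h>

/* One logged port access: 'r' for a read, 'w' for a write. */
struct io_event {
    char dir;
    unsigned short port;
    unsigned char val;
};

#define IO_LOG_MAX 16
static struct io_event io_log[IO_LOG_MAX];
static size_t io_len;
/* Fake I/O space below 0x100; a read returns the last value written. */
static unsigned char port_latch[0x100];

static void log_io(char dir, unsigned short port, unsigned char val)
{
    if (io_len < IO_LOG_MAX) {
        io_log[io_len].dir = dir;
        io_log[io_len].port = port;
        io_log[io_len].val = val;
        io_len++;
    }
}

static void mock_outb(unsigned char val, unsigned short port)
{
    port_latch[port & 0xff] = val;
    log_io('w', port, val);
}

static unsigned char mock_inb(unsigned short port)
{
    unsigned char val = port_latch[port & 0xff];

    log_io('r', port, val);
    return val;
}

/* Assumed outb_p() behaviour: real write, then the same value to port 0x80. */
static void outb_p_current(unsigned char val, unsigned short port)
{
    mock_outb(val, port);
    mock_outb(val, 0x80);
}

/* Paul's idea: read port 0x80 and write the value back unchanged. */
static void outb_p_readback(unsigned char val, unsigned short port)
{
    unsigned char post;

    mock_outb(val, port);
    post = mock_inb(0x80);
    mock_outb(post, 0x80);
}
```

The first variant leaves port 0x80 holding whatever byte was last sent to the real port; the read-back variant costs one extra bus cycle but leaves the POST port's value untouched.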

Just my 0.02 Eur...

Paul


-- 
Paul RollandE-Mail : rol(at)witbe.net
Witbe.net SATel. +33 (0)1 47 67 77 77
Les Collines de l'Arche Fax. +33 (0)1 47 67 77 99
F-92057 Paris La DefenseRIPE : PR12-RIPE

Please no HTML, I'm not a browser - Pas d'HTML, je ne suis pas un navigateur 
"Some people dream of success... while others wake up and work hard at it" 

"I worry about my child and the Internet all the time, even though she's too 
young to have logged on yet. Here's what I worry about. I worry that 10 or 15 
years from now, she will come to me and say 'Daddy, where were you when they 
took freedom of the press away from the Internet?'"
--Mike Godwin, Electronic Frontier Foundation 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC: outb 0x80 in inb_p, outb_p harmful on some modern AMD64 with MCP51 laptops

2007-12-10 Thread Rene Herman

On 11-12-07 02:25, H. Peter Anvin wrote:
> David Newall wrote:
>> Where did the 8us delay come from?  The documentation and source is 
>> careful not to say how long the delay is.  Would changing it to, say 
>> 1us, be technically wrong?  Is code that requires 8us correct?
>
> I think a single ISA bus transaction is 1 µs, so two of them back to 
> back should be 2 µs, not 8 µs...

Sigh. And now where do these _two_ transactions come from? (and yes, see 
Alan's followups, a transaction on a spec bus is 1 us).


Rene.


Re: Lnux 2.6.24-rc5

2007-12-10 Thread Dave Young
Hi Linus,

The kernel.org web download isn't available yet, is it?

Regards
dave


Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

2007-12-10 Thread Yinghai Lu
On Dec 10, 2007 8:48 PM, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
> Neil Horman <[EMAIL PROTECTED]> writes:
>
> Almost there.
>
>
>
> > On Mon, Dec 10, 2007 at 06:08:03PM -0700, Eric W. Biederman wrote:
> >> Neil Horman <[EMAIL PROTECTED]> writes:
> >>
> > 
> >>
> >> Ok.  This test is broken.  Please remove the == 1.  You are looking
> >> for == (1 << 18).  So just saying: "if (htcfg & (1 << 18))" should be 
> >> clearer.
> >>
> > Fixed.  Thanks!
> >
> >> > + printk(KERN_INFO "Detected use of extended apic ids on hypertransport
> > bus\n");
> >> > +  if ((htcfg & (1 << 17)) == 0) {
> >> > + printk(KERN_INFO "Enabling hypertransport extended apic interrupt
> >> > broadcast\n");
> >> > +  htcfg |= (1 << 17);
> >> > +  write_pci_config(num, slot, func, 0x68, htcfg);
> >> > +  }
> >> > +  }
> >> > +
> >> > +}
> >>
> >> The rest of this quirk looks fine, include the fact it is only intended
> >> to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.
> >>
> > Copy that.
> >
> >>
> >> For what is below I don't like the way the infrastructure has been
> >> extended as what you are doing quickly devolves into a big mess.
> >>
> >> Please extend struct chipset to be something like:
> >> struct chipset {
> >>  u16 vendor;
> >>  u16 device;
> >> u32 class, class_mask;
> >>  void (*f)(void);
> >> };
> >>
> >> And then the test for matching the chipset can be something like:
> >>  if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
> >>  (id->device == PCI_ANY_ID || id->device == dev->device) &&
> >>  !((id->class ^ dev->class) & id->class_mask))
> >>
> >> Essentially a subset of pci_match_one_device from drivers/pci/pci.h
> >>
> >> That way you don't need to increase the number of tables or the
> >> number of passes through the pci busses, just update the early_qrk
> >> table with a few more bits of information.
> >>
> > copy that.  Fixed.  Thanks!
> >
> >> The extended form should be much more maintainable in the long
> >> run.  Given that we may want this before we enable the timer
> >> which is very early doing this in the pci early quirks seems
> >> to make sense.
> >>
> >> Eric
> >
> >
> > New patch attached, with suggestions incorporated.
> >
> > Thanks & regards
> > Neil
> >
> > Signed-off-by: Neil Horman <[EMAIL PROTECTED]>
> >
> >
> >  early-quirks.c | 82 
> > ++---
> >  1 file changed, 73 insertions(+), 9 deletions(-)
> >
> >
> >
> > diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> > index 88bb83e..4b0cee1 100644
> > --- a/arch/x86/kernel/early-quirks.c
> > +++ b/arch/x86/kernel/early-quirks.c
> > @@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct 
> > acpi_table_header
> > *header)
> >  #endif /* CONFIG_X86_IO_APIC */
> >  #endif /* CONFIG_ACPI */
> >
> > +static void __init fix_hypertransport_config(int num, int slot, int func)
> > +{
> > + u32 htcfg;
> > + /*
> > +  *we found a hypertransport bus
> > +  *make sure that are broadcasting
> > +  *interrupts to all cpus on the ht bus
> > +  *if we're using extended apic ids
> > +  */
> > + htcfg = read_pci_config(num, slot, func, 0x68);
> > + if (htcfg & (1 << 18)) {
> > + printk(KERN_INFO "Detected use of extended apic ids on hypertransport 
> > bus\n");
> > + if ((htcfg & (1 << 17)) == 0) {
> > + printk(KERN_INFO "Enabling hypertransport extended apic interrupt
> > broadcast\n");
> > + htcfg |= (1 << 17);
> > + write_pci_config(num, slot, func, 0x68, htcfg);
> > + }
> > + }
> > +
> > +}
> > +
> > +static void __init check_hypertransport_config()
> > +{
> > + int num, slot, func;
> > + u32 device, vendor;
> > + func = 0;
> > + for (num = 0; num < 32; num++) {
> > + for (slot = 0; slot < 32; slot++) {
> > + vendor = read_pci_config(num,slot,func,
> > + PCI_VENDOR_ID);
> > + device = read_pci_config(num,slot,func,
> > + PCI_DEVICE_ID);
> > + vendor &= 0xffff;
> > + device >>= 16;
> > + if ((vendor == PCI_VENDOR_ID_AMD) &&
> > + (device == PCI_DEVICE_ID_AMD_K8_NB))
> > + fix_hypertransport_config(num,slot,func);
> > + }
> > + }
> > +
> > + return;
> > +
> > +}
>
> We should not need check_hypertransport_config as the generic loop
> now does the work for us.
> > +
> >  static void __init nvidia_bugs(void)
> >  {
> >  #ifdef CONFIG_ACPI
> > @@ -83,15 +127,25 @@ static void __init ati_bugs(void)
> >  #endif
> >  }
> >
> > +static void __init amd_host_bugs(void)
> > +{
> > + printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
> > + check_hypertransport_config();
> > +}
>
> Likewise 

Re: tipc_init(), WARNING: at arch/x86/mm/highmem_32.c:52, [2.6.24-rc4-git5: Reported regressions from 2.6.23]

2007-12-10 Thread Dave Jones
On Sat, Dec 08, 2007 at 08:52:11PM +0100, Ingo Molnar wrote:

 > so even today's upstream kernel, which has 'ancient' SLUB code, SLAB and 
 > SLUB have essentially the same linecount:
 > 
 >   $ wc -l mm/slab.c mm/slub.c
 >   4478 mm/slab.c
 >   4125 mm/slub.c
 > 
 > (and while linecount != complexity, there is a strong relationship.)
 > 
 > With SLAB having 10 years more test coverage and tuning.

FWIW, the one thing slub does that slab doesn't, and that I find really
nice, is being able to enable debugging at boot time rather than compile time.

We don't get many people running benchmarks against the Fedora kernel,
so any scalability differences between slub/slab probably won't reach us
until we start shipping betas of the next RHEL based on the same kernel.

Which leaves my only other gripe.  It broke slabtop.
There's an alternative implementation in Documentation/vm/slabinfo.c
(why there, and not in, say, util-linux, home of the current slabtop?)

Dave

-- 
http://www.codemonkey.org.uk


[PATCH 2/2 V2] Kprobes: Build kretprobe examples only if arch supports kretprobes

2007-12-10 Thread Ananth N Mavinakayanahalli
From: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>

This patch builds samples/kprobes/kretprobe_example.c only on archs that
support kretprobes. Thanks to Sam Ravnborg for Kconfig suggestions.

V2: Updated dependency on CONFIG_KRETPROBES

Signed-off-by: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
---
 samples/Kconfig  |5 +
 samples/kprobes/Makefile |4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6.24-rc4/samples/kprobes/Makefile
===
--- linux-2.6.24-rc4.orig/samples/kprobes/Makefile
+++ linux-2.6.24-rc4/samples/kprobes/Makefile
@@ -1,5 +1,5 @@
 # builds the kprobes example kernel modules;
 # then to use one (as root):  insmod 
 
-obj-$(CONFIG_SAMPLE_KPROBES) += kprobe_example.o jprobe_example.o \
-   kretprobe_example.o
+obj-$(CONFIG_SAMPLE_KPROBES) += kprobe_example.o jprobe_example.o
+obj-$(CONFIG_SAMPLE_KRETPROBES) += kretprobe_example.o
Index: linux-2.6.24-rc4/samples/Kconfig
===
--- linux-2.6.24-rc4.orig/samples/Kconfig
+++ linux-2.6.24-rc4/samples/Kconfig
@@ -28,5 +28,10 @@ config SAMPLE_KPROBES
help
  This build several kprobes example modules.
 
+config SAMPLE_KRETPROBES
+   tristate "Build kretprobes example -- loadable modules only"
+   default m
+   depends on SAMPLE_KPROBES && KRETPROBES
+
 endif # SAMPLES
 


Reducing the bdi proportion calculation period to speed up disk write

2007-12-10 Thread zhejiang
The patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f implemented per-device
(per-bdi) dirty thresholds, and it works well. However, the period used
for the proportion calculation may be too large: with 8G of memory,
calc_period_shift() returns 19 as the shift.
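The figure of 19 can be reproduced from the formula the patch removes, `2 + ilog2(dirty_total - 1)`. This sketch assumes vm_dirty_ratio = 10 (the default at the time, as far as I know) and roughly 2,000,000 dirtyable 4K pages for an 8G machine; both numbers are assumptions, and `ilog2_ul()` is a plain loop standing in for the kernel's ilog2().

```c
/* Integer log2 for n > 0, a stand-in for the kernel's ilog2(). */
static int ilog2_ul(unsigned long n)
{
    int log = -1;

    while (n) {
        n >>= 1;
        log++;
    }
    return log;
}

/* Same arithmetic as the calc_period_shift() being replaced:
 * dirty_total = dirty_ratio% of dirtyable memory,
 * shift = 2 + ilog2(dirty_total - 1). */
static int calc_period_shift_sketch(unsigned long dirty_ratio,
                                    unsigned long dirtyable_pages)
{
    unsigned long dirty_total = dirty_ratio * dirtyable_pages / 100;

    return 2 + ilog2_ul(dirty_total - 1);
}
```

With those inputs dirty_total is 200,000 pages and ilog2(199,999) is 17, so the shift is 19 (1 << 19 = 524,288), whereas the hard-coded 12 proposed below gives 1 << 12 = 4,096.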

When writes switch between different disks, there may be a performance
issue.

For example, we first write to disk A, then write to disk B.
The proportion for disk B increases slowly because the denominator
is too large (it is 2^18 + (global_count & counter_mask)).
Disk B therefore gets a small dirty-page quota for a long time, and
writers to it get blocked frequently even though the total number of
dirty pages is under the dirty page limit.

Peter provided a patch to avoid this issue; it allows the bdi limits to be
violated if there is a lot of room on the system.
It looks like:

+	if (nr_reclaimable + nr_writeback < (background_thresh + dirty_thresh) / 2)
+		break;

This patch really helps to avoid congestion, but if the dirty pages
exceed about 3/4 of dirty_thresh, congestion still happens when we
write to another disk.
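The 3/4 figure follows from Peter's check: the bdi limit may be ignored only while nr_reclaimable + nr_writeback stays below the midpoint of the two thresholds. Assuming the then-default ratios of 5% (background) and 10% (dirty), which is my assumption, background_thresh is half of dirty_thresh, so the midpoint is (0.5 + 1) / 2 = 0.75 of dirty_thresh. A minimal sketch of the predicate, with a hypothetical function name:

```c
/* Peter's escape hatch: non-zero while the system as a whole is far
 * enough below its limits that the per-bdi limit may be ignored. */
static int may_ignore_bdi_limit(unsigned long nr_reclaimable,
                                unsigned long nr_writeback,
                                unsigned long background_thresh,
                                unsigned long dirty_thresh)
{
    return nr_reclaimable + nr_writeback <
           (background_thresh + dirty_thresh) / 2;
}
```

With dirty_thresh = 1000 and background_thresh = 500 pages, the escape hatch closes at 750 dirty-plus-writeback pages, matching the 3/4 point described above.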

I think that we can reduce the period to speed up the proportion
adjustment. 

diff -Nur a/page-writeback.c b/page-writeback.c
--- a/page-writeback.c  2007-12-11 13:46:30.0 +0800
+++ b/page-writeback.c  2007-12-11 13:47:11.0 +0800
@@ -128,10 +128,7 @@
  */
 static int calc_period_shift(void)
 {
-   unsigned long dirty_total;
-
-   dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
-   return 2 + ilog2(dirty_total - 1);
+   return 12;
 }


On the 8G memory system, I did some testing with iozone.
I found that reducing the period helps increase the write speed
when switching to a new disk.


Run "./iozone -B -i 0 -i 2 -r 4k -s 1000M" twice on disk B.
Here are the results:

1. With the patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
            First    Second
write       78M      173M
rewrite     112M     203M
randread    1710M    1697M
randwrite   192M     1412M

2. With Peter's patch
            First    Second
write       134M     169M
rewrite     134M     203M
randread    1717M    1705M
randwrite   179M     1412M

3. Adjust the shift to 12
            First    Second
write       260M     259M
rewrite     240M     246M
randread    1712M    1700M
randwrite   1409M    1409M

4. With Peter's patch and adjust the shift to 12
            First    Second
write       256M     239M
rewrite     253M     253M
randread    1704M    1716M
randwrite   1414M    1416M


Run "./iozone -B -i 0 -i 2 -r 4k -s 500M" twice on disk B.

1. With the patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
            First    Second
write       821M     725M
rewrite     144M     1299M
randread    1740M    1733M
randwrite   1444M    1440M

2. With Peter's patch
            First    Second
write       1100M    1112M
rewrite     1295M    1313M
randread    1745M    1744M
randwrite   1452M    1449M

3. Adjust the shift to 12
            First    Second
write       1021M    1104M
rewrite     1314M    1311M
randread    1741M    1737M
randwrite   1448M    1445M

4. With Peter's patch and adjust the shift to 12
            First    Second
write       1104M    1105M
rewrite     1292M    1308M
randread    1737M    1741M
randwrite   1449M    1449M


[PATCH 1/2 V2] Kprobes: Indicate kretprobe support in arch//Kconfig

2007-12-10 Thread Ananth N Mavinakayanahalli
From: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>

This patch adds CONFIG_HAVE_KRETPROBES to the arch//Kconfig file
for relevant architectures with kprobes support. This facilitates easy
handling of in-kernel modules (like samples/kprobes/kretprobe_example.c)
that depend on kretprobes being present in the kernel.

This patch depends on Mathieu Desnoyers' "Instrumentation menu removal"
patchset (http://marc.info/?l=linux-kernel=119496432229633=2)

Updated to apply on 2.6.24-rc4-mm1. Thanks to Sam Ravnborg for helping
make the patch more lean.

V2: Per Mathieu's suggestion, added CONFIG_KRETPROBES and fixed up
dependencies.

Signed-off-by: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
---
 arch/Kconfig  |7 +++
 arch/ia64/Kconfig |1 +
 arch/powerpc/Kconfig  |1 +
 arch/s390/Kconfig |1 +
 arch/x86/Kconfig  |1 +
 include/asm-ia64/kprobes.h|1 -
 include/asm-powerpc/kprobes.h |1 -
 include/asm-x86/kprobes_32.h  |1 -
 include/asm-x86/kprobes_64.h  |1 -
 include/linux/kprobes.h   |6 +++---
 kernel/kprobes.c  |8 +++-
 11 files changed, 17 insertions(+), 12 deletions(-)

Index: linux-2.6.24-rc4/arch/Kconfig
===
--- linux-2.6.24-rc4.orig/arch/Kconfig
+++ linux-2.6.24-rc4/arch/Kconfig
@@ -27,5 +27,12 @@ config KPROBES
  for kernel debugging, non-intrusive instrumentation and testing.
  If in doubt, say "N".
 
+config KRETPROBES
+   def_bool y
+   depends on KPROBES && HAVE_KRETPROBES
+
 config HAVE_KPROBES
def_bool n
+
+config HAVE_KRETPROBES
+   def_bool n
Index: linux-2.6.24-rc4/arch/ia64/Kconfig
===
--- linux-2.6.24-rc4.orig/arch/ia64/Kconfig
+++ linux-2.6.24-rc4/arch/ia64/Kconfig
@@ -17,6 +17,7 @@ config IA64
select ARCH_SUPPORTS_MSI
select HAVE_OPROFILE
select HAVE_KPROBES
+   select HAVE_KRETPROBES
default y
help
  The Itanium Processor Family is Intel's 64-bit successor to
Index: linux-2.6.24-rc4/arch/powerpc/Kconfig
===
--- linux-2.6.24-rc4.orig/arch/powerpc/Kconfig
+++ linux-2.6.24-rc4/arch/powerpc/Kconfig
@@ -81,6 +81,7 @@ config PPC
default y
select HAVE_OPROFILE
select HAVE_KPROBES
+   select HAVE_KRETPROBES
 
 config EARLY_PRINTK
bool
Index: linux-2.6.24-rc4/arch/s390/Kconfig
===
--- linux-2.6.24-rc4.orig/arch/s390/Kconfig
+++ linux-2.6.24-rc4/arch/s390/Kconfig
@@ -53,6 +53,7 @@ config S390
def_bool y
select HAVE_OPROFILE
select HAVE_KPROBES
+   select HAVE_KRETPROBES
 
 source "init/Kconfig"
 
Index: linux-2.6.24-rc4/arch/x86/Kconfig
===
--- linux-2.6.24-rc4.orig/arch/x86/Kconfig
+++ linux-2.6.24-rc4/arch/x86/Kconfig
@@ -20,6 +20,7 @@ config X86
def_bool y
select HAVE_OPROFILE
select HAVE_KPROBES
+   select HAVE_KRETPROBES
 
 config GENERIC_TIME
def_bool y
Index: linux-2.6.24-rc4/include/asm-ia64/kprobes.h
===
--- linux-2.6.24-rc4.orig/include/asm-ia64/kprobes.h
+++ linux-2.6.24-rc4/include/asm-ia64/kprobes.h
@@ -82,7 +82,6 @@ struct kprobe_ctlblk {
struct prev_kprobe prev_kprobe[ARCH_PREV_KPROBE_SZ];
 };
 
-#define ARCH_SUPPORTS_KRETPROBES
 #define kretprobe_blacklist_size 0
 
 #define SLOT0_OPCODE_SHIFT (37)
Index: linux-2.6.24-rc4/include/asm-powerpc/kprobes.h
===
--- linux-2.6.24-rc4.orig/include/asm-powerpc/kprobes.h
+++ linux-2.6.24-rc4/include/asm-powerpc/kprobes.h
@@ -80,7 +80,6 @@ typedef unsigned int kprobe_opcode_t;
 #define is_trap(instr) (IS_TW(instr) || IS_TWI(instr))
 #endif
 
-#define ARCH_SUPPORTS_KRETPROBES
 #define flush_insn_slot(p) do { } while (0)
 #define kretprobe_blacklist_size 0
 
Index: linux-2.6.24-rc4/include/asm-x86/kprobes_32.h
===
--- linux-2.6.24-rc4.orig/include/asm-x86/kprobes_32.h
+++ linux-2.6.24-rc4/include/asm-x86/kprobes_32.h
@@ -42,7 +42,6 @@ typedef u8 kprobe_opcode_t;
? (MAX_STACK_SIZE) \
: (((unsigned long)current_thread_info()) + THREAD_SIZE - (ADDR)))
 
-#define ARCH_SUPPORTS_KRETPROBES
 #define flush_insn_slot(p) do { } while (0)
 
 extern const int kretprobe_blacklist_size;
Index: linux-2.6.24-rc4/include/asm-x86/kprobes_64.h
===
--- linux-2.6.24-rc4.orig/include/asm-x86/kprobes_64.h
+++ linux-2.6.24-rc4/include/asm-x86/kprobes_64.h
@@ -41,7 +41,6 @@ typedef u8 kprobe_opcode_t;
? (MAX_STACK_SIZE) \
: (((unsigned 

Re: [PATCH 1/2] Kprobes: Indicate kretprobe support in arch//Kconfig - updated

2007-12-10 Thread Ananth N Mavinakayanahalli
On Mon, Dec 10, 2007 at 10:10:01AM -0500, Mathieu Desnoyers wrote:
> * Ananth N Mavinakayanahalli ([EMAIL PROTECTED]) wrote:
> > On Mon, Dec 10, 2007 at 11:13:07AM +0100, Sam Ravnborg wrote:
> > > On Mon, Dec 10, 2007 at 03:22:22PM +0530, Ananth N Mavinakayanahalli 
> > > wrote:
> > > > From: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
> > > > 



> > Index: linux-2.6.24-rc4/include/linux/kprobes.h
> > ===
> > --- linux-2.6.24-rc4.orig/include/linux/kprobes.h
> > +++ linux-2.6.24-rc4/include/linux/kprobes.h
> > @@ -125,11 +125,11 @@ struct jprobe {
> >  DECLARE_PER_CPU(struct kprobe *, current_kprobe);
> >  DECLARE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
> >  
> > -#ifdef ARCH_SUPPORTS_KRETPROBES
> > +#ifdef CONFIG_HAVE_KRETPROBES
> 
> Hi Ananth,
> 
> I just want to point out a detail: if someone sets CONFIG_KPROBES to n,
> the CONFIG_HAVE_KPROBES is still y, and so is CONFIG_HAVE_KRETPROBES.
> However, I doubt that you want to activate this code in this case ?
> The code paths are OK because they are nested into CONFIG_KPROBES
> ifdefs (or not built due to dependency on CONFIG_KPROBES in the
> Makefile), but if one wants to use CONFIG_HAVE_KRETPROBE for something
> else (Makefile), then it could become a problem.
> 
> Could we add a menu entry CONFIG_KRETPROBES that depends on
> CONFIG_HAVE_KRETPROBES and CONFIG_KPROBES, and also remove the
> CONFIG_HAVE_KPROBES dependency for the CONFIG_HAVE_KRETPROBE option ?
> This way, we would have much more flexibility (like specifying if we
> want CONFIG_KRETPROBES to be default y or default n...)
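Mathieu's concern can be illustrated with the C preprocessor alone. In this sketch the CONFIG_ macros are defined by hand to simulate a configuration where the architecture selects HAVE_KRETPROBES but KPROBES is off, and CONFIG_KRETPROBES is derived the way the proposed `depends on KPROBES && HAVE_KRETPROBES` rule would derive it. In a real build these symbols come from the generated autoconf header, not from #defines in a source file.

```c
/* Simulated .config: the arch can do kretprobes, but KPROBES=n. */
#define CONFIG_HAVE_KRETPROBES 1
/* #define CONFIG_KPROBES 1 */

/* Mirrors: config KRETPROBES / def_bool y / depends on KPROBES && HAVE_KRETPROBES */
#if defined(CONFIG_KPROBES) && defined(CONFIG_HAVE_KRETPROBES)
#define CONFIG_KRETPROBES 1
#endif

/* Guarding on the capability symbol still builds the code... */
static int built_with_have_guard(void)
{
#ifdef CONFIG_HAVE_KRETPROBES
    return 1;
#else
    return 0;
#endif
}

/* ...while guarding on the derived symbol correctly leaves it out. */
static int built_with_kretprobes_guard(void)
{
#ifdef CONFIG_KRETPROBES
    return 1;
#else
    return 0;
#endif
}
```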

Done... Updated patch coming up.

Ananth


Re: RFC: outb 0x80 in inb_p, outb_p harmful on some modern AMD64 with MCP51 laptops

2007-12-10 Thread H. Peter Anvin

Alan Cox wrote:
>> In any case, my machine does not have an ISA bus.  Why should it?  It's 
>> a laptop!
>
> Yes it does. The branding spec said "No ISA bus" so it was renamed "LPC"
> and hidden internally, but it's alive and well.

Well that, plus it was serialized and uses PCI electricals and timing, 
hence the LPC (Low Pin Count) moniker.  Its performance is pretty much 
exactly ISA, though, and unlike PCI it provides full support for all 
legacy ISA features like slave DMA.


-hpa


Re: RFC: outb 0x80 in inb_p, outb_p harmful on some modern AMD64 with MCP51 laptops

2007-12-10 Thread H. Peter Anvin

Andi Kleen wrote:
>> My machine in question, for example, needs no waiting within CMOS_READs 
>> at all.   And I doubt any other chip/device needs waiting that isn't 
>
> I don't know about CMOS, but there were definitely some not too ancient
> systems (let's say not more than 10 years) who required IO delays in the
> floppy driver and the 8253/8259. But on those the jumps are already
> far too fast.

Yes, early Linux used jumps.  I believe it broke a bunch of machines 
when the P5 came out, as the jumps were too fast.  (I have to admit to 
being a bit fuzzy on this... my memory says it was the 486 and not the 
P5, but that clearly can't be the case since my first Linux box was a 
486/33.)


-hpa


Lnux 2.6.24-rc5

2007-12-10 Thread Linus Torvalds

It's been a week, and I promised to be a good boy and try to follow my 
release rules, so here is the next -rc.

Things _have_ slowed down, although I'd obviously be lying if I said we've 
got all the regressions handled and under control. They are being worked 
on, and the list is shrinking, but at a guess, we're definitely not going 
to have a final 2.6.24 out before xmas unless santa puts some more elves 
to work on those regressions.

So any elves out there - please keep working.

I'm including the shortlog since it's small enough, and quite frankly, 
gives about as readable explanation of the changes as can be imagined. 
Nothing hugely exciting here.

I'd post the diffstat too, but it's not really all that interesting, and 
it only highlights a textually big PA-RISC revert, and the powerpc 
defconfig updates. And the Blackfin SPI driver. The rest is largely random 
noise in various subsystems (drivers/net, xfs filesystem, and arch updates 
are some of the areas that show more changes).

Linus

---
Adam Litke (1):
  hugetlb: handle write-protection faults in follow_hugetlb_page

Adrian Bunk (3):
  x86: revert CONFIG_X86_HT semantics change
  x86: free_cache_attributes() section fix
  MAINTAINERS: remove the MTRR entry

Al Viro (5):
  regression: cifs endianness bug
  no need to mess with KBUILD_CFLAGS on uml-i386 anymore
  fcrypt endianness misannotations
  regression: bfs endianness bug
  remove nonsense force-casts from ocfs2

Alexey Dobriyan (1):
  proc: fix proc_dir_entry refcounting

Andrew Gallatin (1):
  [LRO]: fix lro_gen_skb() alignment

Andrew Morton (7):
  x86: arch_register_cpu() section fix
  [BRIDGE]: Section fix.
  [IA64] increase .data.patch offset
  [IA64] don't assume that unwcheck.py is executable
  [IA64] export copy_page() to modules
  aoe: properly initialise the request_queue's backing_dev_info
  revert "dpt_i2o: convert to SCSI hotplug model"

Anton Vorontsov (1):
  PHY: Add the phy_device_release device method.

Atsushi Nemoto (1):
  qemu: do not enable IP7 blindly

Auke Kok (1):
  e100: cleanup unneeded math

Bartlomiej Zolnierkiewicz (1):
  pata_amd/pata_via: de-couple programming of PIO/MWDMA and UDMA timings

Ben Gardner (1):
  gpio_cs5535: disable AUX on output

Benjamin Herrenschmidt (6):
  ibm_newemac: Fix ZMII refcounting bug
  ibm_newemac: Workaround reset timeout when no link
  ibm_newemac: Cleanup/Fix RGMII MDIO support detection
  ibm_newemac: Cleanup/fix support for STACR register variants
  ibm_newemac: Update file headers copyright notices
  powerpc: Fix IDE legacy vs. native fixups

Bernhard Walle (1):
  [IA64] rename _bss to __bss_start

Bryan Wu (11):
  spi: initial BF54x SPI support
  spi: spi_bfin cleanups, error handling
  spi: spi_bfin handles spi_transfer.cs_change
  spi: spi_bfin uses platform device resources
  spi: spi_bfin: handle multiple spi_masters
  spi: spi_bfin: bugfix for 8..16 bit word sizes
  spi: spi_bfin: update handling of delay-after-deselect
  Blackfin SPI driver: use cpu_relax() to replace continue in while busywait
  Blackfin SPI driver: use void __iomem * for regs_base
  Blackfin SPI driver: move hard coded pin_req to board file
  Blackfin SPI driver: reconfigure speed_hz and bits_per_word in each spi transfer

Chris Dearman (1):
  [MIPS] Don't byteswap writes to display when running bigendian

Christian Borntraeger (2):
  [S390] dcssblk: prevent early access without own make_request function
  [S390] Fix compile error on 31bit without preemption

Christoph Hellwig (1):
  [XFS] revert to double-buffering readdir

Cornelia Huck (1):
  [S390] cio: Issue SenseID per path.

Cyrill Gorcunov (1):
  [SPARC64]: check for possible NULL pointer dereference

David Brownell (2):
  SPI: use mutex not semaphore
  spi: at25 driver is for EEPROM not FLASH

David Chinner (2):
  [XFS] Fix broken inode cluster setup.
  [XFS] Fix xfs_ichgtime()s broken usage of I_SYNC

David Howells (1):
  [AF_RXRPC]: Add a missing goto

David S. Miller (4):
  [SPARC64]: Missing mdesc_release() in ldc_init().
  [SYSCTL_CHECK]: Fix typo in KERN_SPARC_SCONS_PWROFF entry string.
  [SPARC64]: Update defconfig.
  [SPARC64]: Fix memory controller register access when non-SMP.

David Sterba (1):
  bonding: Fix time comparison

David Woodhouse (1):
  Don't claim to do IPv6 checksum offload

Denis Cheng (1):
  mm/backing-dev.c: fix percpu_counter_destroy call bug in bdi_init

Denis V. Lunev (1):
  [IPV4]: Remove prototype of ip_rt_advice

Divy Le Ray (2):
  cxgb - revert file mode changes.
  cxgb3 - T3C support update

Don Zickus (1):
  x86: add the word 'WARNING' in check_nmi_watchdog() output

Donald Douwsma (1):
  [XFS] Fix dbflush panic in xfs_qm_sync.

Eliezer Tamir (1):
  make bnx2x select 

Re: [PATCH][for -mm] fix accounting in vmscan.c for memory controller

2007-12-10 Thread Balbir Singh
KAMEZAWA Hiroyuki wrote:
> On Tue, 11 Dec 2007 10:44:36 +0530
> Balbir Singh <[EMAIL PROTECTED]> wrote:
> 
>> Looks good to me.
>>
>> Acked-by: Balbir Singh <[EMAIL PROTECTED]>
>>
>> TODO:
>>
>> 1. Should we have vm_events for the memory controller as well?
>>   Maybe in the longer term
>>
> 
> ALLOC_STALL is recorded as failcnt, I think.
> I think DIRECT can be accounted easily.

Thanks for clarifying

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH][for -mm] fix accounting in vmscan.c for memory controller

2007-12-10 Thread KAMEZAWA Hiroyuki
On Tue, 11 Dec 2007 10:44:36 +0530
Balbir Singh <[EMAIL PROTECTED]> wrote:

> Looks good to me.
> 
> Acked-by: Balbir Singh <[EMAIL PROTECTED]>
> 
> TODO:
> 
> 1. Should we have vm_events for the memory controller as well?
>    Maybe in the longer term
> 

ALLOC_STALL is recorded as failcnt, I think.
I think DIRECT can be accounted easily.

But I'm not in much of a hurry, because all reclamation is DIRECT now.
After we implement background reclaim, we should consider it.
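
A minimal userspace sketch of the accounting rule discussed in this thread, assuming scan_global_lru() means "no target memory cgroup" as in the -mm tree. struct scan_control, the event enum and account_direct_reclaim() are simplified, hypothetical stand-ins rather than the real mm/vmscan.c code; the patch counts ALLOCSTALL and PGSCAN_DIRECT in two different functions, merged here for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the two vm event counters touched by the patch. */
enum vm_event { PGSCAN_DIRECT, ALLOCSTALL, NR_VM_EVENTS };
static unsigned long vm_events[NR_VM_EVENTS];

struct mem_cgroup;			/* opaque, as in the kernel */

/* Only the field this sketch needs from struct scan_control. */
struct scan_control {
	struct mem_cgroup *mem_cgroup;	/* NULL means global reclaim */
};

/* Assumed equivalent of the -mm scan_global_lru(sc) macro. */
static bool scan_global_lru(const struct scan_control *sc)
{
	return sc->mem_cgroup == NULL;
}

/* Count a direct-reclaim pass only when it was caused by global shortage. */
static void account_direct_reclaim(const struct scan_control *sc,
				   unsigned long nr_scan)
{
	assert(sc != NULL);
	if (!scan_global_lru(sc))
		return;			/* cgroup hit its limit: not counted */
	vm_events[ALLOCSTALL]++;
	vm_events[PGSCAN_DIRECT] += nr_scan;
}
```

Reclaim triggered by a cgroup reaching its limit leaves both counters untouched, so they keep reflecting global memory shortage only.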

Thanks,
-Kame



Re: [PATCH][for -mm] fix accounting in vmscan.c for memory controller

2007-12-10 Thread Balbir Singh
KAMEZAWA Hiroyuki wrote:
> Without this, ALLOCSTALL and PGSCAN_DIRECT increase too much even when
> there is no memory shortage.
> 
> against 2.6.24-rc4-mm1.
> 
> -Kame
> 
> ==
> Some amount of accounting is done while page reclaiming.
> 
> Now, there are 2 types of page reclaim (if memory controller is used)
>   - global: shortage of (global) pages.
>   - under cgroup: use up to limit.
> 
> I think the two counters, ALLOCSTALL and DIRECT, should be accounted only under
> global LRU scan. They are meant to count memory shortage at alloc_pages().
> 
> Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>
> 
>  mm/vmscan.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.24-rc4-mm1/mm/vmscan.c
> ===
> --- linux-2.6.24-rc4-mm1.orig/mm/vmscan.c
> +++ linux-2.6.24-rc4-mm1/mm/vmscan.c
> @@ -896,8 +896,9 @@ static unsigned long shrink_inactive_lis
>   if (current_is_kswapd()) {
>   __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
>   __count_vm_events(KSWAPD_STEAL, nr_freed);
> - } else
> + } else if (scan_global_lru(sc))
>   __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scan);
> +
>   __count_zone_vm_events(PGSTEAL, zone, nr_freed);
> 
>   if (nr_taken == 0)
> @@ -1333,7 +1334,8 @@ static unsigned long do_try_to_free_page
>   unsigned long lru_pages = 0;
>   int i;
> 
> - count_vm_event(ALLOCSTALL);
> + if (scan_global_lru(sc))
> + count_vm_event(ALLOCSTALL);
>   /*
>* mem_cgroup will not do shrink_slab.
>*/
> 

Looks good to me.

Acked-by: Balbir Singh <[EMAIL PROTECTED]>

TODO:

1. Should we have vm_events for the memory controller as well?
   Maybe in the longer term

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [DOC][for -mm] update Documentation/controller/memory.txt

2007-12-10 Thread Balbir Singh
KAMEZAWA Hiroyuki wrote:
> Balbir-san, could you review this update ?
> 
> --
> Documentation updates for memory controller.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>
> 
> Index: linux-2.6.24-rc4-mm1/Documentation/controllers/memory.txt
> ===
> --- linux-2.6.24-rc4-mm1.orig/Documentation/controllers/memory.txt
> +++ linux-2.6.24-rc4-mm1/Documentation/controllers/memory.txt
> @@ -9,8 +9,7 @@ d. Provides a double LRU: global memory 
> global LRU; a cgroup on hitting a limit, reclaims from the per
> cgroup LRU
> 
> -NOTE: Page Cache (unmapped) also includes Swap Cache pages as a subset
> -and will not be referred to explicitly in the rest of the documentation.
> +NOTE: Swap Cache (unmapped) is not accounted now.
> 
>  Benefits and Purpose of the memory controller
> 
> @@ -144,7 +143,7 @@ list.
>  The memory controller uses the following hierarchy
> 
>  1. zone->lru_lock is used for selecting pages to be isolated
> -2. mem->lru_lock protects the per cgroup LRU
> +2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone)
>  3. lock_page_cgroup() is used to protect page->page_cgroup
> 
>  3. User Interface
> @@ -193,6 +192,15 @@ this file after a write to guarantee the
>  The memory.failcnt field gives the number of times that the cgroup limit was
>  exceeded.
> 
> +The memory.stat file gives accounting information. Now, the number of
> +caches, RSS and Active pages/Inactive pages are shown.
> +
> +The memory.force_empty gives an interface to drop *all* charges by force.
> +
> +# echo -n 1 > memory.force_empty
> +
> +will drop all charges in cgroup. Currently, this is maintained for test.
> +
>  4. Testing
> 
>  Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
> @@ -222,11 +230,8 @@ reclaimed.
> 
>  A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
>  cgroup might have some charge associated with it, even though all
> -tasks have migrated away from it. If some pages are still left, after following
> -the steps listed in sections 4.1 and 4.2, check the Swap Cache usage in
> -/proc/meminfo to see if the Swap Cache usage is showing up in the
> -cgroups memory.usage_in_bytes counter. A simple test of swapoff -a and
> -swapon -a should free any pending Swap Cache usage.
> +tasks have migrated away from it. Such charges are automatically dropped at
> +rmdir() if there are no tasks.
> 
>  4.4 Choosing what to account  -- Page Cache (unmapped) vs RSS (mapped)?
> 
> @@ -238,15 +243,11 @@ echo -n 1 > memory.control_type
>  5. TODO
> 
>  1. Add support for accounting huge pages (as a separate controller)
> -2. Improve the user interface to accept/display memory limits in KB or MB
> -   rather than pages (since page sizes can differ across platforms/machines).
> -3. Make cgroup lists per-zone
> -4. Make per-cgroup scanner reclaim not-shared pages first
> -5. Teach controller to account for shared-pages
> -6. Start reclamation when the limit is lowered
> -7. Start reclamation in the background when the limit is
> +2. Make per-cgroup scanner reclaim not-shared pages first
> +3. Teach controller to account for shared-pages
> +4. Start reclamation when the limit is lowered
> +5. Start reclamation in the background when the limit is
> not yet hit but the usage is getting closer
> -8. Create per zone LRU lists per cgroup
> 

Looks very good to me!

Reviewed-by: Balbir Singh <[EMAIL PROTECTED]>

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[PATCH] RCU : move three variables to __read_mostly to save space

2007-12-10 Thread Eric Dumazet

I noticed this vmlinux layout on i686 (where CONFIG_X86_L1_CACHE_SHIFT = 7) :

c06cdab4 d pid_caches_lh
c06cdb00 d qlowmark
c06cdb04 d qhimark
c06cdb08 d blimit
c06cdb80 d rcu_ctrlblk
c06cdc80 d rcu_bh_ctrlblk

This means that qlowmark, qhimark and blimit occupy a whole 128-byte cache line.
The linker is not smart enough to fill the rest of that line for us.

Moving these three variables to read_mostly section saves 116 (128-12) bytes.
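
The 116-byte figure can be checked against the nm listing above: the three ints start at c06cdb00 and end at c06cdb0c, and rcu_ctrlblk sits at the next 128-byte boundary, c06cdb80. A small sketch of that arithmetic; align_up() and tail_padding() are illustrative helpers, not kernel functions.

```c
#include <assert.h>

/* Round addr up to a multiple of line; line must be a power of two. */
static unsigned long align_up(unsigned long addr, unsigned long line)
{
	assert(line != 0 && (line & (line - 1)) == 0);
	return (addr + line - 1) & ~(line - 1);
}

/* Bytes left unused between the end of an object and the next line boundary. */
static unsigned long tail_padding(unsigned long end, unsigned long line)
{
	return align_up(end, line) - end;
}
```

With end = 0xc06cdb00 + 3 * sizeof(int) and a 128-byte line (CONFIG_X86_L1_CACHE_SHIFT = 7), align_up() yields 0xc06cdb80 and tail_padding() yields 116.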

# size vmlinux vmlinux.before_patch
   text    data     bss     dec     hex filename
6343966  490818  630784 7465568  71ea60 vmlinux
6343966  490930  630784 7465680  71ead0 vmlinux.before_patch

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index a66d4d1..11c815c 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -75,9 +75,9 @@ DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
 
 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int blimit = 10;
-static int qhimark = 10000;
-static int qlowmark = 100;
+static int blimit __read_mostly = 10;
+static int qhimark __read_mostly = 10000;
+static int qlowmark __read_mostly = 100;
 
 static atomic_t rcu_barrier_cpu_count;
 static DEFINE_MUTEX(rcu_barrier_mutex);
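
For readers outside the kernel: __read_mostly groups rarely written data into its own section, so that it does not share cache lines with frequently written data. The sketch below approximates it in userspace with a GCC section attribute. The macro name is hypothetical, the section name is assumed to match what x86 used at the time, the defaults are those of kernel/rcupdate.c (blimit 10, qhimark 10000, qlowmark 100), and an ELF target such as Linux is assumed.

```c
#include <assert.h>

/*
 * Hypothetical userspace stand-in for __read_mostly. Assumes GCC or Clang
 * on an ELF target, where an arbitrary named section is accepted.
 */
#define sketch_read_mostly __attribute__((__section__(".data.read_mostly")))

/* Defaults as in kernel/rcupdate.c; normally only set at boot. */
static int blimit sketch_read_mostly = 10;
static int qhimark sketch_read_mostly = 10000;
static int qlowmark sketch_read_mostly = 100;

/* Sanity checks on the tunables: the high mark must exceed the low mark. */
static void check_rcu_tunables(void)
{
	assert(blimit > 0);
	assert(qlowmark < qhimark);
}
```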


Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

2007-12-10 Thread Eric W. Biederman
Neil Horman <[EMAIL PROTECTED]> writes:

Almost there.


> On Mon, Dec 10, 2007 at 06:08:03PM -0700, Eric W. Biederman wrote:
>> Neil Horman <[EMAIL PROTECTED]> writes:
>> 
> 
>> 
>> Ok.  This test is broken.  Please remove the == 1.  You are looking
>> for == (1 << 18).  So just saying: "if (htcfg & (1 << 18))" should be 
>> clearer.
>> 
> Fixed.  Thanks!
>
>> > + printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
>> > +  if ((htcfg & (1 << 17)) == 0) {
>> > + printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
>> > +  htcfg |= (1 << 17);
>> > +  write_pci_config(num, slot, func, 0x68, htcfg);
>> > +  }
>> > +  }
>> > +  
>> > +}
>> 
>> The rest of this quirk looks fine, include the fact it is only intended
>> to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.
>> 
> Copy that.
>
>> 
>> For what is below I don't like the way the infrastructure has been
>> extended as what you are doing quickly devolves into a big mess.
>> 
>> Please extend struct chipset to be something like:
>> struct chipset {
>>  u16 vendor;
>>  u16 device;
>> u32 class, class_mask;
>>  void (*f)(void);
>> };
>> 
>> And then the test for matching the chipset can be something like:
>>  if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
>>  (id->device == PCI_ANY_ID || id->device == dev->device) &&
>>  !((id->class ^ dev->class) & id->class_mask))
>> 
>> Essentially a subset of pci_match_one_device from drivers/pci/pci.h
>> 
>> That way you don't need to increase the number of tables or the
>> number of passes through the pci busses, just update the early_qrk
>> table with a few more bits of information.
>> 
> copy that.  Fixed.  Thanks!
>
>> The extended form should be much more maintainable in the long
>> run.  Given that we may want this before we enable the timer
>> which is very early doing this in the pci early quirks seems
>> to make sense.
>> 
>> Eric
>
>
> New patch attached, with suggestions incorporated.
>
> Thanks & regards
> Neil
>
> Signed-off-by: Neil Horman <[EMAIL PROTECTED]>
>
>
>  early-quirks.c | 82 ++---
>  1 file changed, 73 insertions(+), 9 deletions(-)
>
>
>
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 88bb83e..4b0cee1 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
>  #endif /* CONFIG_X86_IO_APIC */
>  #endif /* CONFIG_ACPI */
>  
> +static void __init fix_hypertransport_config(int num, int slot, int func)
> +{
> + u32 htcfg;
> + /*
> +  *we found a hypertransport bus
> +  *make sure that are broadcasting
> +  *interrupts to all cpus on the ht bus
> +  *if we're using extended apic ids
> +  */
> + htcfg = read_pci_config(num, slot, func, 0x68);
> + if (htcfg & (1 << 18)) {
> + printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
> + if ((htcfg & (1 << 17)) == 0) {
> + printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
> + htcfg |= (1 << 17);
> + write_pci_config(num, slot, func, 0x68, htcfg);
> + }
> + }
> + 
> +}
> +
> +static void __init check_hypertransport_config()
> +{
> + int num, slot, func;
> + u32 device, vendor;
> + func = 0;
> + for (num = 0; num < 32; num++) {
> + for (slot = 0; slot < 32; slot++) {
> + vendor = read_pci_config(num,slot,func,
> + PCI_VENDOR_ID); 
> + device = read_pci_config(num,slot,func,
> + PCI_DEVICE_ID);
> + vendor &= 0xffff;
> + device >>= 16;
> + if ((vendor == PCI_VENDOR_ID_AMD) &&
> + (device == PCI_DEVICE_ID_AMD_K8_NB))
> + fix_hypertransport_config(num,slot,func);
> + }
> + }
> +
> + return;
> +
> +}

We should not need check_hypertransport_config as the generic loop
now does the work for us.
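
A sketch of the single generic loop described above, under two assumptions: the match fields are widened to 32 bits, and the callback takes (num, slot, func) so that a quirk such as fix_hypertransport_config() could sit directly in the table. SKETCH_ANY_ID, run_early_quirks() and count_quirk() are hypothetical. The 32-bit width matters: if a u16 field holding 0xffff were compared with PCI_ANY_ID, which is (~0), integer promotion would make the comparison false.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wildcard; all four match fields are 32 bits wide. */
#define SKETCH_ANY_ID 0xffffffffu

struct chipset {
	uint32_t vendor;
	uint32_t device;
	uint32_t class;
	uint32_t class_mask;
	void (*f)(int num, int slot, int func);
};

/* Subset of pci_match_one_device(): wildcard vendor/device, masked class. */
static int chipset_matches(const struct chipset *id, uint32_t vendor,
			   uint32_t device, uint32_t class)
{
	return (id->vendor == SKETCH_ANY_ID || id->vendor == vendor) &&
	       (id->device == SKETCH_ANY_ID || id->device == device) &&
	       !((id->class ^ class) & id->class_mask);
}

/* Called once per function found by the early bus scan; returns quirks run. */
static int run_early_quirks(const struct chipset *table, size_t n,
			    int num, int slot, int func,
			    uint32_t vendor, uint32_t device, uint32_t class)
{
	int hits = 0;

	for (size_t i = 0; i < n; i++) {
		if (chipset_matches(&table[i], vendor, device, class)) {
			table[i].f(num, slot, func);
			hits++;
		}
	}
	return hits;
}

/* Hypothetical quirk used for demonstration: just counts invocations. */
static int quirk_calls;
static void count_quirk(int num, int slot, int func)
{
	(void)num; (void)slot; (void)func;
	quirk_calls++;
}
```

In this table form, an AMD K8 northbridge entry carries the device ID and points straight at its fix-up, so no separate bus walk is needed.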
> +
>  static void __init nvidia_bugs(void)
>  {
>  #ifdef CONFIG_ACPI
> @@ -83,15 +127,25 @@ static void __init ati_bugs(void)
>  #endif
>  }
>  
> +static void __init amd_host_bugs(void)
> +{
> + printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
> + check_hypertransport_config();
> +}

Likewise, this function is unneeded, and the printk is likely to confuse
users.

>  struct chipset {
>   u16 vendor;
> + u16 device;
> + u32 class;
> + u32 class_mask;
>   void (*f)(void);
>  };
>  
>  static struct chipset early_qrk[] __initdata = {
> - { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
> - { PCI_VENDOR_ID_VIA, 

[PATCH 6/6] pcmcia/pcnet_cs: Fix 'shadow variable' warning

2007-12-10 Thread Richard Knutsson
Fixing:
  CHECK   drivers/net/pcmcia/pcnet_cs.c
drivers/net/pcmcia/pcnet_cs.c:523:15: warning: symbol 'hw_info' shadows an earlier one
drivers/net/pcmcia/pcnet_cs.c:148:18: originally declared here

Signed-off-by: Richard Knutsson <[EMAIL PROTECTED]>
---


diff --git a/drivers/net/pcmcia/pcnet_cs.c b/drivers/net/pcmcia/pcnet_cs.c
index db6a97d..5779344 100644
--- a/drivers/net/pcmcia/pcnet_cs.c
+++ b/drivers/net/pcmcia/pcnet_cs.c
@@ -520,7 +520,7 @@ static int pcnet_config(struct pcmcia_device *link)
 int i, last_ret, last_fn, start_pg, stop_pg, cm_offset;
 int has_shmem = 0;
 u_short buf[64];
-hw_info_t *hw_info;
+hw_info_t *local_hw_info;
 DECLARE_MAC_BUF(mac);
 
 DEBUG(0, "pcnet_config(0x%p)\n", link);
@@ -589,23 +589,23 @@ static int pcnet_config(struct pcmcia_device *link)
dev->if_port = 0;
 }
 
-hw_info = get_hwinfo(link);
-if (hw_info == NULL)
-   hw_info = get_prom(link);
-if (hw_info == NULL)
-   hw_info = get_dl10019(link);
-if (hw_info == NULL)
-   hw_info = get_ax88190(link);
-if (hw_info == NULL)
-   hw_info = get_hwired(link);
-
-if (hw_info == NULL) {
+local_hw_info = get_hwinfo(link);
+if (local_hw_info == NULL)
+   local_hw_info = get_prom(link);
+if (local_hw_info == NULL)
+   local_hw_info = get_dl10019(link);
+if (local_hw_info == NULL)
+   local_hw_info = get_ax88190(link);
+if (local_hw_info == NULL)
+   local_hw_info = get_hwired(link);
+
+if (local_hw_info == NULL) {
printk(KERN_NOTICE "pcnet_cs: unable to read hardware net"
   " address for io base %#3lx\n", dev->base_addr);
goto failed;
 }
 
-info->flags = hw_info->flags;
+info->flags = local_hw_info->flags;
 /* Check for user overrides */
 info->flags |= (delay_output) ? DELAY_OUTPUT : 0;
 if ((link->manf_id == MANFID_SOCKET) &&


[PATCH 4/6] pcmcia/axnet_cs: Make use of 'max()' instead of handcrafted one

2007-12-10 Thread Richard Knutsson
Use 'max(x,y)' instead of 'x < y ? y : x'.
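
The kernel's max() of this era is a GNU C statement expression that also type-checks its arguments. A userspace sketch modeled on it; sketch_max() and tx_send_length() are illustrative names, and ETH_ZLEN is 60 as in <linux/if_ether.h>.

```c
#include <assert.h>

#define ETH_ZLEN 60	/* minimum frame length without FCS, as in <linux/if_ether.h> */

/*
 * Modeled on the kernel's max(): a GNU C statement expression; comparing
 * the addresses of the two temporaries makes the compiler warn when x and
 * y have different types.
 */
#define sketch_max(x, y) ({			\
	__typeof__(x) _x = (x);			\
	__typeof__(y) _y = (y);			\
	(void)(&_x == &_y);			\
	_x > _y ? _x : _y; })

/* Length handed to the hardware: short frames are padded up to ETH_ZLEN. */
static int tx_send_length(int length)
{
	assert(length >= 0);
	return sketch_max(length, ETH_ZLEN);
}
```

Because the macro compares &_x with &_y, mixing an unsigned length with the int ETH_ZLEN would draw a compiler warning, which the open-coded ternary silently accepted.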

Signed-off-by: Richard Knutsson <[EMAIL PROTECTED]>
---


diff --git a/drivers/net/pcmcia/axnet_cs.c b/drivers/net/pcmcia/axnet_cs.c
index 8d910a3..96931cc 100644
--- a/drivers/net/pcmcia/axnet_cs.c
+++ b/drivers/net/pcmcia/axnet_cs.c
@@ -1091,8 +1091,8 @@ static int ei_start_xmit(struct sk_buff *skb, struct net_device *dev)

ei_local->irqlock = 1;
 
-   send_length = ETH_ZLEN < length ? length : ETH_ZLEN;
-   
+   send_length = max(length, ETH_ZLEN);
+
/*
 * We have two Tx slots available for use. Find the first free
 * slot, and then perform some sanity checks. With two Tx bufs,


[PATCH 5/6] pcmcia/fmvj18x_cs: Fix 'shadow variable' warning

2007-12-10 Thread Richard Knutsson
Fixing:
  CHECK   drivers/net/pcmcia/fmvj18x_cs.c
drivers/net/pcmcia/fmvj18x_cs.c:1205:6: warning: symbol 'i' shadows an earlier one
drivers/net/pcmcia/fmvj18x_cs.c:1179:9: originally declared here

Signed-off-by: Richard Knutsson <[EMAIL PROTECTED]>
---


diff --git a/drivers/net/pcmcia/fmvj18x_cs.c b/drivers/net/pcmcia/fmvj18x_cs.c
index 8c719b4..4f604ae 100644
--- a/drivers/net/pcmcia/fmvj18x_cs.c
+++ b/drivers/net/pcmcia/fmvj18x_cs.c
@@ -1202,8 +1202,7 @@ static void set_rx_mode(struct net_device *dev)
outb(1, ioaddr + RX_MODE);  /* Ignore almost all multicasts. */
 } else {
struct dev_mc_list *mclist;
-   int i;
-   
+
memset(mc_filter, 0, sizeof(mc_filter));
for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
 i++, mclist = mclist->next) {


[PATCH 2/6] pcmcia/3c574_cs: Fix 'shadow variable' warning

2007-12-10 Thread Richard Knutsson
Fixing:
  CHECK   drivers/net/pcmcia/3c574_cs.c
drivers/net/pcmcia/3c574_cs.c:695:7: warning: symbol 'i' shadows an earlier one
drivers/net/pcmcia/3c574_cs.c:636:6: originally declared here
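
What sparse is complaining about, reduced to a self-contained sketch (fake_mdio_read() and reset_sketch() are hypothetical, not driver code): an inner declaration hides the function-scope variable of the same name, and the fix simply reuses the outer one. That is only safe when nothing later in the function still expects the outer variable's earlier value, which holds in this sketch.

```c
#include <assert.h>

/* Hypothetical register read standing in for mdio_read(); returns 0x05 for reg 16. */
static int fake_mdio_read(int reg)
{
	return reg == 16 ? 0x05 : 0;
}

static int reset_sketch(int auto_polarity)
{
	int i, sum = 0;

	for (i = 0; i < 4; i++)		/* earlier use of the outer i */
		sum += i;

	if (!auto_polarity) {
		/* was "int i = ...", which shadowed the outer i */
		i = fake_mdio_read(16) | 0x20;
		sum += i;
	}
	return sum;
}
```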

Signed-off-by: Richard Knutsson <[EMAIL PROTECTED]>
---


diff --git a/drivers/net/pcmcia/3c574_cs.c b/drivers/net/pcmcia/3c574_cs.c
index ad134a6..97b6daa 100644
--- a/drivers/net/pcmcia/3c574_cs.c
+++ b/drivers/net/pcmcia/3c574_cs.c
@@ -692,7 +692,7 @@ static void tc574_reset(struct net_device *dev)
mdio_write(ioaddr, lp->phys, 4, lp->advertising);
if (!auto_polarity) {
/* works for TDK 78Q2120 series MII's */
-   int i = mdio_read(ioaddr, lp->phys, 16) | 0x20;
+   i = mdio_read(ioaddr, lp->phys, 16) | 0x20;
mdio_write(ioaddr, lp->phys, 16, i);
}
 


[PATCH 1/6] pcmcia/3c574_cs: Fix dubious bitfield warning

2007-12-10 Thread Richard Knutsson
Fixing:
  CHECK   drivers/net/pcmcia/3c574_cs.c
drivers/net/pcmcia/3c574_cs.c:194:13: warning: dubious bitfield without explicit `signed' or `unsigned'
drivers/net/pcmcia/3c574_cs.c:196:14: warning: dubious bitfield without explicit `signed' or `unsigned'

Signed-off-by: Richard Knutsson <[EMAIL PROTECTED]>
---
Is there a reason for not doing it this way?
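
A sketch of the proposed layout, using uint8_t for u8. Bit-field allocation order is implementation-defined, so the bit positions in the comments and the checks assume GCC on a little-endian target such as x86; wn3_xcvr() is a hypothetical helper that decodes the transceiver field from a raw register value.

```c
#include <assert.h>
#include <stdint.h>

/* Layout from the patch; bit positions assume GCC, little-endian. */
union wn3_config {
	int i;
	struct w3_config_fields {
		uint8_t ram_size:3, ram_width:1, ram_speed:2, rom_size:2; /* bits 0-7   */
		uint8_t pad8;                                             /* bits 8-15  */
		uint8_t ram_split:2, pad18:2, xcvr:3, pad21:1;            /* bits 16-23 */
		uint8_t autoselect:1, pad24:7;                            /* bits 24-31 */
	} u;
};

/* Decode the transceiver selection from a raw Window 3 config word. */
static unsigned int wn3_xcvr(int raw)
{
	union wn3_config c;

	c.i = raw;
	return c.u.xcvr;
}
```

Splitting the fields per byte keeps every field inside one u8 storage unit, so the overall 32-bit layout matches the original int-based declaration on such a target.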


diff --git a/drivers/net/pcmcia/3c574_cs.c b/drivers/net/pcmcia/3c574_cs.c
index ad134a6..97b6daa 100644
--- a/drivers/net/pcmcia/3c574_cs.c
+++ b/drivers/net/pcmcia/3c574_cs.c
@@ -190,10 +190,10 @@ enum Window3 {/* Window 3: MAC/config bits. */
 union wn3_config {
int i;
struct w3_config_fields {
-   unsigned int ram_size:3, ram_width:1, ram_speed:2, rom_size:2;
-   int pad8:8;
-   unsigned int ram_split:2, pad18:2, xcvr:3, pad21:1, autoselect:1;
-   int pad24:7;
+   u8 ram_size:3, ram_width:1, ram_speed:2, rom_size:2;
+   u8 pad8;
+   u8 ram_split:2, pad18:2, xcvr:3, pad21:1;
+   u8 autoselect:1, pad24:7;
} u;
 };
 


[PATCH 3/6] pcmcia/axnet_cs: Make functions static

2007-12-10 Thread Richard Knutsson
Fixing:
  CHECK   drivers/net/pcmcia/axnet_cs.c
drivers/net/pcmcia/axnet_cs.c:994:5: warning: symbol 'ax_close' was not declared. Should it be static?
drivers/net/pcmcia/axnet_cs.c:1017:6: warning: symbol 'ei_tx_timeout' was not declared. Should it be static?

Signed-off-by: Richard Knutsson <[EMAIL PROTECTED]>
---


diff --git a/drivers/net/pcmcia/axnet_cs.c b/drivers/net/pcmcia/axnet_cs.c
index 8d910a3..96931cc 100644
--- a/drivers/net/pcmcia/axnet_cs.c
+++ b/drivers/net/pcmcia/axnet_cs.c
@@ -991,7 +991,7 @@ static int ax_open(struct net_device *dev)
  *
  * Opposite of ax_open(). Only used when "ifconfig  down" is done.
  */
-int ax_close(struct net_device *dev)
+static int ax_close(struct net_device *dev)
 {
unsigned long flags;
 
@@ -1014,7 +1014,7 @@ int ax_close(struct net_device *dev)
  * completed (or failed) - i.e. never posted a Tx related interrupt.
  */
 
-void ei_tx_timeout(struct net_device *dev)
+static void ei_tx_timeout(struct net_device *dev)
 {
long e8390_base = dev->base_addr;
struct ei_device *ei_local = (struct ei_device *) netdev_priv(dev);


Re: [RFC] [PATCH] A clean approach to writeout throttling

2007-12-10 Thread Daniel Phillips
On Monday 10 December 2007 13:31, Jonathan Corbet wrote:
> Hey, Daniel,
>
> I'm just getting around to looking at this.  One thing jumped out at me:
> > +   if (bio->bi_throttle) {
> > +   struct request_queue *q = bio->bi_queue;
> > +   bio->bi_throttle = 0; /* or detect multiple endio and err? */
> > +   atomic_add(bio->bi_throttle, &q->available);
> > +   wake_up(&q->throttle_wait);
> > +   }
>
> I'm feeling like I must be really dumb, but...how can that possibly
> work?  You're zeroing bio->bi_throttle before adding it back into
> q->available, so the latter will never increase...

Hi Jon,

Don't you know?  These days we optimize all our code for modern
processors with tunnelling instructions and metaphysical cache.
On such processors, setting a register to zero does not entirely
destroy all the data that used to be in the register, so subsequent
instructions can make further use of the overwritten data by
reconstructing it from remnants of bits left attached to the edges of
the register.

Um, yeah, that's it.

Actually, I fat-fingered it in the merge to -mm.  Thanks for the catch,
corrected patch attached.

The offending line isn't even a functional part of the algorithm; it is
just supposed to defend against the possibility that, somehow,
->bi_endio gets called multiple times.  It should probably be
something like:

BUG_ON(bio->bi_throttle == -1);
if (bio->bi_throttle) {
...
bio->bi_throttle = -1;

Or perhaps we should just rely on nobody ever making that mistake
and let somebody else catch it if it does.
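
A userspace sketch of the accounting, assuming C11 atomics in place of atomic_t and assert() in place of BUG_ON(); the type and function names are hypothetical. The release path gives the units back before poisoning the field, as the corrected patch orders it, and then applies the -1 sentinel suggested above. The admission path swaps in a compare-and-swap loop, which is not what the patch does; it is one way to close the check-then-subtract window noted by the FIXME in __generic_make_request().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the request_queue and bio fields in the patch. */
struct throttle_queue {
	atomic_int available;		/* units of capacity not yet claimed */
};

struct throttle_bio {
	int bi_throttle;		/* units held; -1 once released */
	struct throttle_queue *bi_queue;
};

/*
 * Claim "need" units atomically: the compare-and-swap retries if another
 * submitter changed "available" in between. Returns false if capacity is
 * short; a real caller would sleep on a wait queue and retry.
 */
static bool throttle_try_take(struct throttle_queue *q, int need)
{
	int avail = atomic_load(&q->available);

	while (avail >= need) {
		if (atomic_compare_exchange_weak(&q->available, &avail,
						 avail - need))
			return true;
		/* a failed CAS reloaded avail; the loop re-checks it */
	}
	return false;
}

/*
 * Release at completion: return the units first, then poison the field so
 * a second completion trips the assertion instead of releasing twice.
 */
static void throttle_release(struct throttle_bio *bio)
{
	assert(bio->bi_throttle != -1);
	if (bio->bi_throttle) {
		atomic_fetch_add(&bio->bi_queue->available, bio->bi_throttle);
		bio->bi_throttle = -1;
		/* the kernel patch would also wake_up() waiters here */
	}
}
```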

Regards,

Daniel
--- 2.6.24-rc3-mm.clean/block/ll_rw_blk.c	2007-12-04 14:45:25.0 -0800
+++ 2.6.24-rc3-mm/block/ll_rw_blk.c	2007-12-10 04:49:56.0 -0800
@@ -3210,9 +3210,9 @@ static inline int bio_check_eod(struct b
  */
 static inline void __generic_make_request(struct bio *bio)
 {
-	struct request_queue *q;
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	sector_t old_sector;
-	int ret, nr_sectors = bio_sectors(bio);
+	int nr_sectors = bio_sectors(bio);
 	dev_t old_dev;
 	int err = -EIO;
 
@@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
 	if (bio_check_eod(bio, nr_sectors))
 		goto end_io;
 
+	if (q && q->metric && !bio->bi_queue) {
+		int need = bio->bi_throttle = q->metric(bio);
+		bio->bi_queue = q;
+		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
+		wait_event(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
 	/*
 	 * Resolve the mapping until finished. (drivers are
 	 * still free to implement/resolve their own stacking
@@ -3231,10 +3238,9 @@ static inline void __generic_make_reques
 	 */
 	old_sector = -1;
 	old_dev = 0;
-	do {
+	while (1) {
 		char b[BDEVNAME_SIZE];
 
-		q = bdev_get_queue(bio->bi_bdev);
 		if (!q) {
 			printk(KERN_ERR
 			   "generic_make_request: Trying to access "
@@ -3282,8 +3288,10 @@ end_io:
 			goto end_io;
 		}
 
-		ret = q->make_request_fn(q, bio);
-	} while (ret);
+		if (!q->make_request_fn(q, bio))
+			return;
+		q = bdev_get_queue(bio->bi_bdev);
+	}
 }
 
 /*
--- 2.6.24-rc3-mm.clean/drivers/md/dm.c	2007-12-04 14:46:04.0 -0800
+++ 2.6.24-rc3-mm/drivers/md/dm.c	2007-12-04 23:31:41.0 -0800
@@ -889,6 +889,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-
  * An IDR is used to keep track of allocated minor numbers.
  *---*/
@@ -967,6 +972,7 @@ out:
 
 static struct block_device_operations dm_blk_dops;
 
+#define DEFAULT_THROTTLE_CAPACITY 1000
 /*
  * Allocate and initialise a blank device with a given minor.
  */
@@ -1009,6 +1015,11 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	/* A dm device constructor may change the throttle capacity */
+	atomic_set(&md->queue->available, md->queue->capacity = DEFAULT_THROTTLE_CAPACITY);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.24-rc3-mm.clean/fs/bio.c	2007-12-04 14:38:47.0 -0800
+++ 2.6.24-rc3-mm/fs/bio.c	2007-12-04 23:31:41.0 -0800
@@ -1007,6 +1007,13 @@ void bio_endio(struct bio *bio, int erro
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
 
+	if (bio->bi_throttle) {
+		struct request_queue *q = bio->bi_queue;
+		atomic_add(bio->bi_throttle, &q->available);
+		bio->bi_throttle = 0; /* or detect multiple endio and err? */
+		wake_up(&q->throttle_wait);
+	}
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, error);
 }
--- 2.6.24-rc3-mm.clean/include/linux/bio.h	2007-12-04 

Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

2007-12-10 Thread Neil Horman
On Mon, Dec 10, 2007 at 06:08:03PM -0700, Eric W. Biederman wrote:
> Neil Horman <[EMAIL PROTECTED]> writes:
> 

> 
> Ok.  This test is broken.  Please remove the == 1.  You are looking
> for == (1 << 18).  So just saying: "if (htcfg & (1 << 18))" should be clearer.
> 
Fixed.  Thanks!
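
Why the original test could never fire: masking with (1 << 18) yields either 0 or 0x40000, so comparing the result with 1 is always false. A sketch with hypothetical macro names for the two bits of register 0x68; their meanings are taken from the printk messages in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the two bits the quirk uses in register 0x68. */
#define HT_EXT_APIC_ID		(1u << 18)	/* extended APIC IDs in use */
#define HT_EXT_APIC_BCAST	(1u << 17)	/* broadcast to extended IDs */

/* Original test: the masked value is 0 or 0x40000, never 1. */
static int ext_apic_broken(uint32_t htcfg)
{
	return (htcfg & HT_EXT_APIC_ID) == 1;
}

/* Corrected test: any nonzero masked value means the bit is set. */
static int ext_apic_fixed(uint32_t htcfg)
{
	return (htcfg & HT_EXT_APIC_ID) != 0;
}

/* Value the quirk would write back (unchanged if there is nothing to do). */
static uint32_t ht_fixup(uint32_t htcfg)
{
	if (ext_apic_fixed(htcfg) && !(htcfg & HT_EXT_APIC_BCAST))
		htcfg |= HT_EXT_APIC_BCAST;
	return htcfg;
}
```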

> > + printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
> > +   if ((htcfg & (1 << 17)) == 0) {
> > + printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
> > +   htcfg |= (1 << 17);
> > +   write_pci_config(num, slot, func, 0x68, htcfg);
> > +   }
> > +   }
> > +   
> > +}
> 
> The rest of this quirk looks fine, include the fact it is only intended
> to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.
> 
Copy that.

> 
> For what is below I don't like the way the infrastructure has been
> extended as what you are doing quickly devolves into a big mess.
> 
> Please extend struct chipset to be something like:
> struct chipset {
>   u16 vendor;
>   u16 device;
> u32 class, class_mask;
>   void (*f)(void);
> };
> 
> And then the test for matching the chipset can be something like:
>   if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
>   (id->device == PCI_ANY_ID || id->device == dev->device) &&
>   !((id->class ^ dev->class) & id->class_mask))
> 
> Essentially a subset of pci_match_one_device from drivers/pci/pci.h
> 
> That way you don't need to increase the number of tables or the
> number of passes through the pci busses, just update the early_qrk
> table with a few more bits of information.
> 
copy that.  Fixed.  Thanks!

> The extended form should be much more maintainable in the long
> run.  Given that we may want this before we enable the timer
> which is very early doing this in the pci early quirks seems
> to make sense.
> 
> Eric


New patch attached, with suggestions incorporated.

Thanks & regards
Neil

Signed-off-by: Neil Horman <[EMAIL PROTECTED]>


 early-quirks.c |   82 ++---
 1 file changed, 73 insertions(+), 9 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..4b0cee1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
 #endif /* CONFIG_X86_IO_APIC */
 #endif /* CONFIG_ACPI */
 
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+   u32 htcfg;
+   /*
+*we found a hypertransport bus
+*make sure that are broadcasting
+*interrupts to all cpus on the ht bus
+*if we're using extended apic ids
+*/
+   htcfg = read_pci_config(num, slot, func, 0x68);
+   if (htcfg & (1 << 18)) {
+   printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+   if ((htcfg & (1 << 17)) == 0) {
+   printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+   htcfg |= (1 << 17);
+   write_pci_config(num, slot, func, 0x68, htcfg);
+   }
+   }
+   
+}
+
+static void __init check_hypertransport_config()
+{
+   int num, slot, func;
+   u32 device, vendor;
+   func = 0;
+   for (num = 0; num < 32; num++) {
+   for (slot = 0; slot < 32; slot++) {
+   vendor = read_pci_config(num,slot,func,
+   PCI_VENDOR_ID); 
+   device = read_pci_config(num,slot,func,
+   PCI_DEVICE_ID);
+   vendor &= 0xffff;
+   device >>= 16;
+   if ((vendor == PCI_VENDOR_ID_AMD) &&
+   (device == PCI_DEVICE_ID_AMD_K8_NB))
+   fix_hypertransport_config(num,slot,func);
+   }
+   }
+
+   return;
+
+}
+
 static void __init nvidia_bugs(void)
 {
 #ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
 #endif
 }
 
+static void __init amd_host_bugs(void)
+{
+   printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
+   check_hypertransport_config();
+}
+
 struct chipset {
u16 vendor;
+   u16 device;
+   u32 class;
+   u32 class_mask;
void (*f)(void);
 };
 
 static struct chipset early_qrk[] __initdata = {
-   { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
-   { PCI_VENDOR_ID_VIA, via_bugs },
-   { PCI_VENDOR_ID_ATI, ati_bugs },
+   { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+   { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+   { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+   { 

[PATCH 2/4 v2] added methods for sched_class changes

2007-12-10 Thread Steven Rostedt
Dmitry Adamushko found that the current implementation of the RT
balancing code did not account for class and priority changes made by
sched_setscheduler and rt_mutex_setprio.

This patch addresses this issue by adding methods to the schedule classes to
handle being switched out of (switched_from) and being switched into
(switched_to) a sched_class. A method for handling priority changes
(prio_changed) is also added.

This patch also removes some duplicate logic between rt_mutex_setprio and
sched_setscheduler.
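
A userspace sketch of the dispatch that check_class_changed() performs, with the new class passed explicitly and counting hooks standing in for the real class methods. demo_rt, demo_fair, class_change_sketch() and the count_* functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct rq;			/* opaque in this sketch */
struct task_struct;

/* Only the three hooks added by the patch. */
struct sched_class {
	void (*switched_from)(struct rq *rq, struct task_struct *p, int running);
	void (*switched_to)(struct rq *rq, struct task_struct *p, int running);
	void (*prio_changed)(struct rq *rq, struct task_struct *p,
			     int oldprio, int running);
};

/* Same decision as check_class_changed(), with the new class passed in. */
static void class_change_sketch(struct rq *rq, struct task_struct *p,
				const struct sched_class *prev_class,
				const struct sched_class *new_class,
				int oldprio, int running)
{
	if (prev_class != new_class) {
		if (prev_class->switched_from)	/* optional hook */
			prev_class->switched_from(rq, p, running);
		new_class->switched_to(rq, p, running);
	} else {
		new_class->prio_changed(rq, p, oldprio, running);
	}
}

/* Hypothetical hooks that only count how often they ran. */
static int from_calls, to_calls, prio_calls;

static void count_from(struct rq *rq, struct task_struct *p, int running)
{
	(void)rq; (void)p; (void)running;
	from_calls++;
}

static void count_to(struct rq *rq, struct task_struct *p, int running)
{
	(void)rq; (void)p; (void)running;
	to_calls++;
}

static void count_prio(struct rq *rq, struct task_struct *p,
		       int oldprio, int running)
{
	(void)rq; (void)p; (void)oldprio; (void)running;
	prio_calls++;
}

/* demo_rt provides switched_from; demo_fair leaves it NULL. */
static const struct sched_class demo_rt = { count_from, count_to, count_prio };
static const struct sched_class demo_fair = { NULL, count_to, count_prio };
```

switched_from is treated as optional, as in the patch, while switched_to and prio_changed are assumed to be provided by every class.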

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/sched.h   |7 +++
 kernel/sched.c  |   42 ++
 kernel/sched_fair.c |   39 +
 kernel/sched_idletask.c |   31 
 kernel/sched_rt.c   |   89 
 5 files changed, 186 insertions(+), 22 deletions(-)

Index: linux-sched/include/linux/sched.h
===
--- linux-sched.orig/include/linux/sched.h  2007-12-10 20:39:14.0 -0500
+++ linux-sched/include/linux/sched.h   2007-12-10 20:39:17.0 -0500
@@ -860,6 +860,13 @@ struct sched_class {
 
void (*join_domain)(struct rq *rq);
void (*leave_domain)(struct rq *rq);
+
+   void (*switched_from) (struct rq *this_rq, struct task_struct *task,
+  int running);
+   void (*switched_to) (struct rq *this_rq, struct task_struct *task,
+int running);
+   void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
+int oldprio, int running);
 };
 
 struct load_weight {
Index: linux-sched/kernel/sched.c
===
--- linux-sched.orig/kernel/sched.c 2007-12-10 20:39:14.0 -0500
+++ linux-sched/kernel/sched.c  2007-12-10 20:39:17.0 -0500
@@ -1147,6 +1147,18 @@ static inline void __set_task_cpu(struct
 #endif
 }
 
+static inline void check_class_changed(struct rq *rq, struct task_struct *p,
+  const struct sched_class *prev_class,
+  int oldprio, int running)
+{
+   if (prev_class != p->sched_class) {
+   if (prev_class->switched_from)
+   prev_class->switched_from(rq, p, running);
+   p->sched_class->switched_to(rq, p, running);
+   } else
+   p->sched_class->prio_changed(rq, p, oldprio, running);
+}
+
 #ifdef CONFIG_SMP
 
 /*
@@ -4012,6 +4024,7 @@ void rt_mutex_setprio(struct task_struct
unsigned long flags;
int oldprio, on_rq, running;
struct rq *rq;
+   const struct sched_class *prev_class = p->sched_class;
 
BUG_ON(prio < 0 || prio > MAX_PRIO);
 
@@ -4037,18 +4050,10 @@ void rt_mutex_setprio(struct task_struct
if (on_rq) {
if (running)
p->sched_class->set_curr_task(rq);
+
enqueue_task(rq, p, 0);
-   /*
-* Reschedule if we are currently running on this runqueue and
-* our priority decreased, or if we are not currently running on
-* this runqueue and our priority is higher than the current's
-*/
-   if (running) {
-   if (p->prio > oldprio)
-   resched_task(rq->curr);
-   } else {
-   check_preempt_curr(rq, p);
-   }
+
+   check_class_changed(rq, p, prev_class, oldprio, running);
}
task_rq_unlock(rq, &flags);
 }
@@ -4248,6 +4253,7 @@ int sched_setscheduler(struct task_struc
 {
int retval, oldprio, oldpolicy = -1, on_rq, running;
unsigned long flags;
+   const struct sched_class *prev_class = p->sched_class;
struct rq *rq;
 
/* may grab non-irq protected spin_locks */
@@ -4341,18 +4347,10 @@ recheck:
if (on_rq) {
if (running)
p->sched_class->set_curr_task(rq);
+
activate_task(rq, p, 0);
-   /*
-* Reschedule if we are currently running on this runqueue and
-* our priority decreased, or if we are not currently running on
-* this runqueue and our priority is higher than the current's
-*/
-   if (running) {
-   if (p->prio > oldprio)
-   resched_task(rq->curr);
-   } else {
-   check_preempt_curr(rq, p);
-   }
+
+   check_class_changed(rq, p, prev_class, oldprio, running);
}
__task_rq_unlock(rq);
spin_unlock_irqrestore(&p->pi_lock, flags);
Index: linux-sched/kernel/sched_fair.c
===
--- linux-sched.orig/kernel/sched_fair.c2007-12-10 20:39:11.0 

[PATCH 1/4 v2] Replace hooks with pre/post schedule and wakeup methods

2007-12-10 Thread Steven Rostedt
To make the main sched.c code more agnostic to the schedule classes,
the specific hooks in the schedule code for RT class balancing are
replaced with pre_schedule, post_schedule and task_wake_up methods.
These methods may be used by any of the classes, but currently only
the sched_rt class implements them.
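The core idea, that generic scheduler code calls a per-class method only when the class supplies one, can be sketched in userspace C. All names here (struct fake_rq, struct class_hooks, core_schedule(), core_wake_up()) are hypothetical, the CONFIG_SMP guard is omitted, and the counters merely record which hooks ran. As in the patch, pre_schedule comes from the outgoing task's class and post_schedule from the class of the task running after the switch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical runqueue: only records which hooks ran. */
struct fake_rq {
	int pre_calls, post_calls, wake_calls;
};

/* Hypothetical subset of struct sched_class: every member may be NULL. */
struct class_hooks {
	void (*pre_schedule)(struct fake_rq *rq);
	void (*post_schedule)(struct fake_rq *rq);
	void (*task_wake_up)(struct fake_rq *rq);
};

static void rt_pre(struct fake_rq *rq)  { rq->pre_calls++; }
static void rt_post(struct fake_rq *rq) { rq->post_calls++; }
static void rt_wake(struct fake_rq *rq) { rq->wake_calls++; }

/* Only the RT-like class implements the hooks, as in the patch. */
static const struct class_hooks rt_hooks = { rt_pre, rt_post, rt_wake };
static const struct class_hooks fair_hooks = { NULL, NULL, NULL };

/* Generic core code: no class-specific names, just NULL-checked calls. */
static void core_schedule(struct fake_rq *rq, const struct class_hooks *prev,
			  const struct class_hooks *next)
{
	if (prev->pre_schedule)
		prev->pre_schedule(rq);
	/* ... pick the next task and context switch here ... */
	if (next->post_schedule)
		next->post_schedule(rq);
}

static void core_wake_up(struct fake_rq *rq, const struct class_hooks *cls)
{
	if (cls->task_wake_up)
		cls->task_wake_up(rq);
}
```

Adding a hook to another class then needs no change to the core code, which is the point of replacing the RT-specific calls.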

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/sched.h |3 +++
 kernel/sched.c|   20 
 kernel/sched_rt.c |   17 +++--
 3 files changed, 26 insertions(+), 14 deletions(-)

Index: linux-sched/include/linux/sched.h
===
--- linux-sched.orig/include/linux/sched.h  2007-12-10 20:39:11.0 -0500
+++ linux-sched/include/linux/sched.h   2007-12-10 20:39:14.0 -0500
@@ -848,6 +848,9 @@ struct sched_class {
int (*move_one_task) (struct rq *this_rq, int this_cpu,
  struct rq *busiest, struct sched_domain *sd,
  enum cpu_idle_type idle);
+   void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
+   void (*post_schedule) (struct rq *this_rq);
+   void (*task_wake_up) (struct rq *this_rq, struct task_struct *task);
 #endif
 
void (*set_curr_task) (struct rq *rq);
Index: linux-sched/kernel/sched.c
===
--- linux-sched.orig/kernel/sched.c 2007-12-10 20:39:11.0 -0500
+++ linux-sched/kernel/sched.c  2007-12-10 20:39:14.0 -0500
@@ -1620,7 +1620,10 @@ out_activate:
 
 out_running:
p->state = TASK_RUNNING;
-   wakeup_balance_rt(rq, p);
+#ifdef CONFIG_SMP
+   if (p->sched_class->task_wake_up)
+   p->sched_class->task_wake_up(rq, p);
+#endif
 out:
task_rq_unlock(rq, &flags);
 
@@ -1743,7 +1746,10 @@ void fastcall wake_up_new_task(struct ta
inc_nr_running(p, rq);
}
check_preempt_curr(rq, p);
-   wakeup_balance_rt(rq, p);
+#ifdef CONFIG_SMP
+   if (p->sched_class->task_wake_up)
+   p->sched_class->task_wake_up(rq, p);
+#endif
task_rq_unlock(rq, &flags);
 }
 
@@ -1864,7 +1870,10 @@ static void finish_task_switch(struct rq
prev_state = prev->state;
finish_arch_switch(prev);
finish_lock_switch(rq, prev);
-   schedule_tail_balance_rt(rq);
+#ifdef CONFIG_SMP
+   if (current->sched_class->post_schedule)
+   current->sched_class->post_schedule(rq);
+#endif
 
fire_sched_in_preempt_notifiers(current);
if (mm)
@@ -3633,7 +3642,10 @@ need_resched_nonpreemptible:
switch_count = &prev->nvcsw;
}
 
-   schedule_balance_rt(rq, prev);
+#ifdef CONFIG_SMP
+   if (prev->sched_class->pre_schedule)
+   prev->sched_class->pre_schedule(rq, prev);
+#endif
 
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);
Index: linux-sched/kernel/sched_rt.c
===
--- linux-sched.orig/kernel/sched_rt.c  2007-12-10 20:39:11.0 -0500
+++ linux-sched/kernel/sched_rt.c   2007-12-10 20:39:14.0 -0500
@@ -689,14 +689,14 @@ static int pull_rt_task(struct rq *this_
return ret;
 }
 
-static void schedule_balance_rt(struct rq *rq, struct task_struct *prev)
+static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
 {
/* Try to pull RT tasks here if we lower this rq's prio */
if (unlikely(rt_task(prev)) && rq->rt.highest_prio > prev->prio)
pull_rt_task(rq);
 }
 
-static void schedule_tail_balance_rt(struct rq *rq)
+static void post_schedule_rt(struct rq *rq)
 {
/*
 * If we have more than one rt_task queued, then
@@ -713,10 +713,9 @@ static void schedule_tail_balance_rt(str
 }
 
 
-static void wakeup_balance_rt(struct rq *rq, struct task_struct *p)
+static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
 {
-   if (unlikely(rt_task(p)) &&
-   !task_running(rq, p) &&
+   if (!task_running(rq, p) &&
(p->prio >= rq->rt.highest_prio) &&
rq->rt.overloaded)
push_rt_tasks(rq);
@@ -780,11 +779,6 @@ static void leave_domain_rt(struct rq *r
if (rq->rt.overloaded)
rt_clear_overload(rq);
 }
-
-#else /* CONFIG_SMP */
-# define schedule_tail_balance_rt(rq)  do { } while (0)
-# define schedule_balance_rt(rq, prev) do { } while (0)
-# define wakeup_balance_rt(rq, p)  do { } while (0)
 #endif /* CONFIG_SMP */
 
 static void task_tick_rt(struct rq *rq, struct task_struct *p)
@@ -838,6 +832,9 @@ const struct sched_class rt_sched_class 
.set_cpus_allowed   = set_cpus_allowed_rt,
.join_domain= join_domain_rt,
.leave_domain   = leave_domain_rt,
+   .pre_schedule   = pre_schedule_rt,
+   .post_schedule  = post_schedule_rt,
+   

[PATCH 4/4 v2] Subject: SCHED - Clean up some old cpuset logic

2007-12-10 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

We had support for overlapping cpuset based rto logic in early prototypes that
is no longer used, so clean it up.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |   33 -
 1 file changed, 33 deletions(-)

Index: linux-sched/kernel/sched_rt.c
===
--- linux-sched.orig/kernel/sched_rt.c  2007-12-10 20:39:19.0 -0500
+++ linux-sched/kernel/sched_rt.c   2007-12-10 20:39:21.0 -0500
@@ -586,38 +586,6 @@ static int pull_rt_task(struct rq *this_
continue;
 
src_rq = cpu_rq(cpu);
-   if (unlikely(src_rq->rt.rt_nr_running <= 1)) {
-   /*
-* It is possible that overlapping cpusets
-* will miss clearing a non overloaded runqueue.
-* Clear it now.
-*/
-   if (double_lock_balance(this_rq, src_rq)) {
-   /* unlocked our runqueue lock */
-   struct task_struct *old_next = next;
-
-   next = pick_next_task_rt(this_rq);
-   if (next != old_next)
-   ret = 1;
-   }
-   if (likely(src_rq->rt.rt_nr_running <= 1)) {
-   /*
-* Small chance that this_rq->curr changed
-* but it's really harmless here.
-*/
-   rt_clear_overload(this_rq);
-   } else {
-   /*
-* Heh, the src_rq is now overloaded, since
-* we already have the src_rq lock, go straight
-* to pulling tasks from it.
-*/
-   goto try_pulling;
-   }
-   spin_unlock(&src_rq->lock);
-   continue;
-   }
-
/*
 * We can potentially drop this_rq's lock in
 * double_lock_balance, and another CPU could
@@ -641,7 +609,6 @@ static int pull_rt_task(struct rq *this_
continue;
}
 
- try_pulling:
p = pick_next_highest_task_rt(src_rq, this_cpu);
 
/*

-- 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/4 v2] RT balance updates against sched-devel

2007-12-10 Thread Steven Rostedt

[Sorry if this is a repost, but I had a problem with quilt mail, and 
 I don't know if my original post made it out. Unfortunately, I didn't
 save the original "prolog" file, and so this has to be rewritten
 from scratch, and I don't even remember the original subject :-/ ]


This patch series goes against Ingo's sched-devel git tree.

The first patch addresses Ingo's concerns about having hooks in the main
sched.c and replaces them with generic methods that any class may use.
The methods are pre_schedule, post_schedule and task_wake_up, which
are called before the schedule, after a context switch, and when a task
wakes up, respectively. They are surrounded by #ifdef CONFIG_SMP since they
are currently only used by sched_rt in SMP mode. But if this appears to
be applicable to other sched_classes in UP, then I can rerun this series
without the ifdefs.

The second patch addresses the concerns that Dmitry brought up showing that
the current RT balancing neglected to handle changes in prio and
classes from sched_setscheduler and rt_mutex_setprio. The added methods
are switched_to, switched_from and prio_changed; these are called
when a task is assigned a new sched_class, after it leaves
a sched_class, and when it changes its prio, respectively.

The last two patches are from Gregory Haskins, where he cleaned up leftover
changes from previous versions of the balancing code.

-- Steve



[PATCH 3/4 v2] SCHED - Only adjust overload state when changing

2007-12-10 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

The overload set/clears were originally idempotent when this logic was first
implemented.  But that is no longer true due to the addition of the atomic
counter and this logic was never updated to work properly with that change.
So only adjust the overload state if it is actually changing to avoid
getting out of sync.
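A small model may make the drift easier to see. It assumes, as the description states, that setting overload increments a shared counter and clearing it decrements the counter; rto_count, struct model_rq and the helper names are hypothetical stand-ins rather than the kernel's code.

```c
#include <assert.h>

/* Hypothetical global count of overloaded runqueues (atomic in the kernel). */
static int rto_count;

struct model_rq {
	int overloaded;		/* per-rq flag, like rq->rt.overloaded */
	int nr_running;
	int nr_migratory;
};

static void set_overload(void)   { rto_count++; }
static void clear_overload(void) { rto_count--; }

/* Old behaviour: adjusts the counter on every update, changed or not. */
static void update_unguarded(struct model_rq *rq)
{
	if (rq->nr_migratory && rq->nr_running > 1) {
		set_overload();
		rq->overloaded = 1;
	} else {
		clear_overload();
		rq->overloaded = 0;
	}
}

/* Patched behaviour: only touch the counter on a real state transition. */
static void update_guarded(struct model_rq *rq)
{
	if (rq->nr_migratory && rq->nr_running > 1) {
		if (!rq->overloaded) {
			set_overload();
			rq->overloaded = 1;
		}
	} else if (rq->overloaded) {
		clear_overload();
		rq->overloaded = 0;
	}
}
```

Calling update_unguarded() twice on one overloaded runqueue counts it twice, and on a runqueue that was never overloaded it pushes the counter below zero; the guarded version keeps the counter equal to the number of runqueues whose flag is set.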

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-sched/kernel/sched_rt.c
===
--- linux-sched.orig/kernel/sched_rt.c  2007-12-10 20:39:17.0 -0500
+++ linux-sched/kernel/sched_rt.c   2007-12-10 20:39:19.0 -0500
@@ -34,9 +34,11 @@ static inline void rt_clear_overload(str
 static void update_rt_migration(struct rq *rq)
 {
if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) {
-   rt_set_overload(rq);
-   rq->rt.overloaded = 1;
-   } else {
+   if (!rq->rt.overloaded) {
+   rt_set_overload(rq);
+   rq->rt.overloaded = 1;
+   }
+   } else if (rq->rt.overloaded) {
rt_clear_overload(rq);
rq->rt.overloaded = 0;
}

-- 


Re: [PATCH][SCSI] hptiop: add more adapter models and other fixes

2007-12-10 Thread HighPoint Linux Team

Matthew Wilcox wrote:
>> - add more PCI device IDs
>> - support for adapters based on Marvell IOP
>
> Are you sure it's a good idea to do this?  This patch is 1200 lines long
> ... the same size as the existing driver:
>
> $ wc drivers/scsi/hptiop.*
>   947  2273 24531 drivers/scsi/hptiop.c
>   256   612  6175 drivers/scsi/hptiop.h
>   1203  2885 30706 total
>
> That suggests to me there's not much commonality between the two drivers,
> and you'd be better off adding a second driver for the 4xxx cards

The new adapter implementation adds about 300 lines of code to the
driver (some lines in the original driver were changed slightly to
accommodate the difference). It differs from the original models only in
the messaging interface, and still shares the same firmware command block
structures and work flow.

HighPoint Linux Team



[DOC][for -mm] update Documentation/controller/memory.txt

2007-12-10 Thread KAMEZAWA Hiroyuki
Balbir-san, could you review this update ?

--
Documentation updates for memory controller.

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

Index: linux-2.6.24-rc4-mm1/Documentation/controllers/memory.txt
===
--- linux-2.6.24-rc4-mm1.orig/Documentation/controllers/memory.txt
+++ linux-2.6.24-rc4-mm1/Documentation/controllers/memory.txt
@@ -9,8 +9,7 @@ d. Provides a double LRU: global memory 
global LRU; a cgroup on hitting a limit, reclaims from the per
cgroup LRU
 
-NOTE: Page Cache (unmapped) also includes Swap Cache pages as a subset
-and will not be referred to explicitly in the rest of the documentation.
+NOTE: Swap Cache (unmapped) is not accounted now.
 
 Benefits and Purpose of the memory controller
 
@@ -144,7 +143,7 @@ list.
 The memory controller uses the following hierarchy
 
 1. zone->lru_lock is used for selecting pages to be isolated
-2. mem->lru_lock protects the per cgroup LRU
+2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone)
 3. lock_page_cgroup() is used to protect page->page_cgroup
 
 3. User Interface
@@ -193,6 +192,15 @@ this file after a write to guarantee the
 The memory.failcnt field gives the number of times that the cgroup limit was
 exceeded.
 
+The memory.stat file gives accounting information. Currently, the numbers
+of cache, RSS, and active/inactive pages are shown.
+
+The memory.force_empty file provides an interface to drop *all* charges by force.
+
+# echo -n 1 > memory.force_empty
+
+will drop all charges in the cgroup. Currently, this is maintained for testing.
+
 4. Testing
 
 Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
@@ -222,11 +230,8 @@ reclaimed.
 
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
 cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. If some pages are still left, after following
-the steps listed in sections 4.1 and 4.2, check the Swap Cache usage in
-/proc/meminfo to see if the Swap Cache usage is showing up in the
-cgroups memory.usage_in_bytes counter. A simple test of swapoff -a and
-swapon -a should free any pending Swap Cache usage.
+tasks have migrated away from it. Such charges are automatically dropped at
+rmdir() if there are no tasks.
 
 4.4 Choosing what to account  -- Page Cache (unmapped) vs RSS (mapped)?
 
@@ -238,15 +243,11 @@ echo -n 1 > memory.control_type
 5. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
-2. Improve the user interface to accept/display memory limits in KB or MB
-   rather than pages (since page sizes can differ across platforms/machines).
-3. Make cgroup lists per-zone
-4. Make per-cgroup scanner reclaim not-shared pages first
-5. Teach controller to account for shared-pages
-6. Start reclamation when the limit is lowered
-7. Start reclamation in the background when the limit is
+2. Make per-cgroup scanner reclaim not-shared pages first
+3. Teach controller to account for shared-pages
+4. Start reclamation when the limit is lowered
+5. Start reclamation in the background when the limit is
not yet hit but the usage is getting closer
-8. Create per zone LRU lists per cgroup
 
 Summary
 



Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump

2007-12-10 Thread Eric W. Biederman
"Huang, Ying" <[EMAIL PROTECTED]> writes:

> This patch implements the functionality of jumping between the kexeced
> kernel and the original kernel.
>
> To support jumping between two kernels, before jumping to (executing)
> the new kernel and jumping back to the original kernel, the devices
> are put into quiescent state, and the state of devices and CPU is
> saved. After jumping back from kexeced kernel and jumping to the new
> kernel, the state of devices and CPU are restored accordingly. The
> devices/CPU state save/restore code of software suspend is called to
> implement corresponding function.
>
> To support jumping without reserving memory, one shadow backup page
> (source page) is allocated for each page used by the new (kexeced) kernel
> (destination page). When doing kexec_load, the image of the new kernel is
> loaded into source pages, and before executing, the destination pages
> and the source pages are swapped, so the contents of destination pages
> are backed up. Before jumping to the new (kexeced) kernel and after
> jumping back to the original kernel, the destination pages and the
> source pages are swapped too.
>
> A jump back protocol for kexec is defined and documented. It is an
> extension to the ordinary function calling protocol, so the facility
> provided by this patch can be used to call an ordinary C function in real
> mode.
>
> A set of flags for sys_kexec_load is added to control which states are
> saved/restored before/after real mode code executes. For example, you
> can specify that the device state and FPU state are saved/restored
> before/after real mode code executes.
>
> The state save/restore code (excluding CPU state) can be overridden
> based on the "command" parameter of kexec jump, because more state
> needs to be saved/restored when hibernating/resuming.
>
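The page-swapping backup described above can be illustrated with ordinary buffers. This is a hypothetical userspace model (PAGE_SZ, NPAGES, swap_page() and swap_pages() are invented for the sketch), not the relocate_kernel assembly: swapping the destination and source pages once stages the new kernel while saving the original contents, and swapping again restores them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 16	/* tiny "page" for illustration only */
#define NPAGES 2

/* Exchange the contents of two pages through a temporary buffer. */
static void swap_page(unsigned char *a, unsigned char *b)
{
	unsigned char tmp[PAGE_SZ];

	memcpy(tmp, a, PAGE_SZ);
	memcpy(a, b, PAGE_SZ);
	memcpy(b, tmp, PAGE_SZ);
}

/* Swap every destination page with its shadow (source) page. */
static void swap_pages(unsigned char dst[][PAGE_SZ],
		       unsigned char src[][PAGE_SZ], size_t n)
{
	for (size_t i = 0; i < n; i++)
		swap_page(dst[i], src[i]);
}
```

Because a swap is its own inverse, the same routine serves both directions, and whatever state the new kernel leaves in the destination pages survives in the source pages until the next jump.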

> Signed-off-by: Huang Ying <[EMAIL PROTECTED]>
>
> ---
>  Documentation/i386/jump_back_protocol.txt |  103 ++
>  arch/powerpc/kernel/machine_kexec.c   |2 
>  arch/ppc/kernel/machine_kexec.c   |2 
>  arch/sh/kernel/machine_kexec.c|2 
>  arch/x86/kernel/machine_kexec_32.c|   88 +---
>  arch/x86/kernel/machine_kexec_64.c|2 
>  arch/x86/kernel/relocate_kernel_32.S | 214 +++---
>  include/asm-x86/kexec_32.h|   39 -
>  include/linux/kexec.h |   40 +
>  kernel/kexec.c|  188 ++
>  kernel/power/Kconfig  |2 
>  kernel/sys.c  |   35 +++-
>  12 files changed, 648 insertions(+), 69 deletions(-)
>
> --- a/arch/x86/kernel/machine_kexec_32.c
> +++ b/arch/x86/kernel/machine_kexec_32.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
>  static u32 kexec_pgd[1024] PAGE_ALIGNED;
> @@ -83,10 +84,14 @@ static void load_segments(void)
>   * reboot code buffer to allow us to avoid allocations
>   * later.
>   *
> - * Currently nothing.
> + * Turn off NX bit for control page.
>   */
>  int machine_kexec_prepare(struct kimage *image)
>  {
> + if (nx_enabled) {
> + change_page_attr(image->control_code_page, 1, PAGE_KERNEL_EXEC);
> + global_flush_tlb();
> + }
>   return 0;
>  }
>  
> @@ -96,25 +101,59 @@ int machine_kexec_prepare(struct kimage 
>   */
>  void machine_kexec_cleanup(struct kimage *image)
>  {
> + if (nx_enabled) {
> + change_page_attr(image->control_code_page, 1, PAGE_KERNEL);
> + global_flush_tlb();
> + }
> +}
> +
> +void machine_kexec(struct kimage *image)
> +{
> + machine_kexec_call(image, NULL, 0);
>  }
>  
>  /*
>   * Do not allocate memory (or fail in any way) in machine_kexec().
>   * We are past the point of no return, committed to rebooting now.
>   */
> -NORET_TYPE void machine_kexec(struct kimage *image)
> +int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
> +  unsigned int argc, va_list args)
>  {

Why do we need var arg support?
Can't we do that with a shim we load from user space?
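For context on the question above: the split being discussed is a variadic front end that packs its arguments into a va_list and passes them to a ..._vcall() worker, the same pattern as printf()/vprintf(). The sketch below is a userspace analogue only; fake_call() and fake_vcall() are hypothetical, drop the kimage argument, and just sum the arguments where the real code would jump to the loaded image.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical worker: consumes argc unsigned long arguments from a va_list. */
static int fake_vcall(unsigned long *ret, unsigned int argc, va_list args)
{
	unsigned long sum = 0;

	for (unsigned int i = 0; i < argc; i++)
		sum += va_arg(args, unsigned long);
	if (ret)
		*ret = sum;	/* stand-in for a value handed back by the callee */
	return 0;
}

/* Hypothetical variadic wrapper, shaped like machine_kexec_call(image, ret, argc, ...). */
static int fake_call(unsigned long *ret, unsigned int argc, ...)
{
	va_list args;
	int rc;

	va_start(args, argc);
	rc = fake_vcall(ret, argc, args);
	va_end(args);
	return rc;
}
```

Because the worker reads each argument with va_arg(args, unsigned long), callers must pass unsigned long values (for example 1UL); passing a plain int would be undefined behaviour. A user-space shim, as Eric suggests, would avoid this by marshalling the parameters into memory it loads itself.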

>   unsigned long page_list[PAGES_NR];
>   void *control_page;
> + asmlinkage NORET_TYPE void
> + (*relocate_kernel_ptr)(unsigned long indirection_page,
> +unsigned long control_page,
> +unsigned long start_address,
> +unsigned int has_pae) ATTRIB_NORET;
>  
>   /* Interrupts aren't acceptable while we reboot */
>   local_irq_disable();
>  
>   control_page = page_address(image->control_code_page);
> - memcpy(control_page, relocate_kernel, PAGE_SIZE);
> + memcpy(control_page, relocate_page, PAGE_SIZE/2);
> + KCALL_MAGIC(control_page) = 0;
>  
> + if (image->preserve_cpu) {
> + unsigned int i;
> + KCALL_MAGIC(control_page) = KCALL_MAGIC_NUMBER;
> 

Re: [PATCH 2.6.24-rc4-mm 2/2] gpiolib: add Generic IRQ support for 16-bit PCA9539 GPIO expander

2007-12-10 Thread eric miao
On Dec 10, 2007 6:14 PM, David Brownell <[EMAIL PROTECTED]> wrote:
> On Monday 10 December 2007, eric miao wrote:
> > +config GPIO_PCA9539_GENERIC_IRQ
> > +bool " Generic IRQ support for PCA9539"
> > +depends on GPIO_PCA9539=y
>
> Also depends on GENERIC_HARDIRQS, right?  (You should let
> the Kconfig UI handle indentation, too...)
>
> Seems like doing this for an I2C chip ought to shake loose
> some interesting review comments.  :)
>
>
> > +help
> > + Say yes here to support the Generic IRQ for the PCA9539 on-chip
> > + GPIO lines.
>
> This somewhat resembles the pcf857x chips in that it only supports
> pin-changed IRQs (IRQ_TYPE_EDGE_BOTH) in hardware.  Some other I/O
> expanders are a bit more flexible.
>
> - Dave
>
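Since the hardware only reports that some input changed (IRQ_TYPE_EDGE_BOTH), the driver has to derive the edge by comparing each new sample of the input register with the previous one, as pca9539_irq_work() in the updated patch does. The helper below isolates that bit arithmetic; struct edge_state and pending_edges() are hypothetical, and the I2C register read is replaced by a function parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-chip edge configuration, mirroring the patch's fields. */
struct edge_state {
	uint16_t last_input;
	uint16_t irq_mask;		/* enabled lines */
	uint16_t irq_rising_edge;	/* lines that want rising edges */
	uint16_t irq_falling_edge;	/* lines that want falling edges */
};

/* Return a bitmask of lines whose IRQ handler should run for this sample. */
static uint16_t pending_edges(struct edge_state *s, uint16_t input)
{
	uint16_t changed = (input ^ s->last_input) & s->irq_mask;
	uint16_t rising = input & changed & s->irq_rising_edge;
	uint16_t falling = (uint16_t)~input & changed & s->irq_falling_edge;

	s->last_input = input;
	return rising | falling;
}
```

A line that toggles in the unwanted direction, or that is masked off, produces no bit in the result, which is how a both-edges interrupt source can still honour per-line rising/falling configuration.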

Updated as follows:

From 486724d8b2b7a668600e38807680cc3a089ad533 Mon Sep 17 00:00:00 2001
From: eric miao <[EMAIL PROTECTED]>
Date: Mon, 10 Dec 2007 17:24:36 +0800
Subject: [PATCH] gpiolib: add Generic IRQ support for 16-bit PCA9539
GPIO expander

This patch adds the generic IRQ support for the PCA9539 on-chip GPIOs.

Note: due to the inaccessibility of the generic IRQ code within modules,
this support is only available if the driver is built-in.

Signed-off-by: eric miao <[EMAIL PROTECTED]>
Acked-by: Ben Gardner <[EMAIL PROTECTED]>
---
 drivers/gpio/Kconfig   |   11 +++-
 drivers/gpio/pca9539.c |  184 
 2 files changed, 194 insertions(+), 1 deletions(-)

diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 6528fce..f897df8 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -40,7 +40,16 @@ config GPIO_PCA9539
  16-bit I/O port.

  This driver can also be built as a module.  If so, the module
- will be called pca9539.
+ will be called pca9539.  Note: the Generic IRQ support for the
+ chip will only be available if the driver is built-in.
+
+config GPIO_PCA9539_GENERIC_IRQ
+   bool "Generic IRQ support for PCA9539"
+   depends on GPIO_PCA9539=y && GENERIC_HARDIRQS
+   help
+ Say yes here to support the Generic IRQ for the PCA9539 on-chip
+ GPIO lines. Only pin-changed IRQs (IRQ_TYPE_EDGE_BOTH) are
+ supported in hardware.

 comment "SPI GPIO expanders:"

diff --git a/drivers/gpio/pca9539.c b/drivers/gpio/pca9539.c
index 0a3ae6a..e736dd9 100644
--- a/drivers/gpio/pca9539.c
+++ b/drivers/gpio/pca9539.c
@@ -11,6 +11,9 @@

 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 

@@ -27,9 +30,25 @@ struct pca9539_chip {
unsigned gpio_start;
uint16_t reg_output;
uint16_t reg_direction;
+   uint16_t last_input;

struct i2c_client *client;
struct gpio_chip gpio_chip;
+#ifdef CONFIG_GPIO_PCA9539_GENERIC_IRQ
+   /*
+* Note: Generic IRQ is not accessible within module code, the IRQ
+* support will thus _only_ be available if the driver is built-in
+*/
+   int irq;/* IRQ for the chip itself */
+   int irq_start;  /* starting IRQ for the on-chip GPIO lines */
+
+   uint16_t irq_mask;
+   uint16_t irq_falling_edge;
+   uint16_t irq_rising_edge;
+
+   struct irq_chip irq_chip;
+   struct work_struct irq_work;
+#endif
 };

 static int pca9539_write_reg(struct pca9539_chip *chip, int reg, uint16_t val)
@@ -152,6 +171,150 @@ static int pca9539_init_gpio(struct pca9539_chip *chip)
return gpiochip_add(gc);
 }

+#ifdef CONFIG_GPIO_PCA9539_GENERIC_IRQ
+/* FIXME: change to schedule_delayed_work() here if reading out of
+ * registers does not reflect the actual pin levels
+ */
+
+static void pca9539_irq_work(struct work_struct *work)
+{
+   struct pca9539_chip *chip;
+   uint16_t input, mask, rising, falling;
+   int ret, i;
+
+   chip = container_of(work, struct pca9539_chip, irq_work);
+
+   ret = pca9539_read_reg(chip, PCA9539_INPUT, &input);
+   if (ret < 0)
+   return;
+
+   mask = (input ^ chip->last_input) & chip->irq_mask;
+   rising = (input & mask) & chip->irq_rising_edge;
+   falling = (~input & mask) & chip->irq_falling_edge;
+
+   irq_enter();
+
+   for (i = 0; i < NR_PCA9539_GPIOS; i++) {
+   if ((rising | falling) & (1u << i)) {
+   int irq = chip->irq_start + i;
+   struct irq_desc *desc;
+
+   desc = irq_desc + irq;
+   desc_handle_irq(irq, desc);
+   }
+   }
+
+   irq_exit();
+
+   chip->last_input = input;
+}
+
+static void fastcall
+pca9539_irq_demux(unsigned int irq, struct irq_desc *desc)
+{
+   struct pca9539_chip *chip = desc->handler_data;
+
+   desc->chip->mask(chip->irq);
+   desc->chip->ack(chip->irq);
+   schedule_work(&chip->irq_work);
+   desc->chip->unmask(chip->irq);
+}
+
+static void pca9539_irq_mask(unsigned int irq)
+{
+   struct irq_desc *desc = irq_desc + irq;
+   struct pca9539_chip *chip = desc->chip_data;
+
+   

[PATCH][for -mm] fix accounting in vmscan.c for memory controller

2007-12-10 Thread KAMEZAWA Hiroyuki
Without this, ALLOCSTALL and PGSCAN_DIRECT increase too much even when
there is no global memory shortage.

against 2.6.24-rc4-mm1.

-Kame

==
Some accounting is done during page reclaim.

Now, there are two types of page reclaim (if the memory controller is used):
  - global: shortage of (global) pages.
  - under cgroup: usage reaches the cgroup's limit.

I think two counters, ALLOCSTALL and PGSCAN_DIRECT, should be counted only
during a global LRU scan. They are meant to account for memory shortage at
alloc_pages().
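The rule proposed here can be modelled in a few lines. In this sketch the counters, struct scan_ctl and its global flag are hypothetical stand-ins; the global flag plays the role of scan_global_lru(sc) in the patch, and only the stall and scan counters are shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the vm event counters. */
static unsigned long allocstall, pgscan_direct, pgscan_kswapd;

/* Hypothetical reclaim context: global == true models scan_global_lru(sc). */
struct scan_ctl {
	bool global;
	bool is_kswapd;
};

static void account_scan(const struct scan_ctl *sc, unsigned long nr_scan)
{
	if (sc->is_kswapd)
		pgscan_kswapd += nr_scan;
	else if (sc->global)
		pgscan_direct += nr_scan;	/* cgroup-limit reclaim is not counted */
}

static void account_stall(const struct scan_ctl *sc)
{
	if (sc->global)
		allocstall++;
}
```

Reclaim triggered by a cgroup hitting its limit then leaves both counters untouched, so they keep reflecting real shortages seen by alloc_pages().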

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

 mm/vmscan.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6.24-rc4-mm1/mm/vmscan.c
===
--- linux-2.6.24-rc4-mm1.orig/mm/vmscan.c
+++ linux-2.6.24-rc4-mm1/mm/vmscan.c
@@ -896,8 +896,9 @@ static unsigned long shrink_inactive_lis
if (current_is_kswapd()) {
__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
__count_vm_events(KSWAPD_STEAL, nr_freed);
-   } else
+   } else if (scan_global_lru(sc))
__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scan);
+
__count_zone_vm_events(PGSTEAL, zone, nr_freed);
 
if (nr_taken == 0)
@@ -1333,7 +1334,8 @@ static unsigned long do_try_to_free_page
unsigned long lru_pages = 0;
int i;
 
-   count_vm_event(ALLOCSTALL);
+   if (scan_global_lru(sc))
+   count_vm_event(ALLOCSTALL);
/*
 * mem_cgroup will not do shrink_slab.
 */



[patch 1/1] Writeback fix for concurrent large and small file writes.

2007-12-10 Thread Michael Rubin
From: Michael Rubin <[EMAIL PROTECTED]>

Fix a bug where writing to large files while concurrently writing to
smaller ones creates a situation where writeback cannot keep up with the
traffic and memory balloons until we hit the threshold watermark. This
can result in surprising latency spikes when syncing. This latency
can take minutes on large memory systems. Upon request I can provide
a test to reproduce this situation.

The only concern I have is that this makes wb_kupdate slightly more
aggressive. I am not sure it is enough to cause any problems. I think
there are enough checks to throttle the background activity.

Feng, also: the one-line change that you recommended here
http://marc.info/?l=linux-kernel=119629655402153=2 had no effect.

Signed-off-by: Michael Rubin <[EMAIL PROTECTED]>
---
Index: 2624rc3_feng/fs/fs-writeback.c
===
--- 2624rc3_feng.orig/fs/fs-writeback.c 2007-11-29 14:44:24.0 -0800
+++ 2624rc3_feng/fs/fs-writeback.c  2007-12-10 17:21:45.0 -0800
@@ -408,8 +408,7 @@ sync_sb_inodes(struct super_block *sb, s
 {
const unsigned long start = jiffies;/* livelock avoidance */
 
-   if (!wbc->for_kupdate || list_empty(&sb->s_io))
-   queue_io(sb, wbc->older_than_this);
+   queue_io(sb, wbc->older_than_this);
 
while (!list_empty(&sb->s_io)) {
struct inode *inode = list_entry(sb->s_io.prev,
Index: 2624rc3_feng/mm/page-writeback.c
===
--- 2624rc3_feng.orig/mm/page-writeback.c   2007-11-16 21:16:36.0 -0800
+++ 2624rc3_feng/mm/page-writeback.c2007-12-10 17:37:17.0 -0800
@@ -638,7 +638,7 @@ static void wb_kupdate(unsigned long arg
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
-   if (wbc.encountered_congestion || wbc.more_io)
+   if (wbc.encountered_congestion)
congestion_wait(WRITE, HZ/10);
else
break;  /* All the old data is written */


Re: RFC: outb 0x80 in inb_p, outb_p harmful on some modern AMD64 with MCP51 laptops

2007-12-10 Thread H. Peter Anvin

David Newall wrote:


Exactly.  You think it's 2us, but the documentation doesn't say.  The _p 
functions are generic inasmuch as they provide an unspecified delay.  
Drivers which work across platforms, and which use _p, therefore have 
different delays on different platforms.  Should the length of the delay 
be unimportant?  I wouldn't have thought so.  If it is important, does 
that mean that such drivers are buggy on some platforms?




That the _p delay is different across platforms is actually to be 
expected, since it pretty much amounts to a platform delay.  And yes, if 
it is used as a specific walltime delay that has nothing to do with the 
bus architecture of the system then I would classify that as a driver bug.


-hpa


Re: RFC: outb 0x80 in inb_p, outb_p harmful on some modern AMD64 with MCP51 laptops

2007-12-10 Thread H. Peter Anvin

David Newall wrote:

H. Peter Anvin wrote:

David Newall wrote:
Where did the 8us delay come from?  The documentation and source is 
careful not to say how long the delay is.  Would changing it to, say 
1us, be technically wrong?  Is code that requires 8us correct?


I think a single ISA bus transaction is 1 µs, so two of them back to 
back should be 2 µs, not 8 µs...


Exactly.  You think it's 2us, but the documentation doesn't say.  The _p 
functions are generic inasmuch as they provide an unspecified delay.  
Drivers which work across platforms, and which use _p, therefore have 
different delays on different platforms.  Should the length of the delay 
be unimportant?  I wouldn't have thought so.  If it is important, does 
that mean that such drivers are buggy on some platforms?




What it specifically does is it generates a delay which is proportional 
to the ISA/LPC clock.


I really *hate* the idea that access to non-present hardware is used to 
generate a delay.  That sucks so badly.  It's worthy of a school-aged 
hacker, not of a world-leading operating system.  It's so not 
best-practice that it's worst-practice.




Perhaps you do, but it's the de facto standard on the platform.  Every 
BIOS uses the same technique, because it works.


*Now*, the real question is how many drivers actually need these delays. 
 My guess is most don't at all.


-hpa



Re: [PATCH -mm 1/2] wait_task_stopped: remove unneeded delay_group_leader check

2007-12-10 Thread Roland McGrath
Your change looks correct to me.


Thanks,
Roland


Re: RFC: outb 0x80 in inb_p, outb_p harmful on some modern AMD64 with MCP51 laptops

2007-12-10 Thread David Newall

H. Peter Anvin wrote:
> David Newall wrote:
>> Where did the 8us delay come from?  The documentation and source are
>> careful not to say how long the delay is.  Would changing it to, say,
>> 1us, be technically wrong?  Is code that requires 8us correct?
>
> I think a single ISA bus transaction is 1 µs, so two of them back to
> back should be 2 µs, not 8 µs...

Exactly.  You think it's 2us, but the documentation doesn't say.  The _p
functions are generic inasmuch as they provide an unspecified delay.
Drivers which work across platforms, and which use _p, therefore have
different delays on different platforms.  Should the length of the delay
be unimportant?  I wouldn't have thought so.  If it is important, does
that mean that such drivers are buggy on some platforms?

I really *hate* the idea that access to non-present hardware is used to
generate a delay.  That sucks so badly.  It's worthy of a school-aged
hacker, not of a world-leading operating system.  It's so not
best-practice that it's worst-practice.




Re: [PATCH -mm 4/2] ptrace_stop: fix racy nonstop_code setting

2007-12-10 Thread Roland McGrath
Your change looks correct to me.


Thanks,
Roland


Re: [PATCH -mm 1/2] ptrace_stop: fix the race with ptrace detach+attach

2007-12-10 Thread Roland McGrath
Your change looks correct to me.


Thanks,
Roland


Re: [PATCH 1/3] will_become_orphaned_pgrp: we have threads

2007-12-10 Thread Eric W. Biederman
Oleg Nesterov <[EMAIL PROTECTED]> writes:

> On 12/09, Eric W. Biederman wrote:
>>
>> Oleg below is my proof of concept patch, which really needs to be
>> broken up into a whole patch series, so the changes are small
>> enough we can do a thorough audit on them.  Anyway take a look
>> and see what you think.
>
> Amazing ;)
>
> This patch certainly needs time to understand; so far I have
> read only a small subset.  A couple of random questions.

Well I think it succeeds as a proof of concept and totally fails
as a production patch at this point.

>>   * pgrp and session fields are deprecated.
>> @@ -1034,8 +1035,9 @@ struct task_struct {
>> struct list_head sibling; /* linkage in my parent's children list */
>>  struct task_struct *group_leader;   /* threadgroup leader */
>>
>> +struct pid *tid;
>>  /* PID/PID hash table linkage. */
>> -struct pid_link pids[PIDTYPE_MAX];
>> +struct hlist_node pids[PIDTYPE_ARRAY_MAX];
>
> OK. It certainly makes sense to move PIDTYPE_PGID/SID pids from task_struct
> to signal struct.
>
> But can't we go a bit further? With this patch pid->tasks[].first still
> "points" to leader's task_struct. Suppose we replace pid->tasks[] with
> pid->signals[], so that pid->signals[].first points to signal_struct. Then
> we can find the task (group_leader) via signal->tgid.
>
> This means we can remove task_struct->pids, and kill transfer_pid().

We need a way to still implement do_each_pid_task, but otherwise that should
work and be a nice cleanup all on its own.

>>  static inline struct pid *task_tgid(struct task_struct *task)
>>  {
>> -return task->group_leader->pids[PIDTYPE_PID].pid;
>> +struct signal_struct *sig = rcu_dereference(task->signal);
>> +struct pid *pid = NULL;
>> +if (sig)
>> +pid = sig->tgid;
>> +return pid;
>>  }
>
> Hmm. This is fixable, but note that task->signal is not RCU protected,
> only ->sighand.

Yes.  I realized that after I had sent the patch out.  We do run
those functions with just rcu protection sometimes so something would
need to be resolved there.

>>  static inline int pid_alive(struct task_struct *p)
>>  {
>> -return p->pids[PIDTYPE_PID].pid != NULL;
>> +return p->signal != NULL;
>>  }
>
> (this change btw is imho good regardless, because pid_alive() currently
>  means "the task is not unhashed yet" anyway).

Yes.

>>  static void __unhash_process(struct task_struct *p)
>>  {
>>  nr_threads--;
>> -detach_pid(p, PIDTYPE_PID);
>>  if (thread_group_leader(p)) {
>>  detach_pid(p, PIDTYPE_PGID);
>>  detach_pid(p, PIDTYPE_SID);
>> @@ -65,6 +64,7 @@ static void __unhash_process(struct task_struct *p)
>>  list_del_rcu(&p->tasks);
>>  __get_cpu_var(process_counts)--;
>>  }
>> +detach_pid(p, PIDTYPE_PID);
>
> Not sure why this change is needed... To prevent the premature
> detach_pid()->free_pid()? But this doesn't look possible: if
> the task is leader, p->tid->tsk == p, and detach_pid() does

This is a bit of a relic of how my patch developed.
I had the "if (task->tid != tsk->signal->tgid)" check in there
and was assuming the thread group id as my pid so I could clean
things up properly.  And it worked out nicer if the detach_pid
was for PIDTYPE_PID came later as I could reuse the same logic
as in de_thread.

>
>   if (pid->tsk)   // still used, don't free.
>   return;
>
>> @@ -946,6 +920,48 @@ fastcall NORET_TYPE void do_exit(long code)
>>  }
>>
>>  tsk->flags |= PF_EXITING;
>> +/* Transfer thread group leadership */
>> +if (thread_group_leader(tsk) && !thread_group_empty(tsk)) {
>
> Ah, this is racy without tasklist_lock. Suppose that the current
> ->group_leader exits right now and elects us as a new leader.

Hmm.  I thought I was redoing that test inside of the lock.
Anyway this hunk probably needs the most work as it is brand new code.

>> +struct task_struct *new_leader, *t;
>> +write_lock_irq(&tasklist_lock);
>> +for (t = next_thread(tsk); t != tsk; t = next_thread(t)) {
>> +if (!(t->flags & PF_EXITING))
>> +break;
>> +}
>> +if (t != tsk) {
>> +new_leader = t;
>> +
>> +new_leader->start_time = tsk->start_time;
>> +task_pid(tsk)->tsk = new_leader;
>
> So this pid won't be freed when current does detach_pid(PIDTYPE_PID), from
> now current->tid->tsk != current, so detach_pid() doesn't clear pid->tsk.
>
> But when it will be freed then?

When new_leader does detach_pid on it.

>> +transfer_pid(tsk, new_leader, PIDTYPE_PGID);
>> +transfer_pid(tsk, new_leader, PIDTYPE_SID);
>> +list_replace_rcu(&tsk->tasks, &new_leader->tasks);
>> +
>> +/* Update group_leader on all of the threads... */
>> +new_leader->group_leader = new_leader;
>> 
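
The leader-election loop in the quoted hunk walks the circular thread list starting after the exiting leader and picks the first thread without PF_EXITING set. A user-space model of just that walk, with a hypothetical `struct fake_task` whose `next` pointer plays the role of next_thread(); the flag value is illustrative. It deliberately leaves out tasklist_lock, which is exactly the part Oleg flags as racy when the check is made outside the lock.

```c
#include <stddef.h>

#define PF_EXITING 0x00000004  /* illustrative; any single bit works here */

/* Hypothetical stand-in for task_struct: only what the loop touches. */
struct fake_task {
	unsigned int flags;
	struct fake_task *next;   /* circular, like next_thread() */
};

/* Return the first live thread after tsk, or NULL if every other
 * thread is already exiting (the "t == tsk" case in the hunk). */
static struct fake_task *pick_new_leader(struct fake_task *tsk)
{
	struct fake_task *t;

	for (t = tsk->next; t != tsk; t = t->next)
		if (!(t->flags & PF_EXITING))
			return t;
	return NULL;
}
```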

Re: Why does reading from /dev/urandom deplete entropy so much?

2007-12-10 Thread Theodore Tso
On Mon, Dec 10, 2007 at 05:35:25PM -0600, Matt Mackall wrote:
> > I must have missed this. Can you please explain again? For a layman it
> > looks like a paranoid application cannot read 500 Bytes from
> > /dev/random without blocking if some other application has previously
> > read 10 Kilobytes from /dev/urandom.
> 
> /dev/urandom always leaves enough entropy in the input pool for
> /dev/random to reseed. Thus, as long as entropy is coming in, it is
> not possible for /dev/urandom readers to starve /dev/random readers.
> But /dev/random readers may still block temporarily and they should
> damn well expect to block if they read 500 bytes out of a 512 byte
> pool.

A paranoid application should only need to read ~500 bytes if it is
generating a long-term RSA private key, and in that case, it would do
well to use a non-blocking read, and if it can't get enough bytes, it
should prompt the user to move the mouse around or bang on the
keyboard.  /dev/random is *not* magic where you can assume that you
will always get an unlimited amount of good randomness.  Applications
that assume this are broken, and it has nothing to do with DOS attacks.

Note that even paranoid applications should not be using /dev/random
for session keys; again, /dev/random isn't magic, and entropy isn't
unlimited. Instead, such an application should pull 16 bytes or so,
and then use it to seed a cryptographic random number generator.

- Ted


Re: [PATCH] ITIMER_REAL: convert to use struct pid

2007-12-10 Thread Roland McGrath
This looks fine to me.

Thanks,
Roland


Re: RFC: outb 0x80 in inb_p, outb_p harmful on some modern AMD64 with MCP51 laptops

2007-12-10 Thread H. Peter Anvin

David Newall wrote:
> Where did the 8us delay come from?  The documentation and source are
> careful not to say how long the delay is.  Would changing it to, say,
> 1us, be technically wrong?  Is code that requires 8us correct?

I think a single ISA bus transaction is 1 µs, so two of them back to
back should be 2 µs, not 8 µs...


-hpa


[PATCH] Updates to nfsroot documentation (take 3)

2007-12-10 Thread Amos Waterland
The difference between ip=off and ip=::off has been a cause of much
confusion.  Document how each behaves, and do not contradict ourselves
by saying that "off" is the default when in fact "any" is the default
and is described as being so later in the file.

Signed-off-by: Amos Waterland <[EMAIL PROTECTED]>

 Documentation/nfsroot.txt |   12 +---
 net/ipv4/ipconfig.c   |   20 +---
 2 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/Documentation/nfsroot.txt b/Documentation/nfsroot.txt
index 16a7cae..0e87890 100644
--- a/Documentation/nfsroot.txt
+++ b/Documentation/nfsroot.txt
@@ -92,8 +92,14 @@ 
ip=::
   autoconfiguration.
 
   The  parameter can appear alone as the value to the `ip'
-  parameter (without all the ':' characters before) in which case auto-
-  configuration is used.
+  parameter (without all the ':' characters before).  If the value is
+  "ip=off" or "ip=none", no autoconfiguration will take place, otherwise
+  autoconfiguration will take place.  The most common way to use this 
+  is "ip=dhcp".
+
+  Note that "ip=off" is not the same thing as "ip=::off", because in 
+  the latter autoconfiguration will take place if any of DHCP, BOOTP or RARP
+  are compiled in the kernel.
 
 IP address of the client.
 
@@ -142,7 +148,7 @@ 
ip=::
into the kernel will be used, regardless of the value of
this option.
 
-  off or none: don't use autoconfiguration (default)
+  off or none: don't use autoconfiguration
  on or any:   use any protocol available in the kernel
  dhcp:use DHCP
  bootp:   use BOOTP
diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index c5c107a..96400b0 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -1396,25 +1396,7 @@ late_initcall(ip_auto_config);
 
 /*
  *  Decode any IP configuration options in the "ip=" or "nfsaddrs=" kernel
- *  command line parameter. It consists of option fields separated by colons in
- *  the following order:
- *
- *  ::
- *
- *  Any of the fields can be empty which means to use a default value:
- *  - address given by BOOTP or RARP
- *  - address of host returning BOOTP or RARP packet
- *  - none, or the address returned by BOOTP
- *- automatically determined from , or the
- *   one returned by BOOTP
- *  -  in ASCII notation, or the name returned
- *   by BOOTP
- * - use all available devices
- * :
- *off|none - don't do autoconfig at all (DEFAULT)
- *on|any   - use any configured protocol
- *dhcp|bootp|rarp  - use only the specified protocol
- *both - use both BOOTP and RARP (not DHCP)
+ *  command line parameter.  See Documentation/nfsroot.txt.
  */
 static int __init ic_proto_name(char *name)
 {
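
The rule the new nfsroot.txt text states for a bare `ip=` value fits in a small predicate. This is a hypothetical helper that models only the documented wording ("off" and "none" disable autoconfiguration, any other bare value enables it); it is not the parsing code in net/ipv4/ipconfig.c, and it does not cover the last field of the colon-separated form (ip=::off), which behaves differently as the patch explains.

```c
#include <string.h>

/* Hypothetical: nonzero if a bare "ip=<value>" asks for autoconfiguration,
 * following the wording added to Documentation/nfsroot.txt. */
static int bare_ip_value_autoconfigures(const char *value)
{
	return strcmp(value, "off") != 0 && strcmp(value, "none") != 0;
}
```

So ip=dhcp autoconfigures, while ip=off and ip=none do not.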


Re: [PATCH -mm 3/3] ptrace_check_attach: remove unneeded ->signal != NULL check

2007-12-10 Thread Roland McGrath
This looks fine to me.

Thanks,
Roland


Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

2007-12-10 Thread Eric W. Biederman
> Sorry to reply to myself, but do we have consensus on this patch?  I'd like to
> figure out its disposition if possible.  

What the patch tries to do looks like the right thing.  So if we can get
a version that is clean and actually works we should merge it.

Eric


Re: [PATCH -mm 2/3] kill my_ptrace_child()

2007-12-10 Thread Roland McGrath
This looks fine to me.

Thanks,
Roland


Re: [PATCH -mm 1/3] kill PT_ATTACHED

2007-12-10 Thread Roland McGrath
Starting to catch up on some old patch review today.  This one has my ACK.


Thanks,
Roland


Re: [PATCH 2/3] Fix use of skb after netif_rx

2007-12-10 Thread David Miller
From: Julia Lawall <[EMAIL PROTECTED]>
Date: Sun, 9 Dec 2007 21:03:55 +0100 (CET)

> From: Julia Lawall <[EMAIL PROTECTED]>
> 
> Recently, Wang Chen submitted a patch
> (d30f53aeb31d453a5230f526bea592af07944564) to move a call to netif_rx(skb)
> after a subsequent reference to skb, because netif_rx may call kfree_skb on
> its argument.  The same problem occurs in some other drivers as well.
> 
> This was found using the following semantic match.
> (http://www.emn.fr/x-info/coccinelle/)
 ...
> Signed-off-by: Julia Lawall <[EMAIL PROTECTED]>

Also applied, thanks.
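
The bug class behind these three patches is a use-after-free: once the skb has been passed to netif_rx(), which may call kfree_skb() on it, the driver can no longer read skb->len or any other field. A user-space sketch of the safe ordering, with a hypothetical `struct fake_skb` and a `fake_netif_rx()` that always frees its argument (the worst case the caller must assume):

```c
#include <stdlib.h>

/* Hypothetical stand-ins for sk_buff and the driver's stats counters. */
struct fake_skb { unsigned int len; };
struct fake_stats { unsigned long rx_packets, rx_bytes; };

/* Models the worst case of netif_rx(): the buffer is consumed and freed. */
static void fake_netif_rx(struct fake_skb *skb)
{
	free(skb);
}

static void rx_one(struct fake_stats *stats, struct fake_skb *skb)
{
	/* Update everything that reads skb BEFORE handing it off ... */
	stats->rx_packets++;
	stats->rx_bytes += skb->len;
	/* ... and never touch skb after this call. */
	fake_netif_rx(skb);
}
```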


Re: [PATCH 3/3] Fix use of skb after netif_rx

2007-12-10 Thread David Miller
From: Julia Lawall <[EMAIL PROTECTED]>
Date: Sun, 9 Dec 2007 21:05:30 +0100 (CET)

> From: Julia Lawall <[EMAIL PROTECTED]>
> 
> Recently, Wang Chen submitted a patch
> (d30f53aeb31d453a5230f526bea592af07944564) to move a call to netif_rx(skb)
> after a subsequent reference to skb, because netif_rx may call kfree_skb on
> its argument.  netif_rx_ni calls netif_rx, so the same problem occurs in
> the files below.
> 
> I have left the updating of dev->last_rx after the calls to netif_rx_ni
> because it seems time dependent, but moved the other field updates before.
> 
> This was found using the following semantic match.
> (http://www.emn.fr/x-info/coccinelle/)
 ...
> Signed-off-by: Julia Lawall <[EMAIL PROTECTED]>

Applied.


Re: [PATCH 1/3] Fix use of skb after netif_rx

2007-12-10 Thread David Miller
From: Julia Lawall <[EMAIL PROTECTED]>
Date: Sun, 9 Dec 2007 21:02:31 +0100 (CET)

> From: Julia Lawall <[EMAIL PROTECTED]>
> 
> Recently, Wang Chen submitted a patch
> (d30f53aeb31d453a5230f526bea592af07944564) to move a call to netif_rx(skb)
> after a subsequent reference to skb, because netif_rx may call kfree_skb on
> its argument.  The same problem occurs in some other drivers as well.
> 
> This was found using the following semantic match.
> (http://www.emn.fr/x-info/coccinelle/)
 ...
> Signed-off-by: Julia Lawall <[EMAIL PROTECTED]>

Patch applied, thanks.


Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

2007-12-10 Thread Eric W. Biederman
Neil Horman <[EMAIL PROTECTED]> writes:

> On Fri, Dec 07, 2007 at 09:21:44AM -0500, Neil Horman wrote:
>> On Fri, Dec 07, 2007 at 01:22:04AM -0800, Yinghai Lu wrote:
>> > On Dec 7, 2007 12:50 AM, Yinghai Lu <[EMAIL PROTECTED]> wrote:
>> > >
>> > > On Dec 6, 2007 4:33 PM, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
>> > ...
>> > > >
>> > > > My feel is that if it is for legacy interrupts only it should not be
>> > > > a problem.
>> > > > Let's investigate and see if we can unconditionally enable this quirk
>> > > > for all opteron systems.
>> > >
>> > > i checked that bit
>> > >
>> > >
>> > > http://www.openbios.org/viewvc/trunk/LinuxBIOSv2/src/northbridge/amd/amdk8/coherent_ht.c?revision=2596=markup
>> > 
>> > it should be bit 18 (HTTC_APIC_EXT_ID)
>> > 
>> > 
>> > YH
>> 
>> this seems reasonable, I can reroll the patch for this.  As I think about it
>> I'm also going to update the patch to make this check occur for any pci
>> class 0600 device from vendor AMD, since it's possible that more than just
>> nvidia chipsets can be affected.
>> 
>> I'll repost as soon as I've tested, thanks!
>> Neil
>
>
> Ok, new patch attached.  It performs the same function as previously
> described, but is more restricted in its application.  As Yinghai pointed
> out, the broadcast mask bit (bit 17 in the htcfg register) should only be
> enabled if the extended apic id bit (bit 18 in the same register) is also
> set.  So this patch now checks for that bit to be turned on first.  Also,
> this patch now adds an independent quirk check for all AMD hypertransport
> host controllers, since it's possible for this misconfiguration to be
> present in systems other than nvidias.  The net effect of these changes is
> that it's now applicable to all AMD systems containing hypertransport
> busses, and is only activated if extended apic ids are in use, meaning that
> this quirk guarantees that all processors in a system are eligible to
> receive interrupts from the ioapic, even if their apicid extends beyond
> the nominal 4 bit limitation.  Tested successfully by me.
>
> Thanks & Regards
> Neil
>
> Signed-off-by: Neil Horman <[EMAIL PROTECTED]>
>
>
>  early-quirks.c | 83 -
>  1 file changed, 76 insertions(+), 7 deletions(-)
>
>
>
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 88bb83e..d5a7b30 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
>  #endif /* CONFIG_X86_IO_APIC */
>  #endif /* CONFIG_ACPI */
>  
> +static void __init fix_hypertransport_config(int num, int slot, int func)
> +{
> + u32 htcfg;
> + /*
> +  * We found a hypertransport bus; make sure that we are
> +  * broadcasting interrupts to all cpus on the ht bus
> +  * if we're using extended apic ids
> +  */
> + htcfg = read_pci_config(num, slot, func, 0x68);
> + if ((htcfg & (1 << 18)) == 1) { 

Ok.  This test is broken.  Please remove the == 1.  You are looking
for == (1 << 18).  So just saying: "if (htcfg & (1 << 18))" should be clearer.
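
To see why `== 1` never fires: `htcfg & (1 << 18)` evaluates to either 0 or 0x40000, never 1. A user-space check of both forms plus the bit-17 fixup, using a plain value in place of the config-space read (read_pci_config() is not available outside the kernel); the macro and function names are hypothetical.

```c
#include <stdint.h>

#define HT_EXT_APIC_ID_BIT    (1u << 18)  /* bit 18, per Yinghai above */
#define HT_EXT_APIC_BCAST_BIT (1u << 17)  /* bit 17, broadcast mask */

/* As posted: compares the masked value with 1, so it is always false. */
static int ext_id_enabled_buggy(uint32_t htcfg)
{
	return (htcfg & (1 << 18)) == 1;
}

/* As suggested: just test the bit. */
static int ext_id_enabled(uint32_t htcfg)
{
	return (htcfg & HT_EXT_APIC_ID_BIT) != 0;
}

/* The rest of the quirk: set bit 17 only when bit 18 is set. */
static uint32_t fixup_htcfg(uint32_t htcfg)
{
	if (ext_id_enabled(htcfg) && !(htcfg & HT_EXT_APIC_BCAST_BIT))
		htcfg |= HT_EXT_APIC_BCAST_BIT;
	return htcfg;
}
```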

> + printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
> + if ((htcfg & (1 << 17)) == 0) {
> + printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
> + htcfg |= (1 << 17);
> + write_pci_config(num, slot, func, 0x68, htcfg);
> + }
> + }
> + 
> +}

The rest of this quirk looks fine, including the fact that it is only intended
to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.


For what is below, I don't like the way the infrastructure has been
extended, as what you are doing quickly devolves into a big mess.

Please extend struct chipset to be something like:
struct chipset {
u16 vendor;
u16 device;
u32 class, class_mask;
void (*f)(void);
};

And then the test for matching the chipset can be something like:
if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
(id->device == PCI_ANY_ID || id->device == dev->device) &&
!((id->class ^ dev->class) & id->class_mask))

Essentially a subset of pci_match_one_device from drivers/pci/pci.h

That way you don't need to increase the number of tables or the
number of passes through the pci busses, just update the early_qrk
table with a few more bits of information.

The extended form should be much more maintainable in the long
run.  Given that we may want this before we enable the timer, which
is very early, doing this in the pci early quirks seems to make sense.
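
A user-space sketch of the table-driven matching described above, under a few stated assumptions: `struct fake_dev` stands in for the config-space reads, the callback takes (num, slot, func) so a quirk like fix_hypertransport_config() could be plugged in, and the vendor/device IDs (AMD 0x1022 with K8 northbridge 0x1100, NVIDIA 0x10de) are for illustration. One caveat when filling in the struct: with u16 vendor/device fields, a comparison against PCI_ANY_ID, which the kernel defines as (~0), can never be true, because the u16 value is promoted to int 0xffff while ~0 is -1; the sketch therefore uses 32-bit fields.

```c
#include <stdint.h>

#define PCI_ANY_ID (~0)

struct chipset {
	uint32_t vendor;       /* 32-bit so PCI_ANY_ID compares equal */
	uint32_t device;
	uint32_t class, class_mask;
	void (*f)(int num, int slot, int func);
};

/* Hypothetical stand-in for values read from config space. */
struct fake_dev { uint32_t vendor, device, class; };

static int chipset_match(const struct chipset *id, const struct fake_dev *dev)
{
	return (id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
	       (id->device == PCI_ANY_ID || id->device == dev->device) &&
	       !((id->class ^ dev->class) & id->class_mask);
}

static int calls;
static void count_quirk(int num, int slot, int func)
{
	(void)num; (void)slot; (void)func;
	calls++;
}

static const struct chipset early_qrk[] = {
	{ 0x1022, 0x1100, 0x0600, 0xffff, count_quirk }, /* AMD K8 NB, host bridge class */
	{ 0x10de, PCI_ANY_ID, 0, 0, count_quirk },       /* any NVIDIA device, any class */
	{ 0 }
};

/* One pass over the table per device; every matching entry is applied. */
static void check_dev(int num, int slot, int func, const struct fake_dev *dev)
{
	for (const struct chipset *id = early_qrk; id->f; id++)
		if (chipset_match(id, dev))
			id->f(num, slot, func);
}
```

Adding a new quirk then means adding one table row rather than another table and another bus scan.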

Eric


Re: RFC: outb 0x80 in inb_p, outb_p harmful on some modern AMD64 with MCP51 laptops

2007-12-10 Thread David Newall
Where did the 8us delay come from?  The documentation and source are
careful not to say how long the delay is.  Would changing it to, say,
1us, be technically wrong?  Is code that requires 8us correct?



Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-10 Thread Arjan van de Ven
On Tue, 11 Dec 2007 01:01:25 +0100
Guillaume Chazarain <[EMAIL PROTECTED]> wrote:

> Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> 
> > the frequency of both cores is the maximum of what linux sets each
> > core to;
> 
> Do you mean that the cpufreq code can be confused about the actual
> frequency of the cores? 

it means that cpufreq doesn't know the actual frequency (although the bios
sometimes tells us about the relationship, often the bios just lies through
its teeth); it only knows what it asks for, not what it gets. We know it'll
get at least what it asks for, but basically it can get more than it asks for.

>That sounds like a big problem.

it'll get way worse going forward.
(but even on today's systems, the tsc no longer represents frequency, but is
some fixed clock totally unrelated to cpu frequency)
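
A toy model of the situation described above, assuming (as stated) that cores sharing one clock run at the maximum of the per-core requests: cpufreq records only what it asked for on each core, so the frequency it believes a core runs at can be lower than what that core actually gets. The function name and kHz values are made up for illustration.

```c
#include <stddef.h>

/* Cores sharing a clock run at the highest frequency any of them requested. */
static unsigned int shared_clock_khz(const unsigned int *requested_khz, size_t ncores)
{
	unsigned int f = 0;

	for (size_t i = 0; i < ncores; i++)
		if (requested_khz[i] > f)
			f = requested_khz[i];
	return f;
}
```

With requests of 800000 and 2000000 kHz, core 0 "asked for" 800 MHz but runs at 2 GHz: at least what it asked for, possibly more.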

-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


Re: help

2007-12-10 Thread David Newall

Thanos Chatziathanassiou wrote:

help


I KNOW OF PLACES, ACTIONS, AND THINGS. MOST OF MY VOCABULARY
DESCRIBES PLACES AND IS USED TO MOVE YOU THERE. TO MOVE TRY
WORDS LIKE FOREST, BUILDING, DOWNSTREAM, ENTER, EAST, WEST
NORTH, SOUTH, UP, OR DOWN.  I KNOW ABOUT A FEW SPECIAL OBJECTS,
LIKE A BLACK ROD HIDDEN IN THE CAVE. THESE OBJECTS CAN BE
MANIPULATED USING ONE OF THE ACTION WORDS THAT I KNOW. USUALLY
YOU WILL NEED TO GIVE BOTH THE OBJECT AND ACTION WORDS
(IN EITHER ORDER), BUT SOMETIMES I CAN INFER THE OBJECT FROM
THE VERB ALONE. THE OBJECTS HAVE SIDE EFFECTS - FOR
INSTANCE, THE ROD SCARES THE BIRD.
USUALLY PEOPLE HAVING TROUBLE MOVING JUST NEED TO TRY A FEW
MORE WORDS. USUALLY PEOPLE TRYING TO MANIPULATE AN
OBJECT ARE ATTEMPTING SOMETHING BEYOND THEIR (OR MY!)
CAPABILITIES AND SHOULD TRY A COMPLETELY DIFFERENT TACK.
TO SPEED THE GAME YOU CAN SOMETIMES MOVE LONG DISTANCES
WITH A SINGLE WORD. FOR EXAMPLE, 'BUILDING' USUALLY GETS
YOU TO THE BUILDING FROM ANYWHERE ABOVE GROUND EXCEPT WHEN
LOST IN THE FOREST. ALSO, NOTE THAT CAVE PASSAGES TURN A
LOT, AND THAT LEAVING A ROOM TO THE NORTH DOES NOT GUARANTEE
ENTERING THE NEXT FROM THE SOUTH. GOOD LUCK!


Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump

2007-12-10 Thread Huang, Ying
On Mon, 2007-12-10 at 17:31 -0500, Vivek Goyal wrote:
> [..]
> >  
> > -#define KEXEC_ON_CRASH  0x0001
> > -#define KEXEC_ARCH_MASK 0x
> > +#define KEXEC_ON_CRASH 0x0001
> > +#define KEXEC_PRESERVE_CPU 0x0002
> > +#define KEXEC_PRESERVE_CPU_EXT 0x0004
> > +#define KEXEC_SINGLE_CPU   0x0008
> > +#define KEXEC_PRESERVE_DEVICE  0x0010
> > +#define KEXEC_PRESERVE_CONSOLE 0x0020
> 
> Hi,
> 
> Why do we need so many different flags for preserving different types
> of state (CPU, CPU_EXT, Device, console)? To keep things simple,
> can't we create just one flag KEXEC_PRESERVE_CONTEXT, which will
> indicate any special action required for preserving the previous kernel's
> context so that one can switch back to the old kernel?

Yes. There are too many flags, especially when we have no users of these
flags now. It is better to use one flag such as KEXEC_PRESERVE_CONTEXT
now, and create the other required flags when really needed.
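
With a single KEXEC_PRESERVE_CONTEXT bit, validating sys_kexec_load() flags stays simple: everything outside the architecture field must be a known flag. A sketch under loudly stated assumptions: KEXEC_PRESERVE_CONTEXT takes the 0x0002 bit that KEXEC_PRESERVE_CPU used in this patch, KEXEC_ARCH_MASK is taken as 0xffff0000 (its value is cut off in the quoted hunk), and `kexec_flags_valid()` is a hypothetical helper, not the kernel's check.

```c
#include <errno.h>

#define KEXEC_ON_CRASH          0x00000001
#define KEXEC_PRESERVE_CONTEXT  0x00000002   /* assumed value */
#define KEXEC_ARCH_MASK         0xffff0000u  /* assumed value */
#define KEXEC_KNOWN_FLAGS       (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)

/* Hypothetical check: 0 if every non-arch bit is a known flag, -EINVAL otherwise. */
static int kexec_flags_valid(unsigned long flags)
{
	if (flags & ~(unsigned long)(KEXEC_ARCH_MASK | KEXEC_KNOWN_FLAGS))
		return -EINVAL;
	return 0;
}
```

Adding a finer-grained flag later would only mean extending KEXEC_KNOWN_FLAGS.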

Best Regards,
Huang Ying


Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump

2007-12-10 Thread Huang, Ying
On Mon, 2007-12-10 at 14:55 -0500, Vivek Goyal wrote:
> On Fri, Dec 07, 2007 at 03:53:30PM +, Huang, Ying wrote:
> > This patch implements the functionality of jumping between the kexeced
> > kernel and the original kernel.
> > 
> 
> Hi,
> 
> I am just going through your patches and trying to understand them. I don't
> understand many things. Asking is easy so here you go...
> 
> > To support jumping between two kernels, before jumping to (executing)
> > the new kernel and before jumping back to the original kernel, the
> > devices are put into a quiescent state, and the state of devices and
> > CPU is saved. After jumping back from the kexeced kernel and after
> > jumping to the new kernel, the state of devices and CPU is restored
> > accordingly. The devices/CPU state save/restore code of software
> > suspend is called to implement the corresponding function.
> > 
> 
> Do I need jumping back to restore an already hibernated kernel image? Can
> you please tell a little more about jumping back and why it is needed?

Now, the jumping back is used to implement "kexec based hibernation",
which uses kexec/kdump to save the memory image of the hibernated kernel
during hibernation, and uses /dev/oldmem to restore the memory image of
the hibernated kernel and jump back to it to continue running.

Other usage models may include:

- Dump the system memory image and then continue to run, that is, take a
memory snapshot of the system while it is running.
- Cooperative multitasking between different OSes. You can load another
OS (B) from the current OS (A), and jump between the two OSes as needed.
- Call some code (such as firmware) in physical mode.

> > To support jumping without reserving memory, one shadow backup page
> > (source page) is allocated for each page used by the new (kexeced) kernel
> > (destination page). When doing kexec_load, the image of the new kernel is
> > loaded into source pages, and before executing, the destination pages
> > and the source pages are swapped, so the contents of destination pages
> > are backed up. Before jumping to the new (kexeced) kernel and after
> > jumping back to the original kernel, the destination pages and the
> > source pages are swapped too.
> > 
> 
> Ok, so due to swapping of source and destination pages, the first kernel's
> data is still preserved.  How do I get the dynamic memory required for the
> second kernel's boot (without overwriting the first kernel's data)?

All dynamic memory required for the second kernel should be "loaded" by
sys_kexec_load in the first kernel. For example, not only should the Linux
kernel be loaded at 1M, but the memory 0~16M (excluding the kernel) should
also be "loaded" (all zero) by /sbin/kexec via sys_kexec_load.

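The backup scheme in the quoted description (swap each source page with its destination page before the jump, and swap again on the way back) can be modelled in user space with three buffers. `swap_page()` and the 4096-byte page size are assumptions for illustration; this models only the data movement, not how the patch itself performs it.

```c
#include <string.h>

#define PAGE_SZ 4096   /* assumed page size */

/* Exchange the contents of two pages through a bounce buffer. */
static void swap_page(unsigned char *dst, unsigned char *src, unsigned char *tmp)
{
	memcpy(tmp, dst, PAGE_SZ);  /* tmp = dst */
	memcpy(dst, src, PAGE_SZ);  /* dst = src */
	memcpy(src, tmp, PAGE_SZ);  /* src = old dst */
}
```

After the first swap the destination page holds the new kernel's data and the source page holds the old kernel's data; the second swap, on jumping back, restores both, which is why no memory has to be reserved in advance.
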
> > A jump back protocol for kexec is defined and documented. It is an
> > extension to the ordinary function calling protocol. So, the facility
> > provided by this patch can be used to call an ordinary C function in
> > real mode.
> > 
> > A set of flags for sys_kexec_load is added to control which states are
> > saved/restored before/after real mode code executes. For example, you
> > can specify that the device state and FPU state are saved/restored
> > before/after real mode code executes.
> > 
> > The state save/restore code (excluding CPU state) can be overridden
> > based on the "command" parameter of kexec jump, because more states
> > need to be saved/restored by hibernation/resume.
> > 
> > Signed-off-by: Huang Ying <[EMAIL PROTECTED]>
> > 
> > ---
> >  Documentation/i386/jump_back_protocol.txt |  103 ++
> >  arch/powerpc/kernel/machine_kexec.c   |2 
> >  arch/ppc/kernel/machine_kexec.c   |2 
> >  arch/sh/kernel/machine_kexec.c|2 
> >  arch/x86/kernel/machine_kexec_32.c|   88 +---
> >  arch/x86/kernel/machine_kexec_64.c|2 
> >  arch/x86/kernel/relocate_kernel_32.S  |  214 +++---
> >  include/asm-x86/kexec_32.h|   39 -
> >  include/linux/kexec.h |   40 +
> >  kernel/kexec.c|  188 ++
> >  kernel/power/Kconfig  |2 
> >  kernel/sys.c  |   35 +++-
> >  12 files changed, 648 insertions(+), 69 deletions(-)
> > 
> > --- a/arch/x86/kernel/machine_kexec_32.c
> > +++ b/arch/x86/kernel/machine_kexec_32.c
> > @@ -20,6 +20,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
> >  static u32 kexec_pgd[1024] PAGE_ALIGNED;
> > @@ -83,10 +84,14 @@ static void load_segments(void)
> >   * reboot code buffer to allow us to avoid allocations
> >   * later.
> >   *
> > - * Currently nothing.
> > + * Turn off NX bit for control page.
> >   */
> >  int machine_kexec_prepare(struct kimage *image)
> >  {
> > +   if (nx_enabled) {
> > +   change_page_attr(image->control_code_page, 1, PAGE_KERNEL_EXEC);
> > +   global_flush_tlb();
> > +   }
> > return 0;
> >  }

Re: [RFC,PATCH 3/3] do_wait: fix waiting for stopped group with dead leader

2007-12-10 Thread Eric W. Biederman
Oleg Nesterov <[EMAIL PROTECTED]> writes:

> do_wait(WSTOPPED) assumes that p->state must be == TASK_STOPPED, this is not
> true if the leader is already dead. Check SIGNAL_STOP_STOPPED instead and use
> ->signal->group_exit_code.
>
> This patch is not complete if not buggy. At the very minimum it needs cleanup.

Thinking about this set of problems.  Testing SIGNAL_STOP_STOPPED
seems more correct than testing TASK_STOPPED.  It ensures we don't
have a race, and except for ptrace the only way to stop a task
triggers SIGNAL_STOP_STOPPED.

We need a similar flag for thread group exit, to mark when every task
in the thread group has exited.

With those in place we can have race free tests of our status.
/proc//status needs to be updated to use those per-signal-struct
status bits as well.

As for the exit_code, we set tsk->exit_code = sig->group_exit_code
so that doesn't seem to be a problem either.

So to get a task group status looking at bits on the signal struct
looks like the right approach, as this ensures we can avoid races in
setting the status, and we don't need to test a dozen other fields.

There is still some value in my other approach, but even it will
have small races if we continue to look at per-task status bits when
what we want is a per-thread-group status.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] i386: fix a few paravirt-related modpost warnings

2007-12-10 Thread Jeremy Fitzhardinge
Jan Beulich wrote:
> Signed-off-by: Jan Beulich <[EMAIL PROTECTED]>
>   
Acked-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>

>  arch/x86/kernel/head_32.S |2 +-
>  arch/x86/xen/setup.c  |2 +-
>  arch/x86/xen/xen-head.S   |2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
>
> --- linux-2.6.24-rc4/arch/x86/kernel/head_32.S  2007-12-07 09:00:59.0 +0100
> +++ 2.6.24-rc4-i386-lguest-warning/arch/x86/kernel/head_32.S  2007-12-05 18:30:33.0 +0100
> @@ -151,7 +151,7 @@ WEAK(xen_entry)
>   /* Unknown implementation; there's really
>  nothing we can do at this point. */
>   ud2a
> -.data
> +.section .init.data, "aw"
>  subarch_entries:
>   .long default_entry /* normal x86/PC */
>   .long lguest_entry  /* lguest hypervisor */
> --- linux-2.6.24-rc4/arch/x86/xen/setup.c 2007-12-07 09:01:00.0 +0100
> +++ 2.6.24-rc4-i386-lguest-warning/arch/x86/xen/setup.c   2007-12-10 17:31:06.0 +0100
> @@ -59,7 +59,7 @@ static void xen_idle(void)
>  /*
>   * Set the bit indicating "nosegneg" library variants should be used.
>   */
> -static void fiddle_vdso(void)
> +static __init void fiddle_vdso(void)
>  {
>   extern u32 VDSO_NOTE_MASK; /* See ../kernel/vsyscall-note.S.  */
>   extern char vsyscall_int80_start;
> --- linux-2.6.24-rc4/arch/x86/xen/xen-head.S  2007-12-07 09:01:00.0 +0100
> +++ 2.6.24-rc4-i386-lguest-warning/arch/x86/xen/xen-head.S  2007-12-10 17:25:46.0 +0100
> @@ -7,7 +7,7 @@
>  #include 
>  #include 
>  
> -.pushsection .init.text
> +.pushsection .init.text, "ax"
>  ENTRY(startup_xen)
>   movl %esi,xen_start_info
>   cld
>
>
>
>   



Re: RFC: outb 0x80 in inb_p, outb_p harmful on some modern AMD64 with MCP51 laptops

2007-12-10 Thread H. Peter Anvin

Rene Herman wrote:


By the way, David, it would be interesting if you could test 0xed. If 
your problem is some piece of hardware getting upset at LPC bus aborts 
it's not going to matter and we'd know an outb delay is just not an 
option on your system at least. You said you could quickly reproduce the 
problem with port 0x80?




I tried 0xED for a few versions (1.31-1.37) of SYSLINUX.  It broke on a 
lot of hardware (Phoenix BIOS uses 0xED by default, but BIOSes don't 
have to work on arbitrary hardware.)


-hpa


Re: Please revert: PCI: fix IDE legacy mode resources

2007-12-10 Thread Benjamin Herrenschmidt

> The GT-64111 system controller doesn't provide any kind of mapping
> functionality that would help here.  So legacy port addressing can only
> work by exploiting aliases due to incomplete decoding of legacy ioport
> addresses by the VT82C586 - but direct addressing is impossible.

Ok, that explains how the "fix" that we reverted worked. It caused crap
to be added to the top bits of the address :-)

So here, what you really want to do is not a call to
pcibios_resource_to_bus(), but you actually want to use a different bus
address in the first place, that you know the HW will decode the same
way.

The best way to achieve that, imho, is to do a header quirk that is run
just after the generic probe code, which offsets the fixed legacy
resources by 0x1000 since that's really the bus address you are
going to emit.

Later on, your pcibios_fixup code should take that into account and
remove 0x1000 from all IO resources, since your 0xd000 mapping already
maps 0x1000, as you probably already do.

The trick is, you don't want to convert a "resource" into a "bus
address" here, but really issue a different bus address.
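The two-step address arithmetic described above can be sketched as follows. This is a model only: struct io_resource and the helper names are hypothetical, and so is the assumption that the arch fixup subtracts 0x1000 from every I/O resource; real code would use struct resource and register the first step with DECLARE_PCI_FIXUP_HEADER().

```c
#include <assert.h>
#include <stdbool.h>

/* Offset described for this board (assumed). */
#define LEGACY_BUS_OFFSET 0x1000UL

/* Simplified stand-in for struct resource (I/O space only). */
struct io_resource {
	unsigned long start;
	unsigned long end;
};

/* Fixed legacy IDE ranges the generic probe assigns to a legacy channel. */
static bool is_legacy_ide(const struct io_resource *r)
{
	return (r->start == 0x1f0 && r->end == 0x1f7) ||
	       (r->start == 0x3f6 && r->end == 0x3f6) ||
	       (r->start == 0x170 && r->end == 0x177) ||
	       (r->start == 0x376 && r->end == 0x376);
}

/* Step 1, header quirk run after the generic probe: store the bus address
 * that will actually be emitted (one the VT82C586 aliases to the legacy
 * port) instead of converting a resource later. */
static void quirk_legacy_bus_address(struct io_resource *r)
{
	if (is_legacy_ide(r)) {
		r->start += LEGACY_BUS_OFFSET;
		r->end += LEGACY_BUS_OFFSET;
	}
}

/* Step 2, arch fixup: remove the offset from every I/O resource, since the
 * board's existing I/O mapping already accounts for it (assumed here to be
 * applied uniformly to all I/O resources). */
static void fixup_io_resource(struct io_resource *r)
{
	assert(r->start >= LEGACY_BUS_OFFSET);
	r->start -= LEGACY_BUS_OFFSET;
	r->end -= LEGACY_BUS_OFFSET;
}
```

A legacy channel at 0x1f0 is emitted on the bus as 0x11f0 and ends up back at 0x1f0 after the uniform fixup, while ordinary BARs pass through the quirk untouched.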

Cheers,
Ben.




Re: Iomega ZIP-100 drive unsupported with jmicron JMB361 chip?

2007-12-10 Thread Robert Hancock

(linux-ide cc'ed)

trash can wrote:


I have tolerated this problem for a year and do not post to this list in
haste. I have posted on forums and searched the community over the past
year. I have looked at the list archive on gossamer-threads.com for
solutions. With Fedora Core 6 unsupported (the last kernel for which my
zip drive worked), it is time for my last attempt at a solution. Please
CC: any response as I have not joined the list. I have compiled a
kernel-debug RPM and can run this if its output would help. Thank you
for any time you might devote to this problem.

motherboard: MSI P965 Platinum/Intel P965 Express Chipset Based (MS-7238
series)
Fedora 8 : kernel 2.6.23.1-42.fc8
Iomega Zip drive internal Model Z100ATAPI

lspci
03:00.0 SATA controller: JMicron Technologies, Inc. JMB361 AHCI/IDE (rev 02)
03:00.1 IDE interface: JMicron Technologies, Inc. JMB361 AHCI/IDE (rev 02)

# lsmod | grep ata
pata_jmicron8257  0
ata_generic 8901  0
ata_piix   16709  0
libata 99633  4 ahci,pata_jmicron,ata_generic,ata_piix
scsi_mod  119757  4 sr_mod,sg,libata,sd_mod

I have recently changed the BIOS setting for the SATA#1 Controller from
[IDE] to [AHCI] with no effect. I assume AHCI is correct?


AHCI is better, yes. It shouldn't be relevant to this problem though.



Text below attached as text.txt for readability.
from dmesg:
libata version 2.21 loaded.
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: [EMAIL PROTECTED]
PCI: Enabling device :03:00.1 ( -> 0001)
ACPI: PCI Interrupt :03:00.1[B] -> GSI 17 (level, low) -> IRQ 17
PCI: Setting latency timer of device :03:00.1 to 64
scsi0 : pata_jmicron
scsi1 : pata_jmicron
ata1: PATA max UDMA/100 cmd 0x0001cc00 ctl 0x0001c882 bmdma 0x0001c400 irq 17
ata2: PATA max UDMA/100 cmd 0x0001c800 ctl 0x0001c482 bmdma 0x0001c408 irq 17
ata1.00: ATAPI: LITE-ON DVDRW SOHW-1693S, KS0B, max UDMA/66
ata1.01: ATAPI: IOMEGA  ZIP 100   ATAPI, 05.H, max MWDMA1, CDB intr
ata1.00: configured for UDMA/66
ata1.01: configured for MWDMA1
scsi 0:0:0:0: CD-ROMLITE-ON  DVDRW SOHW-1693S KS0B PQ: 0 ANSI: 5
scsi 0:0:1:0: Direct-Access IOMEGA   ZIP 100  05.H PQ: 0 ANSI: 5
sd 0:0:1:0: [sda] 196608 512-byte hardware sectors (101 MB)
sd 0:0:1:0: [sda] Write Protect is off
sd 0:0:1:0: [sda] Mode Sense: 00 40 00 00
sd 0:0:1:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:1:0: [sda] 196608 512-byte hardware sectors (101 MB)
sd 0:0:1:0: [sda] Write Protect is off
sd 0:0:1:0: [sda] Mode Sense: 00 40 00 00
sd 0:0:1:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda:<6>sd 0:0:1:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 0:0:1:0: [sda] Sense Key : Hardware Error [current]
sd 0:0:1:0: [sda] Add. Sense: Scsi parity error
end_request: I/O error, dev sda, sector 0
Buffer I/O error on device sda, logical block 0

If a disk is inserted into the drive (/var/log/messages)
Dec 10 14:22:53 localhost kernel: sd 0:0:1:0: [sda] Spinning up disk.<5>sd 0:0:1:0: [sda] Spinning up diskready
Dec 10 14:22:53 localhost kernel: sd 0:0:1:0: [sda] 196608 512-byte hardware sectors (101 MB)
Dec 10 14:22:53 localhost kernel: sd 0:0:1:0: [sda] Write Protect is off
Dec 10 14:22:53 localhost kernel: sd 0:0:1:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 10 14:22:53 localhost kernel: sd 0:0:1:0: [sda] 196608 512-byte hardware sectors (101 MB)
Dec 10 14:22:53 localhost kernel: sd 0:0:1:0: [sda] Write Protect is off
Dec 10 14:22:53 localhost kernel: sd 0:0:1:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 10 14:22:53 localhost kernel:  sda:<6>sd 0:0:1:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Dec 10 14:22:53 localhost kernel: sd 0:0:1:0: [sda] Sense Key : Hardware Error [current]
Dec 10 14:22:53 localhost kernel: sd 0:0:1:0: [sda] Add. Sense: Scsi parity error
Dec 10 14:22:53 localhost kernel: end_request: I/O error, dev sda, sector 0
Dec 10 14:22:53 localhost kernel: printk: 42 messages suppressed.
Dec 10 14:22:53 localhost kernel: Buffer I/O error on device sda, logical block 0


That is rather curious. There's no sign of any libata error handling
going on. Maybe the drive is actually returning that error code in the
ATAPI CDB, or at least we think it is?


You are sure that this drive still works with older kernels using 
drivers/ide, and that the hardware didn't break at some point, I assume?


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: [PATCH 2/3] arch/ : Platform changes for UCC TDM driver for MPC8323ERDB.Also includes related QE changes.

2007-12-10 Thread Stephen Rothwell
On Mon, 10 Dec 2007 17:39:22 +0530 (IST) Poonam_Aggrwal-b10812 <[EMAIL PROTECTED]> wrote:
>
> +++ b/arch/powerpc/sysdev/qe_lib/qe.c
> @@ -149,22 +149,116 @@ EXPORT_SYMBOL(qe_issue_cmd);
>   */
>  static unsigned int brg_clk = 0;
>  
> -unsigned int get_brg_clk(void)
> +u32 get_brg_clk(enum qe_clock brgclk, enum qe_clock *brg_source)
>  {
> - struct device_node *qe;
> - if (brg_clk)
> - return brg_clk;
> + struct device_node *qe, *brg, *clocks;
> + enum qe_clock brg_src;
> + u32 brg_input_freq = 0;
> + u32 brg_num;
> + const unsigned int *prop;
>  
> - qe = of_find_node_by_type(NULL, "qe");
> - if (qe) {
> + *brg_source = 0;
> +
> + brg_num = brgclk - QE_BRG1;
> + brg = of_find_compatible_node(NULL, NULL, "fsl,cpm-brg");
> + if (brg) {
>   unsigned int size;
> - const u32 *prop = of_get_property(qe, "brg-frequency", );
> - brg_clk = *prop;
> - of_node_put(qe);
> - };
> + prop = of_get_property(brg,
> + "fsl,brg-sources", );
> +
> + brg_src = *(prop + brg_num);

You should probably sanity check that prop is not NULL and points to
something large enough.

You don't use brg after here, so the "of_node_put(brg)" could go here to
save putting it in multiple places later.  Also, currently there are
paths through the following code that do not do the of_node_put(brg).

> + if (brg_src == 0) {
> + *brg_source = 0;
> + if (brg_clk > 0) {
> + of_node_put(brg);
> + return brg_clk;
> + }
> + qe = of_find_node_by_type(NULL, "qe");
> + if (qe) {
> + unsigned int size;
> + prop = of_get_property
> + (qe, "brg-frequency", );
> + of_node_put(qe);
> + of_node_put(brg);
> + return *prop;

NULL check here (yes, I know that the old code didn't check).

> + }
> + } else {
> + *brg_source = brg_src + QE_CLK1 - 1;
> + clocks = of_find_compatible_node(NULL, NULL,
> + "fsl,cpm-clocks");
> + prop = of_get_property(clocks,
> + "#clock-cells", );
> + /*
> +  * clock-cells = 1 only supported right now.
> +  */
> + if (*prop != 1)

Again check for NULL (and possibly size).

> + return 0;
> + prop = of_get_property(clocks,
> + "clock-frequency", );
> +
> + brg_input_freq = *(prop+(brg_src - 1));

And again.

> + of_node_put(clocks);
> + of_node_put(brg);
> + return brg_input_freq;
> + }
> + }
>   return brg_clk;
>  }
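A sketch of the checks requested above, assuming a stand-in for the pointer and length that of_get_property() returns: read one cell only after confirming the property exists and is long enough, and drop the node reference once, right after its last use. prop_read_cell(), struct fake_prop and the refcounted struct fake_node are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for what of_get_property() hands back: a possibly NULL pointer
 * to 32-bit cells plus the property length in bytes. */
struct fake_prop {
	const uint32_t *value;
	int len;
};

/* Stand-in for a refcounted device_node. */
struct fake_node {
	int refcount;
};

static void fake_node_put(struct fake_node *np)
{
	if (np)
		np->refcount--;
}

/* Hypothetical helper: read cell @idx only if the property exists and is
 * long enough to contain it. Returns 0 on success, -1 otherwise. */
static int prop_read_cell(struct fake_prop p, unsigned int idx, uint32_t *out)
{
	assert(out != NULL);
	if (p.value == NULL || p.len <= 0)
		return -1;
	if (idx >= (size_t)p.len / sizeof(uint32_t))
		return -1;
	*out = p.value[idx];
	return 0;
}

/* Look up the BRG source and drop the node reference exactly once, right
 * after its last use, so no return path can leak it. */
static int brg_source_for(struct fake_node *brg, struct fake_prop sources,
			  unsigned int brg_num, uint32_t *src)
{
	int err = prop_read_cell(sources, brg_num, src);

	fake_node_put(brg);
	return err;
}
```

Putting the reference in one place right after the last use is what keeps the error paths from skipping it.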
-- 
Cheers,
Stephen Rothwell[EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/




Re: Please revert: PCI: fix IDE legacy mode resources

2007-12-10 Thread Benjamin Herrenschmidt

On Mon, 2007-12-10 at 23:07 +, Alan Cox wrote:
> > Forcing controllers into native mode tends to be something that really
> > only works on -some- controllers. I'm happy to have a hack to try to do
> > that on all of them on powermacs, because the range of controllers that
> > might not be in native mode in the first place there is pretty small,
> > and for CHRP briq, I do it for a specific known controller only.
> 
> I'm thinking of doing this solely if the platform has
> CONFIG_ATA_NO_LEGACY set. In other words we'd only try this stunt on a
> system we *know* cannot address the low PCI space ports.

All right. I don't set CONFIG_ATA_NO_LEGACY on powerpc anyway, as I do
support legacy ATA just fine on a range of machines. 

For example, Pegasos does the quirk the other way around, which is to
put the VIA IDE back into legacy mode, as there are issues with the
way that VIA chipset is configured on those machines.

For me, it's mostly a matter of making sure that the IRQ routing matches
what the platform code is set to deal with, or that sort of thing, as
unfortunately anything that involves legacy stuff is still pretty much
full of hacks.

Ben.




Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-10 Thread Guillaume Chazarain
Arjan van de Ven <[EMAIL PROTECTED]> wrote:

> the frequency of both cores is the maximum of what linux sets each core to;

Do you mean that the cpufreq code can be confused about the actual
frequency of the cores? That sounds like a big problem.

Thanks for any insight.

-- 
Guillaume


Re: Please revert: PCI: fix IDE legacy mode resources

2007-12-10 Thread Ralf Baechle
On Tue, Dec 11, 2007 at 07:43:03AM +1100, Benjamin Herrenschmidt wrote:

> > > :00:09.1 IDE interface: VIA Technologies, Inc.
> > > VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
> > > (prog-if 8a [Master SecP PriP])
> > > Flags: bus master, fast Back2Back, medium devsel, latency 64
> > > I/O ports at 1820 [size=16]
> > 
> > And that's lspci -v -b:
> > 
> > > :00:09.1 IDE interface: VIA Technologies, Inc.
> > > VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
> > > (prog-if 8a [Master SecP PriP])
> > > Flags: bus master, fast Back2Back, medium devsel, latency 64
> > > I/O ports at 10001820
> > 
> > So the IDE controller already seems to be in native mode?
> > 
> 
> No, native mode is 5 not A in the low 4 bits of progif.
> 
> You need to be a bit careful about those VIA chips; I remember having issues
> on Pegasos where we left it in legacy mode. I think the problem is that
> even when switched, the IRQ routing might be done based on some other
> setting in the chipset, possibly a strap. But that's nothing you can't
> deal with via an appropriate quirk in the arch code.
> 
> Also, double check the level/edge setting of the interrupts as it can be
> different between legacy and native (native is level low, legacy is
> rising edge).
> 
> I'm surprised however that one would use such a legacy southbridge on a
> platform that can't issue low IO ports, that doesn't seem to make sense
> to me ... there's a whole lot of things on this such as the 8259 PIC
> etc. that can only be addressed via low IOs, unless the ISA space can
> be somewhat remapped?

The GT-64111 system controller doesn't provide any kind of mapping
functionality that would help here.  So legacy port addressing can only
work by exploiting aliases due to incomplete decoding of legacy ioport
addresses by the VT82C586 - but direct addressing is impossible.
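The prog-if remark quoted above ("native mode is 5 not A in the low 4 bits") can be checked mechanically. The sketch below assumes the standard PCI IDE programming-interface layout (bit 0/2: primary/secondary channel in native mode, bit 1/3: that channel's mode is programmable, bit 7: bus master); the macro and function names are made up for illustration. 0x8a has bits 1, 3 and 7 set, which lspci prints as [Master SecP PriP]: both channels are switchable but still in legacy mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Programming-interface bits for PCI class 0x0101 (IDE). */
#define PROGIF_PRI_NATIVE     0x01u /* primary channel in native mode */
#define PROGIF_PRI_SWITCHABLE 0x02u /* primary channel mode is programmable */
#define PROGIF_SEC_NATIVE     0x04u /* secondary channel in native mode */
#define PROGIF_SEC_SWITCHABLE 0x08u /* secondary channel mode is programmable */
#define PROGIF_BUS_MASTER     0x80u /* bus-master DMA capable */

static bool primary_native(uint8_t progif)
{
	return (progif & PROGIF_PRI_NATIVE) != 0;
}

static bool secondary_native(uint8_t progif)
{
	return (progif & PROGIF_SEC_NATIVE) != 0;
}

/* Both channels native: bits 0 and 2 (0x5) set in the low nibble. */
static bool fully_native(uint8_t progif)
{
	return primary_native(progif) && secondary_native(progif);
}

/* Both channels could be switched to native mode by software. */
static bool both_switchable(uint8_t progif)
{
	const unsigned int mask = PROGIF_PRI_SWITCHABLE | PROGIF_SEC_SWITCHABLE;

	return (progif & mask) == mask;
}
```

This only decodes the register; it says nothing about the IRQ routing or level/edge caveats raised in the quoted reply.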

  Ralf


Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-10 Thread Guillaume Chazarain
Stefano Brivio <[EMAIL PROTECTED]> wrote:

> Sorry for disappearing. Anyway, yes, those patches fixed it. Precision in
> delays isn't that good when using my crappy unstable TSC (mdelay(2000)
> causes delays between 2 and 2.9 seconds) but it's not depending on frequency
> changes anymore. So I'd say it's fixed, but please tell me if you want me
> to do any other test so as to be sure it is.

Ingo,

it seems you dropped http://lkml.org/lkml/2007/12/7/100 (cpu_clock()
based udelay), so how can udelay be affected by your proposed changes?

Thanks.

-- 
Guillaume


Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-10 Thread Arjan van de Ven
On Tue, 11 Dec 2007 00:34:33 +0100
Stefano Brivio <[EMAIL PROTECTED]> wrote:

> On Tue, 11 Dec 2007 00:04:25 +0100
> Ingo Molnar <[EMAIL PROTECTED]> wrote:
> > 
> > * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> > 
> > > * Andrew Morton <[EMAIL PROTECTED]> wrote:
> > > 
> > > > > what do you think? Right now i've got them queued up for
> > > > > 2.6.25 in both the scheduler-devel and the x86-devel git
> > > > > trees - but can submit them for 2.6.24 if it's better if we
> > > > > did them there. I've got no strong opinion either way.
> > > > 
> > > > printk_clock() doesn't seem terribly important but what's this
> > > > stuff about effects on udelay/mdelay?  That can be serious if
> > > > they're getting shortened.
> > > 
> > > since udelay depends on loops_per_jiffy, which is fixed up by
> > > time_cpufreq_notifier(), I don't see how it could be affected by 
> > > frequency changes. (but that's the theory - practice might be 
> > > different)
> > 
> > Stefano Brivio reported udelay()/mdelay() effects in the b43
> > driver. (and it caused driver failures for him.)
> > 
> > Stefano, could you please try to sum up your experiences with that 
> > issue? Is it reproducible, and the 5 patches i did fix it? (if yes, 
> > could you try to re-do the mdelay verifications perhaps, to make
> > sure it's not some other effect interacting here. In theory
> > sched-clock scaling has no effect on udelay behavior.)
> 
> Sorry for disappearing. Anyway, yes, those patches fixed it.
> Precision in delays isn't that good when using my crappy unstable TSC
> (mdelay(2000) causes delays between 2 and 2.9 seconds) but it's not
> depending on frequency changes anymore. So I'd say it's fixed, but
> please tell me if you want me to do any other test so as to be sure
> it is.
> 
> 
I'm still quite concerned about this in dual/quad core scenarios;
the frequency of both cores is the maximum of what Linux sets each core to,
which means that if you're THIS sensitive to it, there is still quite a nasty
issue there.

I wonder if the various delay functions (maybe only in .25) should always use
the maximum observed loops_per_jiffy (across CPUs) instead, to be super safe
here.
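The suggestion above amounts to sizing busy-wait delays against the fastest CPU. A simplified sketch, assuming a per-CPU array of loops_per_jiffy values and a hypothetical HZ of 250; the real x86 delay code uses a fixed-point multiply rather than this plain integer formula.

```c
#include <assert.h>
#include <stddef.h>

#define HZ_ASSUMED 250UL /* hypothetical HZ for this example */

/* Largest loops_per_jiffy calibrated on any CPU. */
static unsigned long max_lpj(const unsigned long *lpj, size_t ncpus)
{
	unsigned long max = 0;

	assert(lpj != NULL || ncpus == 0);
	for (size_t i = 0; i < ncpus; i++)
		if (lpj[i] > max)
			max = lpj[i];
	return max;
}

/* Simplified busy-loop count for a delay: loops per second = lpj * HZ. */
static unsigned long delay_loops(unsigned long usecs, unsigned long lpj)
{
	return (unsigned long)((unsigned long long)usecs * lpj * HZ_ASSUMED /
			       1000000ULL);
}
```

Using the maximum can only lengthen delays on slower cores, which is the safe direction for drivers that depend on a minimum delay.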

-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]

2007-12-10 Thread Casey Schaufler

--- David Howells <[EMAIL PROTECTED]> wrote:

> Casey Schaufler <[EMAIL PROTECTED]> wrote:
> 
> > That happens to me when interfaces are described in SELinux terms. I
> > still don't care much for multiple contexts, and I don't have a good
> > grasp of how you'll deal with Smack, or any LSM other than SELinux.
> 
> Me neither.  I understand SELinux somewhat, though it's got a lot of wibbly
> bits, and WinNT's security system, but I have no experience of the other
> stuff.
> 
> > Just as Stephen mentions, I also don't see the generality that a change
> > of this magnitude really ought to provide.
> 
> Perhaps it should be a specific interface, solely for cachefiles's use then.

That would help focus things, to be sure. I don't know if that
focus will speed things up or slow them down, but I think that
attempting to accommodate SELinux/NFS, with the state that effort
is in, will only lead to tears.


Casey Schaufler
[EMAIL PROTECTED]


Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]

2007-12-10 Thread Casey Schaufler

--- David Howells <[EMAIL PROTECTED]> wrote:

> Stephen Smalley <[EMAIL PROTECTED]> wrote:
> 
> > From a config file whose pathname would be provided by libselinux (ala
> > the way in which dbusd imports contexts), or directly as a context
> > returned by a libselinux function.
> 
> That sounds too SELinux specific.  How do I do it so that it works for any
> LSM?
> 
> Is linking against libselinux a viable option if it's not available under
> all LSM models?  Is it available under all LSM models?  Perhaps Casey can
> answer this one.

Linking against libselinux is not now, nor will it ever be, a viable
option. There's just too much sophistication contained in libselinux
for us simple folk to deal with.

> > > I used to do that, but someone objected...  Possibly Karl MacMillan.
> > 
> > Yes, but I think I disagreed then too.
> 
> So, who's right?

Me! (smiley inserted here, for those in need)

> > It doesn't fit with how other users of security_kernel_act_as() will
> > likely want to work (they will want to just set the context to a
> > specified value, whether one obtained from the client or from some local
> > source), nor with how type transitions normally work (exec, with the
> > program type as the second type field).  I think it will just cause
> > confusion and subtle breakage.
> 
> It's causing me lots of confusion as it is.  I have been / am being told by
> different people to do different things just in dealing with SELinux, and
> various people are raising extra requirements or restrictions beyond that.
> There doesn't seem to be a consensus.
> 
> It sounds like the best option is just to have the kernel nick the userspace
> daemon's security context and use that as is, and junk all the restrictions on
> what the daemon can do so that the kernel isn't too restricted.

That would be consistent with the (perhaps archaic now) behavior
of nfsd on Unix, which did nothing but "lend its credential" to the
underlying kernel code. I think it's a rational approach, although I
expect that it may have trouble under SELinux.


Casey Schaufler
[EMAIL PROTECTED]


Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]

2007-12-10 Thread David Howells
Casey Schaufler <[EMAIL PROTECTED]> wrote:

> That happens to me when interfaces are described in SELinux terms. I
> still don't care much for multiple contexts, and I don't have a good
> grasp of how you'll deal with Smack, or any LSM other than SELinux.

Me neither.  I understand SELinux somewhat, though it's got a lot of wibbly
bits, and WinNT's security system, but I have no experience of the other
stuff.

> Just as Stephen mentions, I also don't see the generality that a change
> of this magnitude really ought to provide.

Perhaps it should be a specific interface, solely for cachefiles's use then.

David


Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-10 Thread Stefano Brivio
On Tue, 11 Dec 2007 00:04:25 +0100
Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > * Andrew Morton <[EMAIL PROTECTED]> wrote:
> > 
> > > > what do you think? Right now i've got them queued up for 2.6.25 in 
> > > > both the scheduler-devel and the x86-devel git trees - but can 
> > > > submit them for 2.6.24 if it's better if we did them there. I've got 
> > > > no strong opinion either way.
> > > 
> > > printk_clock() doesn't seem terribly important but what's this stuff 
> > > about effects on udelay/mdelay?  That can be serious if they're 
> > > getting shortened.
> > 
> > since udelay depends on loops_per_jiffy, which is fixed up by
> > time_cpufreq_notifier(), I don't see how it could be affected by 
> > frequency changes. (but that's the theory - practice might be 
> > different)
> 
> Stefano Brivio reported udelay()/mdelay() effects in the b43 driver. 
> (and it caused driver failures for him.)
> 
> Stefano, could you please try to sum up your experiences with that 
> issue? Is it reproducible, and the 5 patches i did fix it? (if yes, 
> could you try to re-do the mdelay verifications perhaps, to make sure 
> it's not some other effect interacting here. In theory sched-clock 
> scaling has no effect on udelay behavior.)

Sorry for disappearing. Anyway, yes, those patches fixed it. Precision in
delays isn't that good when using my crappy unstable TSC (mdelay(2000)
causes delays between 2 and 2.9 seconds) but it's not depending on frequency
changes anymore. So I'd say it's fixed, but please tell me if you want me
to do any other test so as to be sure it is.
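The argument quoted above is that udelay() tracks loops_per_jiffy, which time_cpufreq_notifier() rescales whenever the frequency changes. A minimal sketch of that linear rescaling, modelled on the kernel's cpufreq_scale(old, div, mult) (old * mult / div); the function name and the choice to scale from a fixed reference value are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Linear rescale in the style of cpufreq_scale(old, div, mult):
 * ref_lpj * new_khz / ref_khz, with a 64-bit intermediate so the
 * product cannot overflow a 32-bit unsigned long. */
static unsigned long rescale_lpj(unsigned long ref_lpj,
				 unsigned int ref_khz, unsigned int new_khz)
{
	assert(ref_khz != 0);
	return (unsigned long)((uint64_t)ref_lpj * new_khz / ref_khz);
}
```

Scaling from a fixed reference rather than from the previous value keeps rounding errors from accumulating across repeated transitions.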


--
Ciao
Stefano


Re: Why does reading from /dev/urandom deplete entropy so much?

2007-12-10 Thread Matt Mackall
On Tue, Dec 11, 2007 at 12:06:43AM +0100, Marc Haber wrote:
> On Sun, Dec 09, 2007 at 10:16:05AM -0600, Matt Mackall wrote:
> > On Sun, Dec 09, 2007 at 01:42:00PM +0100, Marc Haber wrote:
> > > On Wed, Dec 05, 2007 at 03:26:47PM -0600, Matt Mackall wrote:
> > > > The distinction between /dev/random and /dev/urandom boils down to one
> > > > word: paranoia. If you are not paranoid enough to mistrust your
> > > > network, then /dev/random IS NOT FOR YOU. Use /dev/urandom.
> > > 
> > > But currently, people who use /dev/urandom to obtain low-quality
> > > entropy do a DoS for the paranoid people.
> > 
> > Not true, as I've already pointed out in this thread.
> 
> I must have missed this. Can you please explain again? For a layman it
> looks like a paranoid application cannot read 500 Bytes from
> /dev/random without blocking if some other application has previously
> read 10 Kilobytes from /dev/urandom.

/dev/urandom always leaves enough entropy in the input pool for
/dev/random to reseed. Thus, as long as entropy is coming in, it is
not possible for /dev/urandom readers to starve /dev/random readers.
But /dev/random readers may still block temporarily and they should
damn well expect to block if they read 500 bytes out of a 512 byte
pool.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]

2007-12-10 Thread David Howells
Stephen Smalley <[EMAIL PROTECTED]> wrote:

> From a config file whose pathname would be provided by libselinux (ala
> the way in which dbusd imports contexts), or directly as a context
> returned by a libselinux function.

That sounds too SELinux specific.  How do I do it so that it works for any
LSM?

Is linking against libselinux a viable option if it's not available under
all LSM models?  Is it available under all LSM models?  Perhaps Casey can
answer this one.

> > I used to do that, but someone objected...  Possibly Karl MacMillan.
> 
> Yes, but I think I disagreed then too.

So, who's right?

> It doesn't fit with how other users of security_kernel_act_as() will
> likely want to work (they will want to just set the context to a
> specified value, whether one obtained from the client or from some local
> source), nor with how type transitions normally work (exec, with the
> program type as the second type field).  I think it will just cause
> confusion and subtle breakage.

It's causing me lots of confusion as it is.  I have been / am being told by
different people to do different things just in dealing with SELinux, and
various people are raising extra requirements or restrictions beyond that.
There doesn't seem to be a consensus.

It sounds like the best option is just to have the kernel nick the userspace
daemon's security context and use that as is, and junk all the restrictions on
what the daemon can do so that the kernel isn't too restricted.

David


Re: RFC: PNP: do not stop/start devices in suspend/resume path

2007-12-10 Thread Bjorn Helgaas
On Friday 07 December 2007 12:13:35 am Shaohua Li wrote:
> On Thu, 2007-12-06 at 02:24 +0800, Bjorn Helgaas wrote:
> > Index: linux-mm/drivers/pnp/driver.c
> > ===
> > --- linux-mm.orig/drivers/pnp/driver.c  2007-11-30 13:58:25.0 -0700
> > +++ linux-mm/drivers/pnp/driver.c   2007-12-03 09:58:35.0 -0700
> > @@ -161,13 +161,6 @@
> > return error;
> > }
> > 
> > -   if (!(pnp_drv->flags & PNP_DRIVER_RES_DO_NOT_CHANGE) &&
> > -   pnp_can_disable(pnp_dev)) {
> > -   error = pnp_stop_dev(pnp_dev);
> > -   if (error)
> > -   return error;
> > -   }
> > -
> > if (pnp_dev->protocol && pnp_dev->protocol->suspend)
> > pnp_dev->protocol->suspend(pnp_dev, state);
> > return 0;
> > @@ -177,7 +170,6 @@
> >  {
> > struct pnp_dev *pnp_dev = to_pnp_dev(dev);
> > struct pnp_driver *pnp_drv = pnp_dev->driver;
> > -   int error;
> > 
> > if (!pnp_drv)
> > return 0;
> > @@ -185,12 +177,6 @@
> > if (pnp_dev->protocol && pnp_dev->protocol->resume)
> > pnp_dev->protocol->resume(pnp_dev);
> > 
> > -   if (!(pnp_drv->flags & PNP_DRIVER_RES_DO_NOT_CHANGE)) {
> > -   error = pnp_start_dev(pnp_dev);
> > -   if (error)
> > -   return error;
> > -   }
> > -
> I'd suggest keeping pnp_start_dev here, to guard against the BIOS not
> assigning resources, or assigning different ones, after a resume.

The patch I currently have in -mm (http://lkml.org/lkml/2007/10/29/412)
merely requests resources in pnp_start_dev() and releases them in
pnp_stop_dev().  So if we remove pnp_stop_dev() but keep pnp_start_dev(),
I have to fix that patch to deal with things that may already be
reserved.

But I don't see any mention in the spec of running _SRS in the
sleep/wakeup path, so I'm not convinced it's really necessary.
Section 7.4 mentions _TTS, _PTS, _GTS, etc., but not _SRS.

For devices, it looks like the intent is that BIOS should generate
notifications that cause OSPM to re-enumerate devices that might
have changed.  I'm pretty sure Linux is missing some of that code,
though, so I could believe that _SRS might help paper over that
deficiency.

What I'd really like to do is figure out how Windows uses _SRS and
do the same thing.

Bjorn


Re: Please revert: PCI: fix IDE legacy mode resources

2007-12-10 Thread Alan Cox
> Forcing controllers into native mode tends to be something that really
> only works on -some- controllers. I'm happy to have a hack to try to do
> that on all of them on powermacs, because the range of controllers that
> might not be in native mode in the first place there is pretty small,
> and for CHRP briq, I do it for a specific known controller only.

I'm thinking of doing this solely if the platform has
CONFIG_ATA_NO_LEGACY set. In other words we'd only try this stunt on a
system we *know* cannot address the low PCI space ports.



Re: [BUG] 2.6.23-rc3 can't see sd partitions on Alpha

2007-12-10 Thread Ivan Kokshaysky
On Mon, Dec 10, 2007 at 09:08:53AM -0600, Bob Tracy wrote:
> Ivan Kokshaysky wrote:
> > For now I have reassigned the bug #9457 to myself and will gradually hack
> > into udev...
> 
> Thanks...  Let me know if there's anything useful I can do to help.

It turns out to be yet another strncpy() bug that indeed shows up only with
certain src/dst alignments and breaks kobject_get_path(). Ugh...
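Ivan doesn't say which strncpy() implementation is at fault (presumably the optimized Alpha routine). A minimal userspace harness for this class of bug compares a given strncpy() against a byte-at-a-time reference for every src/dst offset in 0..7. This is only a sketch, not Ivan's test: ref_strncpy(), check_strncpy() and the sample path string are made up, and running the kernel's own routine through it would require building that routine for userspace.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-at-a-time reference: copy at most n bytes of src and, if src is
 * shorter than n, pad the rest of dst with NUL bytes (C standard rules). */
char *ref_strncpy(char *dst, const char *src, size_t n)
{
	size_t i = 0;

	for (; i < n && src[i] != '\0'; i++)
		dst[i] = src[i];
	for (; i < n; i++)
		dst[i] = '\0';
	return dst;
}

/* Compare impl() against ref_strncpy() for every src and dst offset in
 * 0..7, every source length and every n up to 64. Both destination
 * buffers start out filled with the same sentinel, so a stray write past
 * dst + n is reported too. Returns the number of mismatching cases. */
int check_strncpy(char *(*impl)(char *, const char *, size_t))
{
	/* Arbitrary sysfs-like path; any string would do. */
	static const char text[] = "/devices/pci0000:00/0000:00:03.0/host0/target0:0:0";
	enum { MAXN = 64, BUFSZ = MAXN + 16 };
	char src_buf[BUFSZ], want[BUFSZ], got[BUFSZ];
	int mismatches = 0;

	for (size_t s_off = 0; s_off < 8; s_off++)
		for (size_t len = 0; len < sizeof(text); len++)
			for (size_t d_off = 0; d_off < 8; d_off++)
				for (size_t n = 0; n <= MAXN; n++) {
					char *src = src_buf + s_off;

					memset(src_buf, 'x', sizeof(src_buf));
					memcpy(src, text, len);
					src[len] = '\0';
					memset(want, 0x5a, sizeof(want));
					memset(got, 0x5a, sizeof(got));
					ref_strncpy(want + d_off, src, n);
					impl(got + d_off, src, n);
					if (memcmp(want, got, sizeof(want)) != 0)
						mismatches++;
				}
	return mismatches;
}
```

A correct implementation should report zero mismatches; an alignment-dependent bug shows up as a nonzero count for only some offsets.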

Hopefully I'll have a patch tomorrow.

Ivan.


Re: Why does reading from /dev/urandom deplete entropy so much?

2007-12-10 Thread Marc Haber
On Sun, Dec 09, 2007 at 10:16:05AM -0600, Matt Mackall wrote:
> On Sun, Dec 09, 2007 at 01:42:00PM +0100, Marc Haber wrote:
> > On Wed, Dec 05, 2007 at 03:26:47PM -0600, Matt Mackall wrote:
> > > The distinction between /dev/random and /dev/urandom boils down to one
> > > word: paranoia. If you are not paranoid enough to mistrust your
> > > network, then /dev/random IS NOT FOR YOU. Use /dev/urandom.
> > 
> > But currently, people who use /dev/urandom to obtain low-quality
> > entropy do a DoS for the paranoid people.
> 
> Not true, as I've already pointed out in this thread.

I must have missed this. Can you please explain again? For a layman it
looks like a paranoid application cannot read 500 Bytes from
/dev/random without blocking if some other application has previously
read 10 Kilobytes from /dev/urandom.
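Marc's scenario can be checked empirically. A small sketch, assuming the entropy estimate is exposed in /proc/sys/kernel/random/entropy_avail (as on kernels of this era) and that /dev/urandom is readable; the helper names are made up. It only reports the estimate before and after a 10 KiB read and says nothing about why it changes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Read a single integer from a text file such as entropy_avail.
 * Returns -1 if the file can't be opened or parsed. */
long read_long_from_file(const char *path)
{
	FILE *f = fopen(path, "r");
	long val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* Read and discard n bytes from /dev/urandom; returns bytes read or -1. */
long drain_urandom(size_t n)
{
	char buf[512];
	size_t total = 0;
	int fd = open("/dev/urandom", O_RDONLY);

	if (fd < 0)
		return -1;
	while (total < n) {
		size_t want = n - total < sizeof(buf) ? n - total : sizeof(buf);
		ssize_t r = read(fd, buf, want);

		if (r <= 0)
			break;
		total += (size_t)r;
	}
	close(fd);
	return (long)total;
}

/* Print the entropy estimate before and after reading 10 KiB. */
void entropy_experiment(void)
{
	const char *avail = "/proc/sys/kernel/random/entropy_avail";

	printf("before: %ld bits\n", read_long_from_file(avail));
	printf("read:   %ld bytes\n", drain_urandom(10 * 1024));
	printf("after:  %ld bits\n", read_long_from_file(avail));
}
```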

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 3221 2323190


Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-10 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> * Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > > what do you think? Right now i've got them queued up for 2.6.25 in 
> > > both the scheduler-devel and the x86-devel git trees - but can 
> > > submit them for 2.6.24 if it's better if we did them there. I've got 
> > > no strong opinion either way.
> > 
> > printk_clock() doesn't seem terribly important but what's this stuff 
> > about effects on udelay/mdelay?  That can be serious if they're 
> > getting shortened.
> 
> since udelay depends on loops_per_jiffy, which is fixed up by
> time_cpufreq_notifier(), I don't see how it could be affected by 
> frequency changes. (but that's the theory - practice might be 
> different)

Stefano Brivio reported udelay()/mdelay() effects in the b43 driver. 
(and it caused driver failures for him.)

Stefano, could you please try to sum up your experiences with that 
issue? Is it reproducible, and do the 5 patches I did fix it? (If yes, 
could you try to re-do the mdelay verifications perhaps, to make sure 
it's not some other effect interacting here. In theory sched-clock 
scaling has no effect on udelay behavior.)

Ingo


Re: [PATCH 0/4, v3] Physical PCI slot objects

2007-12-10 Thread Alex Chiang
Hi Kenji-san,

I have been thinking about this problem for quite a bit, and
think that there are no good solutions...

* Kenji Kaneshige <[EMAIL PROTECTED]>:
> On my system, hotplug slots themselves can be added, removed
> and replaced with the other type of I/O box. 
>> Are you talking about some sort of I/O cabinet/chassis that you
>> can attach to the actual computer? Can the I/O expander unit be
>> hotplugged? Or do you need to power your machine down to attach
>> it?
>> If you can hotplug it, I'm guessing that is why your firmware
>> presents SxFy objects in the namespace with "weird" _SUN values,
>> and it's why you have to check _STA to see if the slots are valid
>> or not. That means the value returned by _SUN will change too,
>> right? What will it turn into?
>>
>
> Currently, it's not hotpluggable (will be hotpluggable in the future).
> Here is a sample AML code to explain what my firmware is doing.
>
> Device (PCI0) {
>   Device (P2PA) {
>   Device (P2PB) { // for I/O unit (A)
>   Name (_ADR, ...)
>   Method (_STA) { ... }
>   }
>   Device (S0F0) { // for I/O unit (B)
>   Name (_ADR, ...)
>   Method (_STA) { ... }
>   Method (_EJx) { ... }
>   Method (_SUN) { ... }
>   }
>   ...
>   }
>   ...
> }
>
> If the I/O unit (A) is connected, _STA of P2PB returns as present
> and _STA of S0F0 returns as not present.
> If the I/O unit (B) is connected, _STA of P2PB returns as not
> present and _STA of S0F0 returns as present.

If I/O unit A or B can never appear while the system is turned on
(aka not hotpluggable), then it is incorrect to present them in
the current namespace. 

>>> In addition, I think we should not trust the _SUN value of
>>> non-existing device because the ACPI spec says in "6.5.1 _INI
>>> (Init)" that _INI method is run before _ADR, _CID, _HID, _SUN, and
>>> _UID are run. It means _SUN could be initialized in _INI method
>>> implicitly. And it also says that "If the _STA method indicates
>>> that the device is not present, OSPM will not run the _INI and will
>>> not examine the children of the device for _INI methods.". After all,
>>> _SUN for non-existing device is not reliable because it might not
>>> be initialized by _INI method.
>> This is true, but HP platforms provide _INI at the root
>> device/host bridge level, not on SxFy objects, so it doesn't seem
>> that we would need to call _STA before calling _SUN for SxFy.
>> Does your firmware provide _INI on SxFy objects?
>
> No, it doesn't. But what I wanted to say was we should not use _SUN
> value of non-existing device object.

There is nothing illegal about evaluating _SUN for an object that
returns 0x0 for _STA. 

Also, when you say "non-existing", I think of the ACPI CA
exception code AE_NOT_EXIST which means "absent from
the namespace", and is the reason why my code works on both HP
and IBM machines. It does not mean "_STA == 0x0".

>> Our firmware teams seem to think that _STA should give the status
>> of the card for hotplug support and general functional state.
>> They claim that it doesn't make much sense to support _STA on
>> the slot itself unless you can physically change the slot
>> topology on the machine at runtime, which we can't do (although
>> maybe you can).
>> The section of the spec you quoted is correct as long as we are
>> talking ACPI 2.0 or later. My platforms implement ACPI 1.0b for
>> legacy reasons. :-/
>> In ACPI 1.0b, _EJx definition says (section 6.3.2):
>>  For hot removal, the device must be immediately ejected
>>  when the OS calls the _EJ0 control method. The _EJ0
>>  control method does not return until ejection is
>>  complete. After calling _EJ0, the OS will call _STA to
>>  determine whether or not the eject succeeded.
>> So your firmware implementation does not seem backward compatible
>> with the 1.0b spec. The different versions of ACPI is part of the
>> reason why my patch is breaking on your machine.
>
> I think this is the real reason. My platform implements ACPI 2.0 or
> later. I didn't notice the change to the _EJx definition. Maybe we need to
> check ACPI version in pci_slot driver.

I did some experiments on HP low-end ia64 (ACPI 1.0b only) and
our mid-range and high-end ia64 platforms (ACPI 2.0c). Checking
for _STA before evaluating _SUN leads to the same result for me:
we only detect populated slots.

I think that the real issue is not 1.0 vs 2.0, but the semantics
that our different firmware teams have placed on _STA. Again,

  - HP firmware thinks _STA should give status of the card
  - Fujitsu firmware thinks _STA should give status of the slot

So we are at an impasse. :(
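To make the impasse concrete, here is a hypothetical model of the two policies being argued over. struct slot_eval and both functions are invented for illustration and are not the pci_slot driver's code; only the meaning of _STA bit 0 (present) and the rule that an object without _STA is treated as present come from the ACPI specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* _STA bit 0: the device is present (ACPI specification). */
#define STA_PRESENT 0x01u

/* Hypothetical summary of what evaluating one SxFy object returned. */
struct slot_eval {
	bool has_sun;   /* _SUN found in the namespace (not AE_NOT_EXIST) */
	bool has_sta;   /* _STA method exists */
	uint32_t sta;   /* value _STA returned, valid only if has_sta */
};

/* Policy A: register a slot whenever _SUN exists, whatever _STA says. */
bool register_if_sun_exists(const struct slot_eval *s)
{
	return s->has_sun;
}

/* Policy B: trust _SUN only if _STA says the object is present.
 * An object without _STA is treated as present. */
bool register_if_present(const struct slot_eval *s)
{
	if (!s->has_sun)
		return false;
	if (!s->has_sta)
		return true;
	return (s->sta & STA_PRESENT) != 0;
}
```

On firmware where _STA describes the card, policy B hides empty slots, which matches the observation that checking _STA first detects only populated slots. On firmware where _STA describes the slot, policy A lists slots belonging to an I/O unit that is not connected.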

>> But as long as we are quoting the spec...  :)
>>  _SUN evaluates to a DWORD that is the number to be used
>>  in the user interface. This number is required to be
>>  unique among 

Re: [PATCH] Fake NUMA emulation for PowerPC (Take 2)

2007-12-10 Thread Olof Johansson
On Sat, Dec 08, 2007 at 04:07:14AM +0530, Balbir Singh wrote:

> Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>

Looks good to me. Sure, it could be fleshed out to something more
generic and in common code, but this is small and simple and doesn't
bloat the kernel much as it stands, and it has value for debugging.

Acked-by: Olof Johansson <[EMAIL PROTECTED]>


Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-10 Thread Ingo Molnar

* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > what do you think? Right now i've got them queued up for 2.6.25 in 
> > both the scheduler-devel and the x86-devel git trees - but can 
> > submit them for 2.6.24 if it's better if we did them there. I've got 
> > no strong opinion either way.
> 
> printk_clock() doesn't seem terribly important but what's this stuff 
> about effects on udelay/mdelay?  That can be serious if they're 
> getting shortened.

since udelay depends on loops_per_jiffy, which is fixed up by
time_cpufreq_notifier(), I don't see how it could be affected by
frequency changes. (but that's the theory - practice might be different)
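The theory can be made concrete with a simplified userspace model. If the notifier rescales loops_per_jiffy in proportion to the new CPU frequency, then the loop count a given udelay() needs scales the same way, and the wall-clock delay is unchanged. The function names and the plain division are assumptions for illustration (the kernel's delay code uses fixed-point arithmetic), and nothing here models how the b43 symptoms could arise.

```c
#include <assert.h>
#include <stdint.h>

/* Rescale loops_per_jiffy for a frequency change: old * new / old_freq,
 * with a 64-bit intermediate so the product cannot overflow. */
unsigned long scale_lpj(unsigned long old_lpj,
			unsigned int old_khz, unsigned int new_khz)
{
	return (unsigned long)((uint64_t)old_lpj * new_khz / old_khz);
}

/* Busy-loop iterations needed for usecs microseconds:
 * usecs * (loops per second) / 1e6, where loops per second = lpj * HZ. */
uint64_t udelay_loops(unsigned long usecs, unsigned long lpj, unsigned int hz)
{
	return (uint64_t)usecs * lpj * hz / 1000000u;
}
```

Halving the frequency halves both loops_per_jiffy and the loop count for udelay(100), while each iteration takes twice as long, so the delay stays at 100 microseconds.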

Ingo


Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump

2007-12-10 Thread Vivek Goyal
On Fri, Dec 07, 2007 at 03:53:30PM +, Huang, Ying wrote:
> This patch implements the functionality of jumping between the kexeced
> kernel and the original kernel.
> 
> To support jumping between two kernels, before jumping to (executing)
> the new kernel and before jumping back to the original kernel, the devices
> are put into a quiescent state, and the state of the devices and CPU is
> saved. After jumping back from the kexeced kernel and after jumping to the
> new kernel, the state of the devices and CPU is restored accordingly. The
> device/CPU state save/restore code of software suspend is called to
> implement the corresponding function.
> 
> To support jumping without reserving memory, one shadow backup page
> (source page) is allocated for each page used by the new (kexeced) kernel
> (destination page). At kexec_load time, the image of the new kernel is
> loaded into the source pages, and before executing, the destination pages
> and the source pages are swapped, so the contents of the destination pages
> are backed up. Before jumping to the new (kexeced) kernel and after
> jumping back to the original kernel, the destination pages and the
> source pages are swapped too.
> 
> A jump back protocol for kexec is defined and documented. It is an
> extension to the ordinary function calling protocol, so the facility
> provided by this patch can be used to call ordinary C functions in real
> mode.
> 
> A set of flags for sys_kexec_load is added to control which states are
> saved/restored before/after the real mode code executes. For example, you
> can specify that the device state and FPU state are saved/restored
> before/after the real mode code executes.
> 
> The state (excluding CPU state) save/restore code can be overridden
> based on the "command" parameter of kexec jump, because more state
> needs to be saved/restored by hibernating/resuming.
> 

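The backup scheme in the quoted description reduces to exchanging the contents of each destination page with its shadow source page, and that exchange is its own inverse. A userspace sketch of just that property, assuming a 4 KiB page and a spare bounce page. swap_pages() and SKETCH_PAGE_SIZE are invented names; the patch itself operates on physical pages, which this does not attempt to model.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096  /* assumed page size for the sketch */

/* Exchange the contents of two equally sized pages via a bounce buffer.
 * Applying it twice restores both pages, so data that lived in the
 * destination page survives a jump to the new kernel and back. */
void swap_pages(unsigned char *dst, unsigned char *src, unsigned char *bounce)
{
	memcpy(bounce, dst, SKETCH_PAGE_SIZE);
	memcpy(dst, src, SKETCH_PAGE_SIZE);
	memcpy(src, bounce, SKETCH_PAGE_SIZE);
}
```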

[..]
>  
> -#define KEXEC_ON_CRASH  0x0001
> -#define KEXEC_ARCH_MASK 0x
> +#define KEXEC_ON_CRASH          0x0001
> +#define KEXEC_PRESERVE_CPU      0x0002
> +#define KEXEC_PRESERVE_CPU_EXT  0x0004
> +#define KEXEC_SINGLE_CPU        0x0008
> +#define KEXEC_PRESERVE_DEVICE   0x0010
> +#define KEXEC_PRESERVE_CONSOLE  0x0020

Hi,

Why do we need so many different flags for preserving different types
of state (CPU, CPU_EXT, device, console)? To keep things simple,
can't we create just one flag, KEXEC_PRESERVE_CONTEXT, which will
indicate any special action required to preserve the previous kernel's
context so that one can switch back to the old kernel?

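The difference between the two designs shows up in how a caller spells its request. A sketch: the six flag values are copied from the patch, while KEXEC_PRESERVE_ALL, SKETCH_KEXEC_PRESERVE_CONTEXT (with an arbitrarily chosen bit) and both helpers are hypothetical. The single-flag proposal does not specify a value or an implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values from the patch under review. */
#define KEXEC_ON_CRASH          0x0001
#define KEXEC_PRESERVE_CPU      0x0002
#define KEXEC_PRESERVE_CPU_EXT  0x0004
#define KEXEC_SINGLE_CPU        0x0008
#define KEXEC_PRESERVE_DEVICE   0x0010
#define KEXEC_PRESERVE_CONSOLE  0x0020

/* All of the patch's "preserve" bits together (hypothetical helper). */
#define KEXEC_PRESERVE_ALL (KEXEC_PRESERVE_CPU | KEXEC_PRESERVE_CPU_EXT | \
			    KEXEC_PRESERVE_DEVICE | KEXEC_PRESERVE_CONSOLE)

/* Per-state scheme: each save/restore step has its own bit, so partial
 * combinations can be requested. */
bool wants_step(unsigned long flags, unsigned long step)
{
	return (flags & step) != 0;
}

/* Single-flag scheme: one bit (value made up for this sketch) means
 * "preserve whatever is needed to jump back", which the kernel would
 * expand to the full set internally. */
#define SKETCH_KEXEC_PRESERVE_CONTEXT 0x0100

unsigned long expand_context_flag(unsigned long flags)
{
	return (flags & SKETCH_KEXEC_PRESERVE_CONTEXT) ? KEXEC_PRESERVE_ALL : 0;
}
```

With per-state flags a caller can ask for, say, CPU and device preservation without the console; with a single flag the kernel alone decides what the full set is.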
Thanks
Vivek


Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]

2007-12-10 Thread Casey Schaufler

--- Stephen Smalley <[EMAIL PROTECTED]> wrote:

> On Mon, 2007-12-10 at 21:08 +, David Howells wrote:
> > Stephen Smalley <[EMAIL PROTECTED]> wrote:
> > 
> > > Otherwise, only other issue I have with this interface is it won't
> > > generalize to dealing with nfsd, where we want to set the acting context
> > > to a context we obtain from or determine based upon the client.
> > 
> > Are you speaking of security_kernel_act_as() and security_create_files_as()
> > specifically?  Or the task_struct::act_as override pointer in general?
> 
> security_kernel_act_as()
> 
> > I don't really know how nfsd wants to obtain and set its LSM context, so it's
> > a bit difficult for me to make something that works for nfsd as well as
> > cachefiles.
> 
> It would get a context from the client or from a local configuration
> that would map security-unaware clients to a default context, and then
> want to assume that context for the particular operation.  No transition
> involved.

I would expect that the operation would be more sophisticated
than that. You certainly aren't going to use what comes from
the other side without any processing, and I expect you'll have
some sort of operation on anything you pull from a config file
before you actually apply it.

> > > Why can't cachefilesd just push a context into the kernel and pass that
> > > into the hook as the acting context,
> > 
> > How does cachefilesd come up with such a context?  Grab it from
> > /etc/cachefilesd.conf?
> 
> From a config file whose pathname would be provided by libselinux (ala
> the way in which dbusd imports contexts), or directly as a context
> returned by a libselinux function.  Has to be done that way so that it
> can be set differently for different policy types (strict, targeted,
> mls).

Unless you've got an LSM other than SELinux, of course. If
cachefilesd is going to be responsible for maintaining this
magic context there needs to be an LSM interface for it, not
just an SELinux interface.

> Naturally, cachefiles (the kernel module) would invoke a security hook
> to check whether the daemon is allowed to set the specified context.
> 
> > I use to do that, but someone objected...  Possibly Karl MacMillan.
> 
> Yes, but I think I disagreed then too.
> 
> > > and then nfsd can do likewise using the context provided by the client or
> > > obtained locally from exports for ordinary clients?  Avoids the transition
> > > SID computation altogether within the kernel and makes this more generic.
> > 
> > I seem to remember that I was told that it should be done this way, possibly
> > by Karl MacMillan, but I don't remember exactly.
> > 
> > Now it's configured by cachefilesd.te:
> > 
> > type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
> 
> It doesn't fit with how other users of security_kernel_act_as() will
> likely want to work (they will want to just set the context to a
> specified value, whether one obtained from the client or from some local
> source), nor with how type transitions normally work (exec, with the
> program type as the second type field).  I think it will just cause
> confusion and subtle breakage.

I think that I agree with Stephen, although I could be merely confused.
That happens to me when interfaces are described in SELinux terms. I
still don't care much for multiple contexts, and I don't have a good
grasp of how you'll deal with Smack, or any LSM other than SELinux.
Just as Stephen mentions, I also don't see the generality that a change
of this magnitude really ought to provide.



Casey Schaufler
[EMAIL PROTECTED]


Re: [PATCH 2.6.24-rc4] proc: Remove/Fix proc generic d_revalidate

2007-12-10 Thread vandrove
Quoting Andrew Morton <[EMAIL PROTECTED]>:

> On Mon, 10 Dec 2007 16:32:18 +0300 "Denis V. Lunev" <[EMAIL PROTECTED]> wrote:
> >
> 
> Please don't top-post.  It makes replying to you rather awkward.
> 
> > could you, plz, check patch sent by Eric above in this thread.
> > 
> > I have tried it on my test node and it works for module you have
> > provided. The problem exists without it.
> > 
> 
> When Peter says "with your patch in place" I assume that he's referring to
> Eric's latest patch, namely.

Sorry, I was not clear.  No, I meant Eric's original patch.  Without
d_revalidate() the problem does not occur.
   Petr

> 
> --- a/fs/proc/generic.c~proc-remove-fix-proc-generic-d_revalidate
> +++ a/fs/proc/generic.c
> @@ -374,16 +374,9 @@ static int proc_delete_dentry(struct den
>   return 1;
>  }
>  
> -static int proc_revalidate_dentry(struct dentry *dentry, struct nameidata *nd)
> -{
> - d_drop(dentry);
> - return 0;
> -}
> -
>  static struct dentry_operations proc_dentry_operations =
>  {
>   .d_delete   = proc_delete_dentry,
> - .d_revalidate   = proc_revalidate_dentry,
>  };
>  
>  /*
> 
> So we still have problems, it appears.
> 
> > 
> > Petr Vandrovec wrote:
> > > Eric W. Biederman wrote:
> > >> Ultimately to implement /proc perfectly we need an implementation
> > >> of d_revalidate because files and directories can be removed behind
> > >> the back of the VFS, and d_revalidate is the only way we can let
> > >> the VFS know that this has happened.
> > >>
> > >> So until we get a proper test for keeping dentries in the dcache
> > >> fix the current d_revalidate method by completely removing it.  This
> > >> returns us to the current status quo.
> > > 
> > > Hello,
> > >    I know that I'm late to the party, but mount points are not the only
> > > problem with d_revalidate.  With your patch in place, the module below gets
> > > its refcount incremented by two every time I do 'ls -la /proc/fs/vmblock'.
> > > 
> > >
> > > #include <linux/kernel.h>
> > > #include <linux/module.h>
> > > #include <linux/proc_fs.h>
> > > 
> > > static int vmblockinit(void) {
> > >struct proc_dir_entry *controlProcDirEntry;
> > > 
> > >/* Create /proc/fs/vmblock */
> > >controlProcDirEntry = proc_mkdir("vmblock", proc_root_fs);
> > >if (!controlProcDirEntry) {
> > >   printk(KERN_DEBUG "Bad...\n");
> > >   return -EINVAL;
> > >}
> > >controlProcDirEntry->owner = THIS_MODULE;
> > >return 0;
> > > }
> > > 
> > > static void vmblockexit(void) {
> > >remove_proc_entry("vmblock", proc_root_fs);
> > > }
> > > 
> > > module_init(vmblockinit);
> > > module_exit(vmblockexit);
> > > 
> > > 
> > > (code comes from VMware's vmblock module,
> > > http://sourceforge.net/project/showfiles.php?group_id=204462)
> > > Thanks,
> > > Petr
> > > 
> > > 
> 
> 
> 





Re: syslets v7: back to basics

2007-12-10 Thread Zach Brown

> I pulled from your tree to look over the patches, and noticed that it
> looks like several commits were merged improperly.  It looks like they
> were auto merged or something from an email, and the commit message
> contains the email headers, rather than just the commit message in the
> body.  This leads to the shortlog showing entries that start with
> "Return-Path:".

These are patches that guilt imported from email messages.  It didn't
strip the headers and I didn't care to.  I'll try to in the future, it
isn't a big deal.

> I was hoping to find at least some initial information on the overall
> design in Documentation/ but don't see any.  Have you written any yet
> that I could take a look at elsewhere maybe?

No, but it's coming.  I'd like to have some robust documentation so that
Ulrich can help me understand what more he'd need to support POSIX AIO
with syslets from glibc.

> Some of the things I was trying to figure out is does each syslet get
> its own stack,

Yes.  Each blocking operation has a thread that is performing the
operation synchronously.  The benefit is that the thread is only created
if the operation blocks.  If it doesn't block then it's a normal system
call invocation.  You don't have to manage threads and communicate the
arguments and results of system calls amongst threads for the case where
it never blocks.
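The "only create a thread if the call blocks" idea has a rough userspace analogue. This is not the syslet interface: struct async_read, submit_async_read() and wait_async_read() are made-up names, and O_NONBLOCK plus a pthread stand in for the kernel noticing that a task is about to sleep and handing the operation to a new thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

struct async_read {
	int fd;
	char *buf;
	size_t len;
	ssize_t result;      /* read()'s return value once complete */
	bool threaded;       /* true if a helper thread had to be created */
	pthread_t thread;
};

static void *blocking_read(void *arg)
{
	struct async_read *req = arg;

	req->result = read(req->fd, req->buf, req->len);
	return NULL;
}

/* Returns 0 if the read completed inline or a helper thread was started,
 * and -1 if fcntl() or pthread_create() failed. */
int submit_async_read(struct async_read *req)
{
	int flags = fcntl(req->fd, F_GETFL);
	int err;

	if (flags < 0)
		return -1;
	req->threaded = false;

	/* Fast path: try without blocking; if data is already there this
	 * costs about as much as a plain read(). */
	if (fcntl(req->fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -1;
	req->result = read(req->fd, req->buf, req->len);
	err = errno;
	fcntl(req->fd, F_SETFL, flags);   /* back to blocking mode */
	if (req->result >= 0 || err != EAGAIN)
		return 0;                     /* completed (or failed) inline */

	/* Slow path: the call would have slept, so only now pay for a
	 * thread that performs it synchronously. */
	if (pthread_create(&req->thread, NULL, blocking_read, req) != 0)
		return -1;
	req->threaded = true;
	return 0;
}

/* Wait for completion and return read()'s result. */
ssize_t wait_async_read(struct async_read *req)
{
	if (req->threaded)
		pthread_join(req->thread, NULL);
	return req->result;
}
```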

> and schedule only at a few well defined points

No, every blocking point is considered a scheduling point.

> , and if
> so, would it then be fair to characterize them as kernel mode fibers?

I'm not sure what exactly you mean by kernel mode fibers (I can guess,
but I'd rather not).  From the answer to the last question, though,
I'm going to guess that it might not be the most apt characterization.

- z



Re: [RFC] [PATCH] A clean approach to writeout throttling

2007-12-10 Thread Pekka Enberg
Hi,

On Dec 10, 2007 11:31 PM, Jonathan Corbet <[EMAIL PROTECTED]> wrote:
> I'm just getting around to looking at this.  One thing jumped out at me:
>
> > + if (bio->bi_throttle) {
> > + struct request_queue *q = bio->bi_queue;
> > + bio->bi_throttle = 0; /* or detect multiple endio and err? */
> > + atomic_add(bio->bi_throttle, &q->available);
> > + wake_up(&q->throttle_wait);
> > + }
>
> I'm feeling like I must be really dumb, but...how can that possibly
> work?  You're zeroing bio->bi_throttle before adding it back into
> q->available, so the latter will never increase...

Heh, well, that's ok as long as bio->bi_vcnt is set to zero and I think we
have some md raid drivers do just that... ;-)
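The fix Jonathan is hinting at is purely about ordering: the charge has to be read before the field is cleared. A sketch of the intended sequence using C11 atomics in userspace. struct sketch_queue, struct sketch_bio and sketch_endio_unthrottle() are hypothetical stand-ins for the request queue and bio fields in the quoted hunk, and the wake_up() call is only noted in a comment.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the fields used in the quoted hunk. */
struct sketch_queue {
	atomic_int available;         /* plays the role of q->available */
};

struct sketch_bio {
	int bi_throttle;              /* units charged at submission */
	struct sketch_queue *bi_queue;
};

/* Return the bio's throttle charge to its queue exactly once:
 * save the amount, clear the field, then credit the saved value. */
void sketch_endio_unthrottle(struct sketch_bio *bio)
{
	int charge = bio->bi_throttle;

	if (charge) {
		struct sketch_queue *q = bio->bi_queue;

		bio->bi_throttle = 0;     /* a second endio now adds nothing */
		atomic_fetch_add(&q->available, charge);
		/* the real code would wake_up(&q->throttle_wait) here */
	}
}
```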

Pekka


Re: [PATCH] Avoid overflows in kernel/time.c

2007-12-10 Thread Andrew Morton
On Mon, 10 Dec 2007 10:59:20 -0800
"H. Peter Anvin" <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > 
> > My ia64 allmodconfig build has taken
> > 
> > akpm 15700 89.6  0.0   8256   700 pts/4    RN+  03:09  10:41 bc -q kernel/timeconst.bc
> > 
> > 11 minutes so far.  fc6/x86_64.
> > 
> 
> I just tried this on my system, using your cross-compiler chain.  I got 
> a different error:
> 
> /opt/crosstool/gcc-3.4.5-glibc-2.3.6/ia64-unknown-linux-gnu/lib/gcc/ia64-unknown-linux-gnu/3.4.5/../../../../ia64-unknown-linux-gnu/bin/ld:
> section .data.patch [a500 -> a507] overlaps 
> section .dynamic [a3c8 -> a507]
> collect2: ld returned 1 exit status
> make[2]: *** [arch/ia64/kernel/gate.so] Error 1

You'll need rc4-mm1's ia64-increase-datapatch-offset.patch.  That's now in
Tony's tree and should go into 2.6.24 IMO.

> ... but the timeconst stuff worked fine.  I tried it both from the 
> command line and using your xb script.
> 
> This is on a fc7/x86-64 box.  I also ran through all the values from 48 
> to 1024 on both an fc5 and an fc7 box (no fc6 box readily available, 
> although bc has been at 1.06 since 2000...)
> 
> In short, this is highly weird.  Could you possibly do me a favour and 
> just run, at the command line:
> 
> echo 250 | bc -q kernel/timeconst.bc

That works OK.

> ... and see if it reproduces the lockup (I'm assuming HZ == 250 in your 
> config, since that's what I get when I do "make allmodconfig" on IA64.)
> 
> (No need to wait 11 minutes.  It should run in a small fraction of a 
> second.)

I retested 2.6.24-rc4-mm1 plus avoid-overflows-in-kernel-timec.patch and
the failure has magically gone away.

Ho hum.  I'll reconstitute the patch and will keep an eye on it.
It'd be nice to avoid the introduction of the bc dependency though.
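On avoiding the bc dependency: at least the simplest kind of constant involved, the jiffies-to-milliseconds ratio 1000/HZ reduced to lowest terms, needs only integer arithmetic that a small host C program could do at build time. This is an assumption-laden sketch, not HPA's script; gcd_ul() and hz_to_msec_ratio() are made-up names, and it does not reproduce whatever else kernel/timeconst.bc generates.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm. */
unsigned long gcd_ul(unsigned long a, unsigned long b)
{
	while (b != 0) {
		unsigned long t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* Reduce 1000/HZ to lowest terms so that
 * msecs = jiffies * num / den uses the smallest possible constants. */
void hz_to_msec_ratio(unsigned long hz, unsigned long *num, unsigned long *den)
{
	unsigned long g = gcd_ul(1000, hz);

	*num = 1000 / g;
	*den = hz / g;
}
```

For HZ=250 this gives 4/1, for HZ=300 it gives 10/3, and for HZ=1024 it gives 125/128.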


Re: Possibly SATA related freeze killed networking and RAID

2007-12-10 Thread Thiemo Nagel
Hello,

I think, I'm experiencing the same problem:

09:16:34 : NETDEV WATCHDOG: eth0: transmit timed out
09:16:34 : eth0: Got tx_timeout. irq: 
09:16:34 : eth0: Ring at 37e5
09:16:34 : eth0: Dumping tx registers
09:16:34 :   0:  00ff 0003 025003ca  
09:16:34 :  20:      

[...]

09:16:54 : ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
09:16:54 : ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
09:16:54 : ata6.00: cmd 25/00:08:1e:97:48/00:00:19:00:00/e0 tag 0 cdb 0x0
data 4096 in
09:16:54 :  res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4
(timeout)
09:16:54 : ata5.00: cmd 25/00:70:1e:97:48/00:00:19:00:00/e0 tag 0 cdb 0x0
data 57344 in
09:16:54 :  res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4
(timeout)
09:16:54 : ata6: soft resetting port
09:16:54 : ata5: soft resetting port
09:16:54 : ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
09:16:54 : ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
09:16:54 : NETDEV WATCHDOG: eth0: transmit timed out
09:16:54 : eth0: Got tx_timeout. irq: 0032
09:16:54 : eth0: Ring at 37e5
09:16:54 : eth0: Dumping tx registers

A more complete log can be found at:
http://www.e18.physik.tu-muenchen.de/~tnagel/misc/kernel-crash.log

The setup is strikingly similar to that of noah (I'm quoting all of this
by heart, if somebody is interested in more detail, just ask.):

Kernel: 2.6.22 (amd64, Debian patches, tainted)
Mainboard: Asus M2N-SLI Deluxe (nForce 570 SLI MCP --> MCP55, same as noah)
CPU: Athlon64 Dual-Core (same as noah)
RAM: 1GB
HD: 22 x Samsung HD501LJ 500GB (same as noah), 1-6 connected to chipset,
7-22 connected to RocketRaid 2340.

I'm using software RAID like noah, (levels 1, 5 and 6), and like with noah
the problem occurred during RAID check, in my case during heavy NFS load
which had been ongoing for ~4 days.  This is the third time it has
happened, but only this time I could catch the logs via netconsole.  The
two affected drives are connected to the chipset and show no SMART errors.

Unfortunately, the kernel is tainted since I'm using HighPoint's drivers
for the RR2340.  I don't know whether I can change this easily.

Kind regards,

Thiemo Nagel



Re: [PATCH] IB/ehca: Serialize HCA-related hCalls on POWER5

2007-12-10 Thread Roland Dreier
 > > map_phys_fmr
 > 
 > In fact, we do use hCalls there. Our hardware doesn't actually support FMRs,
 > so we translate a "map FMR" into a "reallocate PMR", which doesn't work
 > without hCalls. What's more, the hCalls involved (e.g. H_FREE_RESOURCE)
 > might well return H_LONG_BUSY, so the whole operation might sleep; no way
 > around it.

It's a big problem.  If you cannot implement FMRs in such a way that
you can handle map_phys_fmr being called in a context that
can't sleep, then I think the only option is to remove your FMR
support.  It's an optional device feature, so this should be OK
(although the iSER driver currently seems to depend on a device
supporting FMRs, which is probably going to be a problem with iWARP
support in the future anyway).

The fact that consumers can map FMRs from interrupt context, while
holding locks, etc, is pretty fundamental to the use of FMRs so I
don't see any way around the requirement that map_phys_fmr never
sleep.

 - R.


Re: [rfc] lockless get_user_pages for dio (and more)

2007-12-10 Thread Dave Kleikamp

On Mon, 2007-10-15 at 22:25 +1000, Nick Piggin wrote:
> On Monday 15 October 2007 04:19, Siddha, Suresh B wrote:
> > On Sun, Oct 14, 2007 at 11:01:02AM +1000, Nick Piggin wrote:

> > > This is just a really quick hack, untested ATM, but one that
> > > has at least a chance of working (on x86).
> >
> > When we fall back to slow mode, we should decrement the ref counts
> > on the pages we got so far in the fast mode.
> 
> Here is something that is actually tested and works (not
> tested with hugepages yet, though).
> 
> However it's not 100% secure at the moment. It's actually
> not completely trivial; I think we need to use an extra bit
> in the present pte in order to exclude "not normal" pages,
> if we want fast_gup to work on small page mappings too. I
> think this would be possible to do on most architectures, but
> I haven't done it here obviously.
> 
> Still, it should be enough to test the design. I've added
> fast_gup and fast_gup_slow to /proc/vmstat, which count the
> number of times fast_gup was called, and the number of times
> it dropped into the slowpath. It would be interesting to know
> how it performs compared to your granular hugepage ptl...
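Suresh's point about the fallback can be illustrated with a user-space model: the fast walk takes a reference on each page it accepts, and if it reaches an entry it cannot handle, it drops the references it already took before reporting failure, so the slow path starts from a clean slate. Everything here (`struct fake_page`, `struct fake_entry`, `fast_walk`) is hypothetical; a real implementation walks page tables with interrupts disabled and uses get_page()/put_page().

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct page: just a reference count. */
struct fake_page {
    int refcount;
};

/* One "page table entry": the page it maps and whether the fast path
 * may handle it (e.g. present, user-accessible, normal page). */
struct fake_entry {
    struct fake_page *page;
    bool ok_for_fast_path;
};

/* Fast walk over nr entries. On success, pages[] holds nr referenced
 * pages and nr is returned. If any entry is unsuitable, every
 * reference taken so far is dropped and 0 is returned, so the caller
 * can fall back to the slow path without leaking references. */
static int fast_walk(struct fake_entry *entries, int nr,
                     struct fake_page **pages)
{
    int got = 0;

    for (int i = 0; i < nr; i++) {
        if (!entries[i].ok_for_fast_path)
            goto undo;
        entries[i].page->refcount++;      /* models get_page() */
        pages[got++] = entries[i].page;
    }
    return got;

undo:
    while (got > 0)
        pages[--got]->refcount--;         /* models put_page() */
    return 0;
}
```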

Nick,
I've played with the fast_gup patch a bit.  I was able to find a problem
in follow_hugetlb_page() that Adam Litke fixed.  I haven't been brave
enough to implement it on any other architectures, but I did add a
default that takes mmap_sem and calls the normal get_user_pages() if the
architecture doesn't define fast_gup().  I put it in linux/mm.h, for
lack of a better place, but it's a little kludgy since I didn't want
mm.h to have to include sched.h.  This patch is against 2.6.24-rc4.
It's not ready for inclusion yet, of course.
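The generic default described above, for architectures without a lockless walk, has a simple shape: take mmap_sem for reading around an ordinary get_user_pages() call. The user-space sketch below models that shape; `fake_rwsem`, `slow_gup()` and `generic_fast_gup()` are hypothetical stand-ins, not the code in the patch, and the "lock" only counts readers so the sketch can check that it is always released.

```c
/* Hypothetical stand-in for mmap_sem: tracks the reader count only. */
struct fake_rwsem {
    int readers;
};

static struct fake_rwsem fake_mmap_sem;
static int slow_path_saw_lock;   /* recorded by slow_gup() for checking */

static void fake_down_read(struct fake_rwsem *s) { s->readers++; }
static void fake_up_read(struct fake_rwsem *s) { s->readers--; }

/* Hypothetical slow path (models get_user_pages()): expects the lock
 * to be held, pretends every page is present and returns dummy page
 * addresses 4 KiB apart. */
static int slow_gup(unsigned long start, int nr_pages, void **pages)
{
    slow_path_saw_lock = fake_mmap_sem.readers > 0;
    for (int i = 0; i < nr_pages; i++)
        pages[i] = (void *)(start + (unsigned long)i * 4096UL);
    return nr_pages;
}

/* Generic fallback: same restricted interface as the fast version
 * (current mm only, no vmas returned), implemented by taking the lock
 * for reading and calling the slow path. */
static int generic_fast_gup(unsigned long start, int nr_pages, void **pages)
{
    int ret;

    fake_down_read(&fake_mmap_sem);
    ret = slow_gup(start, nr_pages, pages);
    fake_up_read(&fake_mmap_sem);
    return ret;
}
```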

I haven't done much benchmarking.  The one test I was looking at didn't
show much of a change.

 ==
Introduce a new "fast_gup" (for want of a better name right now) which
is basically a get_user_pages with a less general API that is more suited
to the common case.

- task and mm are always current and current->mm
- force is always 0
- pages is always non-NULL
- don't pass back vmas

This allows (at least on x86) an optimistic lockless pagetable walk,
without taking any page table locks or even mmap_sem. Page table existence
is guaranteed by turning interrupts off (combined with the fact that we're
always looking up the current mm, which would need an IPI before its
pagetables could be shot down from another CPU).

Many other architectures could do the same thing. Those that don't IPI
could potentially RCU free the page tables and do speculative references
on the pages (a la lockless pagecache) to achieve a lockless fast_gup.

Originally by Nick Piggin <[EMAIL PROTECTED]>
---
 arch/x86/lib/Makefile_64     |    2 
 arch/x86/lib/gup_64.c        |  188 +++
 fs/bio.c                     |    8 -
 fs/block_dev.c               |    5 -
 fs/direct-io.c               |   10 --
 fs/splice.c                  |   38 
 include/asm-x86/uaccess_64.h |    4 
 include/linux/mm.h           |   26 +
 include/linux/vmstat.h       |    1 
 mm/vmstat.c                  |    3 
 10 files changed, 231 insertions(+), 54 deletions(-)

diff -Nurp linux-2.6.24-rc4/arch/x86/lib/Makefile_64 linux/arch/x86/lib/Makefile_64
--- linux-2.6.24-rc4/arch/x86/lib/Makefile_64   2007-12-04 08:44:34.0 -0600
+++ linux/arch/x86/lib/Makefile_64  2007-12-10 15:01:17.0 -0600
@@ -10,4 +10,4 @@ obj-$(CONFIG_SMP) += msr-on-cpu.o
 lib-y := csum-partial_64.o csum-copy_64.o csum-wrappers_64.o delay_64.o \
usercopy_64.o getuser_64.o putuser_64.o  \
thunk_64.o clear_page_64.o copy_page_64.o bitstr_64.o bitops_64.o
-lib-y += memcpy_64.o memmove_64.o memset_64.o copy_user_64.o rwlock_64.o copy_user_nocache_64.o
+lib-y += memcpy_64.o memmove_64.o memset_64.o copy_user_64.o rwlock_64.o copy_user_nocache_64.o gup_64.o
diff -Nurp linux-2.6.24-rc4/arch/x86/lib/gup_64.c linux/arch/x86/lib/gup_64.c
--- linux-2.6.24-rc4/arch/x86/lib/gup_64.c  1969-12-31 18:00:00.0 -0600
+++ linux/arch/x86/lib/gup_64.c 2007-12-10 15:01:17.0 -0600
@@ -0,0 +1,188 @@
+/*
+ * Lockless fast_gup for x86
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+#include 
+#include 
+#include 
+#include 
+
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+int write, struct page **pages, int *nr)
+{
+   pte_t *ptep;
+
+   /* XXX: this won't work for 32-bit (must map pte) */
+   ptep = (pte_t *)pmd_page_vaddr(pmd) + pte_index(addr);
+   do {
+   pte_t pte = *ptep;
+   unsigned long pfn;
+   struct page *page;
+
+   if ((pte_val(pte) & (_PAGE_PRESENT|_PAGE_USER)) !=
+   (_PAGE_PRESENT|_PAGE_USER))
+   return 0;
+
+   if (write && !pte_write(pte))
+ 
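The present/user test at the top of gup_pte_range() requires both bits to be set at once, which is why the code masks with both flags and compares the result against both. A stand-alone model of that check, plus the write-permission test that follows it, might look like this. The bit values are assumed to be x86's _PAGE_PRESENT (bit 0), _PAGE_RW (bit 1) and _PAGE_USER (bit 2); `pte_ok_for_gup()` is a hypothetical helper, and since the patch text above is cut off after the write check, the sketch assumes that branch rejects the PTE just as the present/user check does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86 PTE bit layout (low bits only). */
#define FAKE_PAGE_PRESENT (1ULL << 0)
#define FAKE_PAGE_RW      (1ULL << 1)
#define FAKE_PAGE_USER    (1ULL << 2)

/* True if a PTE with value pteval may be used by the lockless walk:
 * it must be both present and user-accessible, and also writable if
 * the caller asked for write access. */
static bool pte_ok_for_gup(uint64_t pteval, bool write)
{
    const uint64_t need = FAKE_PAGE_PRESENT | FAKE_PAGE_USER;

    if ((pteval & need) != need)
        return false;            /* missing either bit rejects the PTE */
    if (write && !(pteval & FAKE_PAGE_RW))
        return false;            /* assumed: read-only PTE, write wanted */
    return true;
}
```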

Re: [RFC] [PATCH] A clean approach to writeout throttling

2007-12-10 Thread Jonathan Corbet
Hey, Daniel,

I'm just getting around to looking at this.  One thing jumped out at me:

> + if (bio->bi_throttle) {
> + struct request_queue *q = bio->bi_queue;
> + bio->bi_throttle = 0; /* or detect multiple endio and err? */
> + atomic_add(bio->bi_throttle, &q->available);
> + wake_up(&q->throttle_wait);
> + }

I'm feeling like I must be really dumb, but...how can that possibly
work?  You're zeroing bio->bi_throttle before adding it back into
q->available, so the latter will never increase...
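One way to get the intended behaviour is to save the count before clearing it and then add the saved value back. The user-space sketch below uses C11 atomics; `struct fake_bio`, `struct fake_queue` and `release_throttle()` are hypothetical stand-ins for the patch's bio and request_queue fields, and the wake_up() call is left as a comment.

```c
#include <stdatomic.h>

struct fake_queue {
    atomic_int available;   /* models q->available */
};

struct fake_bio {
    int bi_throttle;        /* units this bio charged against the queue */
    struct fake_queue *bi_queue;
};

/* Return the bio's throttle charge to its queue exactly once. */
static void release_throttle(struct fake_bio *bio)
{
    if (bio->bi_throttle) {
        struct fake_queue *q = bio->bi_queue;
        int units = bio->bi_throttle;   /* save before clearing */

        bio->bi_throttle = 0;           /* a second call is now a no-op */
        atomic_fetch_add(&q->available, units);
        /* the real code would wake_up(&q->throttle_wait) here */
    }
}
```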

jon


Re: syslets v7: back to basics

2007-12-10 Thread Phillip Susi

Zach Brown wrote:

> The following patches are a substantial refactoring of the syslet code.  I'm
> branding them as the v7 release of the syslet infrastructure, though they
> represent a significant change in focus.
>
> My current focus is to see the most fundamental functionality brought to
> maturity.  To me, this means getting an ABI that is used by applications through
> glibc on x86 and PPC64.  Only once that is ready should we distract ourselves
> with advanced complexity.


I pulled from your tree to look over the patches, and noticed that it 
looks like several commits were merged improperly.  It looks like they 
were auto merged or something from an email, and the commit message 
contains the email headers, rather than just the commit message in the 
body.  This leads to the shortlog showing entries that start with 
"Return-Path:".


I was hoping to find at least some initial information on the overall 
design in Documentation/ but don't see any.  Have you written any yet 
that I could take a look at elsewhere maybe?


Some of the things I was trying to figure out: does each syslet get
its own stack and schedule only at a few well-defined points?  If
so, would it then be fair to characterize them as kernel-mode fibers?



Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]

2007-12-10 Thread Stephen Smalley
On Mon, 2007-12-10 at 21:08 +, David Howells wrote:
> Stephen Smalley <[EMAIL PROTECTED]> wrote:
> 
> > Otherwise, only other issue I have with this interface is it won't
> > generalize to dealing with nfsd, where we want to set the acting context
> > to a context we obtain from or determine based upon the client.
> 
> Are you speaking of security_kernel_act_as() and security_create_files_as()
> specifically?  Or the task_struct::act_as override pointer in general?

security_kernel_act_as()

> I don't really know how nfsd wants to obtain and set its LSM context, so it's
> a bit difficult for me to make something that works for nfsd as well as
> cachefiles.

It would get a context from the client or from a local configuration
that would map security-unaware clients to a default context, and then
want to assume that context for the particular operation.  No transition
involved.

> > Why can't cachefilesd just push a context into the kernel and pass that
> > into the hook as the acting context,
> 
> How does cachefilesd come up with such a context?  Grab it from
> /etc/cachefilesd.conf?

From a config file whose pathname would be provided by libselinux (à la
the way dbusd imports contexts), or directly as a context
returned by a libselinux function.  Has to be done that way so that it
can be set differently for different policy types (strict, targeted,
mls).

Naturally, cachefiles (the kernel module) would invoke a security hook
to check whether the daemon is allowed to set the specified context.

> I used to do that, but someone objected...  Possibly Karl MacMillan.

Yes, but I think I disagreed then too.

> > and then nfsd can do likewise using the context provided by the client or
> > obtained locally from exports for ordinary clients?  Avoids the transition
> > SID computation altogether within the kernel and makes this more generic.
> 
> I seem to remember that I was told that it should be done this way, possibly
> by Karl MacMillan, but I don't remember exactly.
> 
> Now it's configured by cachefilesd.te:
> 
>   type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;

It doesn't fit with how other users of security_kernel_act_as() will
likely want to work (they will want to just set the context to a
specified value, whether one obtained from the client or from some local
source), nor with how type transitions normally work (exec, with the
program type as the second type field).  I think it will just cause
confusion and subtle breakage.

-- 
Stephen Smalley
National Security Agency



[GIT PULL] FireWire update

2007-12-10 Thread Stefan Richter
Linus, please pull from the for-linus branch at

git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6.git for-linus

to receive the following FireWire subsystem update.

This considerably enhances compatibility of the new firewire-ohci driver
with a number of controllers.  It shrinks the list of chips that have trouble
with isochronous reception to VIA VT6306 and some variants of VT6307.

The patch is somewhat big for this late -rc phase, and it has so far only
surfaced in 2.6.24-rc4-mm1 and in recent Fedora test kernels.  But the
author and I did a lot of tests with as many previously working chips as
we could get our hands on (some more than listed below) to make sure that
there is no regression.

 drivers/firewire/fw-ohci.c |  175 +++-
 1 files changed, 155 insertions(+), 20 deletions(-)

Jarod Wilson (1):
  firewire: OHCI 1.0 Isochronous Receive support


Full log and diff:

commit a186b4a6b22fdc96a1ed63da483d267b5d00839e
Author: Jarod Wilson <[EMAIL PROTECTED]>
Date:   Mon Dec 3 13:43:12 2007 -0500

firewire: OHCI 1.0 Isochronous Receive support

Third rendition of FireWire OHCI 1.0 Isochronous Receive support, using a
zero-copy method similar to OHCI 1.1 which puts the IR data payload directly
into the userspace buffer. The zero-copy implementation eliminates the
video artifacts, audio popping, and buffer underrun problems seen with
version 1 of this patch, as well as fixing a regression in OHCI 1.1 support
introduced by version 2 of this patch.

Successfully tested in OHCI 1.1 mode on the following chipsets:

- NEC uPD72847 (rev 01), OHCI 1.1 (PCI)
- Ti XIO2200(A) (rev 01), OHCI 1.1 (PCIe)
- Ti TSB41AB2 (rev 01), OHCI 1.1 (PCI on SB Audigy)
- Apple UniNorth 2 (rev 81), OHCI 1.1 (PowerBook G4 onboard)

Successfully tested in OHCI 1.0 mode on the following chipsets:

- Agere FW323 (rev 06), OHCI 1.0 (Mac Mini onboard)
- Agere FW323 (rev 06), OHCI 1.0 (PCI)
- Via VT6306 (rev 46), OHCI 1.0 (PCI)
- NEC OrangeLink (rev 01), OHCI 1.0 (PCI)
- NEC uPD72847 (rev 01), OHCI 1.1 (PCI)
- Ti XIO2200(A) (rev 01), OHCI 1.1 (PCIe)

The bulk of testing was done on an x86_64 system, but was also successfully
sanity-tested on other systems, including a PPC(32) PowerBook G4 and an i686
EPIA M10k. Crude benchmarking (watching top during capture) puts the CPU
utilization during capture on the EPIA's 1GHz Via C3 processor around 13%,
which is down from 30% with the v1 code.

Some implementation details:

To maintain the same userspace API as dual-buffer mode, we set up two
descriptors for every incoming packet. The first is an INPUT_MORE descriptor,
pointing to a buffer large enough to hold just the packet's iso headers,
immediately followed by an INPUT_LAST descriptor, pointing to a chunk of the
userspace buffer big enough for the packet's data payload. With this setup,
each incoming packet fills in these two descriptors in a manner that very
closely emulates dual-buffer receive, to the point where the bulk of the
handle_ir_* code is now identical between the two (and probably primed for
some restructuring to share code between them).
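The two-descriptors-per-packet layout described above can be modeled in user space to show how the header buffer and the payload area of the user buffer are carved up per packet. This is a sketch under several assumptions: the descriptor struct is simplified (fields in host byte order, no branch or status fields), the INPUT_MORE/INPUT_LAST command values are assumed to follow the OHCI encoding (2 << 12 and 3 << 12), every packet is assumed to get the same payload size, and `build_ir_pair()` and `HEADER_SIZE` are hypothetical. The driver's real code also sets interrupt, status and branch bits that are left out here.

```c
#include <stdint.h>

/* Command values assumed from the OHCI descriptor encoding. */
#define CMD_INPUT_MORE (2u << 12)
#define CMD_INPUT_LAST (3u << 12)

/* Simplified descriptor: only the fields this sketch needs. */
struct ir_descriptor {
    uint16_t control;
    uint16_t req_count;       /* bytes the controller may write */
    uint32_t data_address;    /* bus address of the target buffer */
};

/* Hypothetical size of the per-packet iso header area. */
#define HEADER_SIZE 8u

/* Fill the descriptor pair for packet i: the INPUT_MORE descriptor
 * points at that packet's slot in the header buffer, and the
 * INPUT_LAST descriptor points at that packet's chunk of the user
 * payload buffer. */
static void build_ir_pair(struct ir_descriptor pair[2], unsigned int i,
                          uint32_t header_bus, uint32_t payload_bus,
                          uint16_t payload_size)
{
    pair[0].control = CMD_INPUT_MORE;
    pair[0].req_count = HEADER_SIZE;
    pair[0].data_address = header_bus + i * HEADER_SIZE;

    pair[1].control = CMD_INPUT_LAST;
    pair[1].req_count = payload_size;
    pair[1].data_address = payload_bus + i * payload_size;
}
```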

The only caveat I have at the moment is that neither of my OHCI 1.0 Via
VT6307-based FireWire controllers work particularly well with this code
for reasons I have yet to figure out.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>
Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

diff --git a/drivers/firewire/fw-ohci.c b/drivers/firewire/fw-ohci.c
index c9b9081..436a855 100644
--- a/drivers/firewire/fw-ohci.c
+++ b/drivers/firewire/fw-ohci.c
@@ -437,6 +437,21 @@ static void ar_context_run(struct ar_context *ctx)
flush_writes(ctx->ohci);
 }
 
+static struct descriptor *
+find_branch_descriptor(struct descriptor *d, int z)
+{
+   int b, key;
+
+   b   = (le16_to_cpu(d->control) & DESCRIPTOR_BRANCH_ALWAYS) >> 2;
+   key = (le16_to_cpu(d->control) & DESCRIPTOR_KEY_IMMEDIATE) >> 8;
+
+   /* figure out which descriptor the branch address goes in */
+   if (z == 2 && (b == 3 || key == 2))
+   return d;
+   else
+   return d + z - 1;
+}
+
 static void context_tasklet(unsigned long data)
 {
struct context *ctx = (struct context *) data;
@@ -455,7 +470,7 @@ static void context_tasklet(unsigned long data)
address = le32_to_cpu(last->branch_address);
z = address & 0xf;
d = ctx->buffer + (address - ctx->buffer_bus) / sizeof(*d);
-   last = (z == 2) ? d : d + z - 1;
+   last = find_branch_descriptor(d, z);
 
if (!ctx->callback(ctx, d, last))
break;
@@ -566,7 +581,7 @@ static void context_append(struct context *ctx,
 
ctx->head_descriptor = d + z + extra;
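The branch-address decoding in context_tasklet() and the new find_branch_descriptor() helper can be exercised outside the kernel. In OHCI, the low four bits of a branch address carry Z, the number of 16-byte descriptor slots in the next block, and the remaining bits locate the block's first descriptor. In this sketch the descriptor layout is modeled on fw-ohci.c's 16-byte `struct descriptor`, the control word is kept in host byte order (the driver converts it with le16_to_cpu()), the two flag values are assumed to match DESCRIPTOR_BRANCH_ALWAYS (3 << 2) and DESCRIPTOR_KEY_IMMEDIATE (2 << 8), and `decode_branch()` is a hypothetical helper.

```c
#include <stdint.h>

#define BRANCH_ALWAYS (3u << 2)   /* assumed DESCRIPTOR_BRANCH_ALWAYS */
#define KEY_IMMEDIATE (2u << 8)   /* assumed DESCRIPTOR_KEY_IMMEDIATE */

struct descriptor {
    uint16_t req_count;
    uint16_t control;             /* host byte order in this sketch */
    uint32_t data_address;
    uint32_t branch_address;
    uint16_t res_count;
    uint16_t transfer_status;
};

/* Same logic as find_branch_descriptor(): with Z == 2, a block whose
 * first descriptor is immediate (or marked branch-always) keeps its
 * branch address in that first descriptor; otherwise the branch
 * address lives in the last of the Z descriptors. */
static struct descriptor *find_branch(struct descriptor *d, int z)
{
    int b = (d->control & BRANCH_ALWAYS) >> 2;
    int key = (d->control & KEY_IMMEDIATE) >> 8;

    if (z == 2 && (b == 3 || key == 2))
        return d;
    return d + z - 1;
}

/* Split a branch address into Z and the block's first descriptor, as
 * context_tasklet() does. Integer division by the 16-byte descriptor
 * size discards the Z bits, assuming buffer_bus is 16-byte aligned. */
static struct descriptor *decode_branch(uint32_t address,
                                        struct descriptor *buffer,
                                        uint32_t buffer_bus, int *z)
{
    *z = address & 0xf;
    return buffer + (address - buffer_bus) / sizeof(*buffer);
}
```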
