Re: Question about Reiser4

2007-04-23 Thread Theodore Tso
On Mon, Apr 23, 2007 at 06:52:16AM -0700, Eric Hopper wrote:
 Oh, two things really interest me about Reiser4.  First, I despise
 having to care about how many tiny files I leave lying around when
 writing a program.  Berkeley DB and its ilk are evil, evil programs that
 obscure data and make things harder.  Secondly, the moves Reiser4 has
 made towards having actual transactions at the filesystem level also
 intrigue me.
 
 I want to use the filesystem as a DB.  IMHO, there is no reason that
 filesystems shouldn't be a DB sans query language.  If there were a more
 DB-like way to deal with filesystems, I think that it would be that much
 easier to make something that was a decent replacement for NFS and
 actually worked.

One of the big problems of using a filesystem as a DB is the system
call overheads.  If you use huge numbers of tiny files, then each
attempt to read an atom of information from the DB takes three system
calls --- an open(), read(), and close(), with all of the overheads in
terms of dentry and inode cache.
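
To make the per-record cost concrete, here is a minimal userspace sketch.
It is an illustration only, not code from Reiser4 or from any database
library; read_atom() and write_atom() are hypothetical helper names, and
the one-file-per-record layout is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: fetch one small record stored as its own file.
 * Returns the number of bytes read, or -1 on error.
 * Every call costs three system calls.
 */
static ssize_t read_atom(const char *path, char *buf, size_t len)
{
	int fd = open(path, O_RDONLY);	/* syscall 1: path lookup, dentry/inode cache */
	if (fd < 0)
		return -1;

	ssize_t n = read(fd, buf, len);	/* syscall 2: copy the data to userspace */

	close(fd);			/* syscall 3: release the descriptor */
	return n;
}

/* Hypothetical helper used only to create a record for the demo. */
static int write_atom(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	size_t len = strlen(val);
	ssize_t n = write(fd, val, len);

	close(fd);
	return n == (ssize_t)len ? 0 : -1;
}
```

Reading N such records this way takes 3*N system calls, whereas a
database that keeps many records in one file pays the open() and close()
once.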

Hans of course had a solution to this problem --- namely the
sys_reiser4 system call, where you download a program to the kernel to
execute an open/read/close via a single system call, and which returns
the combined results to userspace.  But now you have more complexity
since there is now a reiser4-specific interpreter embedded in the
kernel, the userspace application needs to write the equivalent of a
channel program such as what was found in an IBM/360 mainframe (need I
mention this can be a rich source of security bugs), and then the
userspace application *still* needs to parse the result returned by
the sys_reiser4() system call.

So it adds a huge amount of complexity, and at the end of the day,
given that you don't have the search capability, it is (a) less
functional, (b) more complicated, and (c) probably less performant
than simply calling out to a database.

 Sadly, unless someone pays me to maintain it, I can't do the fork
 myself, and I likely wouldn't anyway as being a kernel hacker of
 something as important as a filesystem is a full-time job and I have
 other things that interest me a lot more.

Unfortunately, the way OSS works is that you either (a) have to do the
work yourself, (b) convince someone else to do the work, or (c)
convince someone that it's worth paying you to do it.

Personally, if I controlled a large budget for Linux filesystem
development, I'd put a lot more money into something like Val's
chunkfs idea than reiser4.  Being able to have filesystems designed
for fast recovery given disks getting larger and larger (but not more
reliable), is a whole lot more important than trying to create an
alternate solution to an already solved problem --- namely that of a
database.  When you consider that a similar idea, WinFS, was partially
responsible for delaying Vista by years due to the complexity of
shoving a database where it has no place being, it's another reason
why I personally think that chunkfs is a much more promising avenue
for future filesystem investment than reiserfs.

But hey, the advantage of Open Source is that if *you* want to work on
Reiser4, you're perfectly free to do so.  My personal opinion is that
it'd be a waste of your time, but you're free to spend your time
whichever way that you want.  What you don't get to do is whine about how
other people get to spend *their* time, or *their* money.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20.7 locking up hard on boot

2007-04-23 Thread Marcos Pinto

I'm honestly not sure how to try what you suggested to try, since I'm
nothing even remotely close to a kernel geek and it was over my head.
However, I'd gladly test anything that you think would be worth
testing, if you would please put it in a way that I could understand,
such as change line 'foo' in probe.c into 'foolio'.

Thanks again for all of your help,
Marcos

On 4/23/07, Jan Beulich [EMAIL PROTECTED] wrote:

Given that all of the reports are in cases when the adjustment is *not*
being done (and only a message is being printed), I can only assume that
the breakage results from the adding of PCI_BASE_ADDRESS_SPACE_IO
into the resource flags. I considered this unconditional setting of the flags
odd already in the original code, and added this extra flag only for
consistency reasons (because the settings reported by X indicated that
this was missing). Perhaps the adjustment (original and the added
extra flag) shouldn't be done if IORESOURCE_IO wasn't already set.
Perhaps one of those seeing the issue could try out returning from the
function right after that printk(), without any adjustment to the flags.

Jan





Re: [PATCH 04/25] xen: Add XEN config options

2007-04-23 Thread Andi Kleen
On Monday 23 April 2007 23:56:42 Jeremy Fitzhardinge wrote:
 The XEN config option enables the Xen paravirt_ops interface, which is
 installed when the kernel finds itself running under Xen.
 
 Xen is no longer a sub-architecture, so the X86_XEN subarch config
 option has gone.
 
 Xen is currently incompatible with PREEMPT, but this is fixed up later
 in the series.

Shouldn't this be after the change that adds arch/i386/xen/Kconfig?

Otherwise you break bisects

-Andi


Re: [RFC][PATCH -mm 2/3] freezer: Introduce freezer_flags

2007-04-23 Thread Rafael J. Wysocki
On Tuesday, 24 April 2007 00:55, Oleg Nesterov wrote:
 On 04/24, Rafael J. Wysocki wrote:
 
  Should I clear it in dup_task_struct() or is there a better place?
 
 I personally think we should do this in dup_task_struct(). In fact, I believe
 it is better to replace the
 
   *tsk = *orig;
 
 with some helper (like setup_thread_stack() below), and that helper clears
 ->freezer_flags. Say, copy_task_struct().

Hmm, wouldn't that be overkill?  copy_task_struct() would have to do
*tsk = *orig anyway, and we only need to clear one field apart from this.

Some other fields are cleared towards the end of dup_task_struct(), so perhaps
we could clear freezer_flags in there too?
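
A minimal userspace sketch of the "copy everything, then clear the one
field" approach follows. It is not the kernel's dup_task_struct(); struct
toy_task and toy_dup_task() are hypothetical stand-ins, and malloc()
replaces the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for task_struct; the fields are illustrative only. */
struct toy_task {
	int pid;
	unsigned int freezer_flags;
};

/*
 * Copy the parent wholesale, then reset the one field that the child
 * must not inherit.  Returns NULL if the allocation fails.
 */
static struct toy_task *toy_dup_task(const struct toy_task *orig)
{
	struct toy_task *tsk = malloc(sizeof(*tsk));

	if (!tsk)
		return NULL;

	*tsk = *orig;			/* the whole-struct copy */
	tsk->freezer_flags = 0;		/* the single extra field to clear */
	return tsk;
}
```

With only one field to reset, the assignment sits next to the other
post-copy fixups and no separate helper is needed.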

Rafael


Re: Prevent softlockup triggering in nvidiafb

2007-04-23 Thread Antonino A. Daplas
 On Mon, Apr 23, 2007 at 04:55:05PM +0100, Alan Cox wrote:
   On Mon, 23 Apr 2007 11:36:30 -0400
   Dave Jones [EMAIL PROTECTED] wrote:
  
If the chip locks up, we get into a long polling loop,
where the softlockup detector kicks in.
See https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=151878
 for an example.
  
   Surely in this situation the softlockup report and trap out is
 precisely what should be occurring.

 We can't do anything useful with the trace.  It already prints out info
 that the hardware locked up.

And when nvidiafb detects a lockup, it will go to safe mode. Better than
rebooting, I think.

Tony





[PATCH 04/25] xen: Add XEN config options

2007-04-23 Thread Jeremy Fitzhardinge
The XEN config option enables the Xen paravirt_ops interface, which is
installed when the kernel finds itself running under Xen.

Xen is no longer a sub-architecture, so the X86_XEN subarch config
option has gone.

Xen is currently incompatible with PREEMPT, but this is fixed up later
in the series.

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Signed-off-by: Ian Pratt [EMAIL PROTECTED]
Signed-off-by: Christian Limpach [EMAIL PROTECTED]
Signed-off-by: Chris Wright [EMAIL PROTECTED]

---
 arch/i386/Kconfig |2 ++
 arch/i386/xen/Kconfig |   10 ++
 2 files changed, 12 insertions(+)

===
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -216,6 +216,8 @@ config PARAVIRT
  under a hypervisor, improving performance significantly.
  However, when run without a hypervisor the kernel is
  theoretically slower.  If in doubt, say N.
+
+source "arch/i386/xen/Kconfig"
 
 config VMI
bool "VMI Paravirt-ops support"
===
--- /dev/null
+++ b/arch/i386/xen/Kconfig
@@ -0,0 +1,10 @@
+#
+# This Kconfig describes xen options
+#
+
+config XEN
+   bool "Enable support for Xen hypervisor"
+   depends on PARAVIRT && HZ_100 && !PREEMPT && !NO_HZ
+   default y
+   help
+ This is the Linux Xen port.

-- 



[PATCH 08/25] xen: xen: fix multicall batching

2007-04-23 Thread Jeremy Fitzhardinge
Disable interrupts between allocating a multicall entry and actually
issuing it, to prevent an interrupt from coming in, allocating and
initializing further multicall entries, and then issuing them all,
including the partially completed one.
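
The race can be illustrated with a toy userspace simulation. This is not
the Xen multicall code: every toy_* name is hypothetical, the "interrupt"
is an ordinary function call placed at the worst possible moment, and a
boolean stands in for disabling interrupts.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MC_MAX 8

struct toy_mc_entry {
	int op;
	bool complete;		/* false until the entry is fully initialized */
};

static struct toy_mc_entry toy_batch[TOY_MC_MAX];
static int toy_batch_len;
static bool toy_irqs_off;
static int toy_partial_issued;	/* entries issued before they were filled in */

/* Issue everything queued so far, counting half-built entries. */
static void toy_mc_flush(void)
{
	for (int i = 0; i < toy_batch_len; i++)
		if (!toy_batch[i].complete)
			toy_partial_issued++;
	toy_batch_len = 0;
}

/* Simulated interrupt handler: queues its own entry and issues the batch. */
static void toy_interrupt(void)
{
	if (toy_irqs_off)
		return;		/* masked; real hardware would deliver it later */

	toy_batch[toy_batch_len++] =
		(struct toy_mc_entry){ .op = 99, .complete = true };
	toy_mc_flush();
}

/* Queue and issue one operation, optionally with "interrupts" masked. */
static void toy_queue_op(int op, bool protect)
{
	toy_irqs_off = protect;

	struct toy_mc_entry *e = &toy_batch[toy_batch_len++];	/* allocate */
	e->complete = false;

	toy_interrupt();	/* arrives between allocation and initialization */

	e->op = op;		/* initialize */
	e->complete = true;
	toy_mc_flush();		/* issue */

	toy_irqs_off = false;
}
```

Calling toy_queue_op(1, false) leaves toy_partial_issued at 1 because the
handler flushed a half-built entry; with protect set to true the count
does not increase.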

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]

---
 arch/i386/xen/enlighten.c  |   44 +++-
 arch/i386/xen/mmu.c|   18 --
 arch/i386/xen/multicalls.c |9 -
 arch/i386/xen/multicalls.h |   27 +++
 arch/i386/xen/xen-ops.h|5 +
 5 files changed, 71 insertions(+), 32 deletions(-)

===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -160,13 +160,25 @@ static void xen_halt(void)
 
 static void xen_set_lazy_mode(enum paravirt_lazy_mode mode)
 {
-   enum paravirt_lazy_mode *lazy = &get_cpu_var(xen_lazy_mode);
+   switch(mode) {
+   case PARAVIRT_LAZY_NONE:
+   BUG_ON(x86_read_percpu(xen_lazy_mode) == PARAVIRT_LAZY_NONE);
+   break;
+
+   case PARAVIRT_LAZY_MMU:
+   case PARAVIRT_LAZY_CPU:
+   BUG_ON(x86_read_percpu(xen_lazy_mode) != PARAVIRT_LAZY_NONE);
+   break;
+
+   case PARAVIRT_LAZY_FLUSH:
+   /* flush if necessary, but don't change state */
+   if (x86_read_percpu(xen_lazy_mode) != PARAVIRT_LAZY_NONE)
+   xen_mc_flush();
+   return;
+   }
 
xen_mc_flush();
-
-   *lazy = mode;
-
-   put_cpu_var(xen_lazy_mode);
+   x86_write_percpu(xen_lazy_mode, mode);
 }
 
 static unsigned long xen_store_tr(void)
@@ -193,7 +208,7 @@ static void xen_set_ldt(const void *addr
 
MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-   xen_mc_issue();
+   xen_mc_issue(PARAVIRT_LAZY_CPU);
 }
 
 static void xen_load_gdt(const struct Xgt_desc_struct *dtr)
@@ -217,7 +232,7 @@ static void xen_load_gdt(const struct Xg
 
MULTI_set_gdt(mcs.mc, frames, size/8);
 
-   xen_mc_issue();
+   xen_mc_issue(PARAVIRT_LAZY_CPU);
 }
 
 static void load_TLS_descriptor(struct thread_struct *t,
@@ -225,18 +240,20 @@ static void load_TLS_descriptor(struct t
 {
struct desc_struct *gdt = get_cpu_gdt_table(cpu);
xmaddr_t maddr = virt_to_machine(&gdt[GDT_ENTRY_TLS_MIN+i]);
-   struct multicall_space mc = xen_mc_entry(0);
+   struct multicall_space mc = __xen_mc_entry(0);
 
MULTI_update_descriptor(mc.mc, maddr.maddr, t->tls_array[i]);
 }
 
 static void xen_load_tls(struct thread_struct *t, unsigned int cpu)
 {
+   xen_mc_batch();
+
load_TLS_descriptor(t, cpu, 0);
load_TLS_descriptor(t, cpu, 1);
load_TLS_descriptor(t, cpu, 2);
 
-   xen_mc_issue();
+   xen_mc_issue(PARAVIRT_LAZY_CPU);
 }
 
 static void xen_write_ldt_entry(struct desc_struct *dt, int entrynum, u32 low, u32 high)
@@ -356,13 +373,9 @@ static void xen_load_esp0(struct tss_str
 static void xen_load_esp0(struct tss_struct *tss,
   struct thread_struct *thread)
 {
-   if (xen_get_lazy_mode() != PARAVIRT_LAZY_CPU) {
-   if (HYPERVISOR_stack_switch(__KERNEL_DS, thread->esp0))
-   BUG();
-   } else {
-   struct multicall_space mcs = xen_mc_entry(0);
-   MULTI_stack_switch(mcs.mc, __KERNEL_DS, thread->esp0);
-   }
+   struct multicall_space mcs = xen_mc_entry(0);
+   MULTI_stack_switch(mcs.mc, __KERNEL_DS, thread->esp0);
+   xen_mc_issue(PARAVIRT_LAZY_CPU);
 }
 
 static void xen_set_iopl_mask(unsigned mask)
@@ -452,7 +465,7 @@ static void xen_write_cr3(unsigned long 
 
MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-   xen_mc_issue();
+   xen_mc_issue(PARAVIRT_LAZY_CPU);
}
 }
 
===
--- a/arch/i386/xen/mmu.c
+++ b/arch/i386/xen/mmu.c
@@ -344,7 +344,7 @@ static int pin_page(struct page *page, u
else {
void *pt = lowmem_page_address(page);
unsigned long pfn = page_to_pfn(page);
-   struct multicall_space mcs = xen_mc_entry(0);
+   struct multicall_space mcs = __xen_mc_entry(0);
 
flush = 0;
 
@@ -364,10 +364,12 @@ void xen_pgd_pin(pgd_t *pgd)
struct multicall_space mcs;
struct mmuext_op *op;
 
+   xen_mc_batch();
+
if (pgd_walk(pgd, pin_page, TASK_SIZE))
kmap_flush_unused();
 
-   mcs = xen_mc_entry(sizeof(*op));
+   mcs = __xen_mc_entry(sizeof(*op));
op = mcs.args;
 
 #ifdef CONFIG_X86_PAE
@@ -378,7 +380,7 @@ void xen_pgd_pin(pgd_t *pgd)
op->arg1.mfn = pfn_to_mfn(PFN_DOWN(__pa(pgd)));
MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-   xen_mc_flush();
+   xen_mc_issue(0);
 }
 
 /* The init_mm pagetable is really pinned as soon as its 

[PATCH 10/25] xen: Implement xen_sched_clock

2007-04-23 Thread Jeremy Fitzhardinge
Implement xen_sched_clock, which returns the number of ns the current
vcpu has been actually in the running state (vs blocked,
runnable-but-not-running, or offline) since boot.
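
The arithmetic in the patch body below can be sketched in plain C. This
is a userspace illustration, not the Xen interface: struct toy_runstate
and the RS_* indices are hypothetical stand-ins for vcpu_runstate_info
and the RUNSTATE_* constants, and the current time is passed in rather
than read from a clocksource. Like the patch code, it sums the blocked
and running totals and adds the time spent in the current state.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative state indices; the real Xen values are not assumed here. */
enum { RS_RUNNING, RS_RUNNABLE, RS_BLOCKED, RS_OFFLINE, RS_COUNT };

struct toy_runstate {
	int state;			/* state the vcpu is in right now */
	uint64_t state_entry_time;	/* ns timestamp when that state began */
	uint64_t time[RS_COUNT];	/* ns accumulated per state before that */
};

/* Unstolen nanoseconds: blocked + running + time in the current state. */
static uint64_t toy_sched_clock(const struct toy_runstate *rs, uint64_t now)
{
	return rs->time[RS_BLOCKED] +
	       rs->time[RS_RUNNING] +
	       (now - rs->state_entry_time);
}
```

For example, with 500 ns running, 200 ns blocked, 300 ns runnable, and
the current running period entered 100 ns ago, the result is 800 ns; the
300 ns spent runnable-but-not-running is excluded.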

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: john stultz [EMAIL PROTECTED]

---
 arch/i386/xen/enlighten.c |2 +-
 arch/i386/xen/time.c  |   22 +-
 arch/i386/xen/xen-ops.h   |3 +--
 3 files changed, 23 insertions(+), 4 deletions(-)

===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -676,7 +676,7 @@ static const struct paravirt_ops xen_par
.set_wallclock = xen_set_wallclock,
.get_wallclock = xen_get_wallclock,
.get_cpu_khz = xen_cpu_khz,
-   .sched_clock = xen_clocksource_read,
+   .sched_clock = xen_sched_clock,
 
 #ifdef CONFIG_X86_LOCAL_APIC
.apic_write = paravirt_nop,
===
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -16,6 +16,8 @@
 #define XEN_SHIFT 22
 #define TIMER_SLOP 10  /* Xen may fire a timer up to this many ns early */
 #define NS_PER_TICK(10ll / HZ)
+
+static cycle_t xen_clocksource_read(void);
 
 /* These are perodically updated in shared_info, and then copied here. */
 struct shadow_time_info {
@@ -118,6 +120,24 @@ static void do_stolen_accounting(void)
account_steal_time(idle_task(smp_processor_id()), ticks);
 }
 
+/*
+ * Xen sched_clock implementation.  Returns the number of unstolen
+ * nanoseconds, which is nanoseconds the VCPU spent in RUNNING+BLOCKED
+ * states.
+ */
+unsigned long long xen_sched_clock(void)
+{
+   struct vcpu_runstate_info state;
+   cycle_t now = xen_clocksource_read();
+
+   get_runstate_snapshot(&state);
+
+   WARN_ON(state.state != RUNSTATE_running);
+
+   return state.time[RUNSTATE_blocked] +
+   state.time[RUNSTATE_running] +
+   (now - state.state_entry_time);
+}
 
 
 /* Get the CPU speed from Xen */
@@ -209,7 +229,7 @@ static u64 get_nsec_offset(struct shadow
return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
 }
 
-cycle_t xen_clocksource_read(void)
+static cycle_t xen_clocksource_read(void)
 {
struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
cycle_t ret;
===
--- a/arch/i386/xen/xen-ops.h
+++ b/arch/i386/xen/xen-ops.h
@@ -2,7 +2,6 @@
 #define XEN_OPS_H
 
 #include <linux/init.h>
-#include <linux/clocksource.h>
 
 DECLARE_PER_CPU(struct vcpu_info *, xen_vcpu);
 DECLARE_PER_CPU(unsigned long, xen_cr3);
@@ -18,7 +17,7 @@ void __init xen_time_init(void);
 void __init xen_time_init(void);
 unsigned long xen_get_wallclock(void);
 int xen_set_wallclock(unsigned long time);
-cycle_t xen_clocksource_read(void);
+unsigned long long xen_sched_clock(void);
 
 void xen_mark_init_mm_pinned(void);
 

-- 



[PATCH 01/25] xen: Add apply_to_page_range() which applies a function to a pte range.

2007-04-23 Thread Jeremy Fitzhardinge
Add a new mm function apply_to_page_range() which applies a given
function to every pte in a given virtual address range in a given mm
structure. This is a generic alternative to cut-and-pasting the Linux
idiomatic pagetable walking code in every place that a sequence of
PTEs must be accessed.

Although this interface is intended to be useful in a wide range of
situations, it is currently used specifically by several Xen
subsystems, for example: to ensure that pagetables have been allocated
for a virtual address range, and to construct batched special
pagetable update requests to map I/O memory (in ioremap()).
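
The calling convention can be sketched in userspace with a flat array
standing in for the page tables. This is an illustration under
simplifying assumptions, not the kernel function: all toy_* names are
hypothetical, there is a single table level, no allocation or locking,
and the callback omits the pmd_page argument.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_NPAGES    16

typedef unsigned long toy_pte_t;
typedef int (*toy_pte_fn_t)(toy_pte_t *pte, unsigned long addr, void *data);

static toy_pte_t toy_table[TOY_NPAGES];	/* one "pte" per page */

/*
 * Call fn on the entry for every page in [addr, addr + size).
 * Stops at, and returns, the first non-zero value from fn.
 * Returns -1 for an empty, unaligned, or out-of-range request.
 */
static int toy_apply_to_range(unsigned long addr, unsigned long size,
			      toy_pte_fn_t fn, void *data)
{
	unsigned long end = addr + size;
	int err = 0;

	if (addr >= end || end > TOY_NPAGES * TOY_PAGE_SIZE)
		return -1;
	if ((addr | size) % TOY_PAGE_SIZE)
		return -1;

	do {
		err = fn(&toy_table[addr / TOY_PAGE_SIZE], addr, data);
		if (err)
			break;
	} while (addr += TOY_PAGE_SIZE, addr < end);

	return err;
}

/* Example callback: record the address in the entry and count the visit. */
static int toy_count_and_mark(toy_pte_t *pte, unsigned long addr, void *data)
{
	*pte = addr | 1UL;
	++*(int *)data;
	return 0;
}
```

A caller passes its own state through the data pointer, so one walker
serves many users; here the callback counts how many entries it touched.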

Signed-off-by: Ian Pratt [EMAIL PROTECTED]
Signed-off-by: Christian Limpach [EMAIL PROTECTED]
Signed-off-by: Chris Wright [EMAIL PROTECTED]
Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: Christoph Lameter [EMAIL PROTECTED]
Cc: Matt Mackall [EMAIL PROTECTED]
Acked-by: Ingo Molnar [EMAIL PROTECTED] 

---
 include/linux/mm.h |5 ++
 mm/memory.c|   94 
 2 files changed, 99 insertions(+)

===
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1135,6 +1135,11 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET   0x04/* do get_page on page */
 #define FOLL_ANON  0x08/* give ZERO_PAGE if no pgtable */
 
+typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
+   void *data);
+extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
+  unsigned long size, pte_fn_t fn, void *data);
+
 #ifdef CONFIG_PROC_FS
 void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
 #else
===
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1448,6 +1448,100 @@ int remap_pfn_range(struct vm_area_struc
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
+static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pte_t *pte;
+   int err;
+   struct page *pmd_page;
+   spinlock_t *ptl;
+
+   pte = (mm == &init_mm) ?
+   pte_alloc_kernel(pmd, addr) :
+   pte_alloc_map_lock(mm, pmd, addr, &ptl);
+   if (!pte)
+   return -ENOMEM;
+
+   BUG_ON(pmd_huge(*pmd));
+
+   pmd_page = pmd_page(*pmd);
+
+   do {
+   err = fn(pte, pmd_page, addr, data);
+   if (err)
+   break;
+   } while (pte++, addr += PAGE_SIZE, addr != end);
+
+   if (mm != &init_mm)
+   pte_unmap_unlock(pte-1, ptl);
+   return err;
+}
+
+static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pmd_t *pmd;
+   unsigned long next;
+   int err;
+
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return -ENOMEM;
+   do {
+   next = pmd_addr_end(addr, end);
+   err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pmd++, addr = next, addr != end);
+   return err;
+}
+
+static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pud_t *pud;
+   unsigned long next;
+   int err;
+
+   pud = pud_alloc(mm, pgd, addr);
+   if (!pud)
+   return -ENOMEM;
+   do {
+   next = pud_addr_end(addr, end);
+   err = apply_to_pmd_range(mm, pud, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pud++, addr = next, addr != end);
+   return err;
+}
+
+/*
+ * Scan a region of virtual memory, filling in page tables as necessary
+ * and calling a provided function on each leaf page table.
+ */
+int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
+   unsigned long size, pte_fn_t fn, void *data)
+{
+   pgd_t *pgd;
+   unsigned long next;
+   unsigned long end = addr + size;
+   int err;
+
+   BUG_ON(addr >= end);
+   pgd = pgd_offset(mm, addr);
+   do {
+   next = pgd_addr_end(addr, end);
+   err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pgd++, addr = next, addr != end);
+   return err;
+}
+EXPORT_SYMBOL_GPL(apply_to_page_range);
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry
  * which was read non-atomically.  Before making any commitment, on

-- 


[PATCH 24/25] xen: xen: diddle netfront

2007-04-23 Thread Jeremy Fitzhardinge
Move things around a bit to match xen-unstable netfront.

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]

---
 drivers/net/xen-netfront.c |   36 +---
 1 file changed, 17 insertions(+), 19 deletions(-)

===
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -750,19 +750,6 @@ no_skb:
notify_remote_via_irq(np->irq);
 }
 
-static void xennet_move_rx_slot(struct netfront_info *np, struct sk_buff *skb,
-   grant_ref_t ref)
-{
-   int new = xennet_rxidx(np->rx.req_prod_pvt);
-
-   BUG_ON(np->rx_skbs[new]);
-   np->rx_skbs[new] = skb;
-   np->grant_rx_ref[new] = ref;
-   RING_GET_REQUEST(&np->rx, np->rx.req_prod_pvt)->id = new;
-   RING_GET_REQUEST(&np->rx, np->rx.req_prod_pvt)->gref = ref;
-   np->rx.req_prod_pvt++;
-}
-
 static void xennet_make_frags(struct sk_buff *skb, struct net_device *dev,
  struct netif_tx_request *tx)
 {
@@ -944,6 +931,19 @@ static irqreturn_t netif_int(int irq, vo
spin_unlock_irqrestore(&np->tx_lock, flags);
 
return IRQ_HANDLED;
+}
+
+static void xennet_move_rx_slot(struct netfront_info *np, struct sk_buff *skb,
+   grant_ref_t ref)
+{
+   int new = xennet_rxidx(np->rx.req_prod_pvt);
+
+   BUG_ON(np->rx_skbs[new]);
+   np->rx_skbs[new] = skb;
+   np->grant_rx_ref[new] = ref;
+   RING_GET_REQUEST(&np->rx, np->rx.req_prod_pvt)->id = new;
+   RING_GET_REQUEST(&np->rx, np->rx.req_prod_pvt)->gref = ref;
+   np->rx.req_prod_pvt++;
 }
 
 static void handle_incoming_queue(struct net_device *dev, struct sk_buff_head *rxq)
@@ -1169,7 +1169,8 @@ static RING_IDX xennet_fill_frags(struct
return cons;
 }
 
-static int xennet_set_skb_gso(struct sk_buff *skb, struct netif_extra_info *gso)
+static int xennet_set_skb_gso(struct sk_buff *skb,
+ struct netif_extra_info *gso)
 {
if (!gso->u.gso.size) {
if (net_ratelimit())
@@ -1456,11 +1457,8 @@ static void netif_release_rx_bufs(struct
 
if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Do all the remapping work and M2P updates. */
-   mcl->op = __HYPERVISOR_mmu_update;
-   mcl->args[0] = (unsigned long)np->rx_mmu;
-   mcl->args[1] = mmu - np->rx_mmu;
-   mcl->args[2] = 0;
-   mcl->args[3] = DOMID_SELF;
+   MULTI_mmu_update(mcl, np->rx_mmu, mmu - np->rx_mmu,
+0, DOMID_SELF);
mcl++;
HYPERVISOR_multicall(np->rx_mcl, mcl - np->rx_mcl);
}

-- 



[PATCH 25/25] xen: Xen machine operations

2007-04-23 Thread Jeremy Fitzhardinge
Make the appropriate hypercalls to halt and reboot the virtual machine.

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]

---
 arch/i386/xen/enlighten.c |   43 +++
 arch/i386/xen/smp.c   |4 +---
 2 files changed, 44 insertions(+), 3 deletions(-)

===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -14,6 +14,7 @@
 
 #include <xen/interface/xen.h>
 #include <xen/interface/vcpu.h>
+#include <xen/interface/sched.h>
 #include <xen/features.h>
 #include <xen/page.h>
 
@@ -28,6 +29,7 @@
 #include <asm/pgtable.h>
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
+#include <asm/reboot.h>
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -787,6 +789,45 @@ static const struct smp_ops xen_smp_ops 
 };
 #endif /* CONFIG_SMP */
 
+static void xen_reboot(int reason)
+{
+#ifdef CONFIG_SMP
+   smp_send_stop();
+#endif
+
+   if (HYPERVISOR_sched_op(SCHEDOP_shutdown, reason))
+   BUG();
+}
+
+static void xen_restart(char *msg)
+{
+   xen_reboot(SHUTDOWN_reboot);
+}
+
+static void xen_emergency_restart(void)
+{
+   xen_reboot(SHUTDOWN_reboot);
+}
+
+static void xen_machine_halt(void)
+{
+   xen_reboot(SHUTDOWN_poweroff);
+}
+
+static void xen_crash_shutdown(struct pt_regs *regs)
+{
+   xen_reboot(SHUTDOWN_crash);
+}
+
+static const struct machine_ops __initdata xen_machine_ops = {
+   .restart = xen_restart,
+   .halt = xen_machine_halt,
+   .power_off = xen_machine_halt,
+   .shutdown = xen_machine_halt,
+   .crash_shutdown = xen_crash_shutdown,
+   .emergency_restart = xen_emergency_restart,
+};
+
 /* First C function to be called on Xen boot */
 static asmlinkage void __init xen_start_kernel(void)
 {
@@ -800,6 +841,8 @@ static asmlinkage void __init xen_start_
 
/* Install Xen paravirt ops */
paravirt_ops = xen_paravirt_ops;
+   machine_ops = xen_machine_ops;
+
 #ifdef CONFIG_SMP
smp_ops = xen_smp_ops;
 #endif
===
--- a/arch/i386/xen/smp.c
+++ b/arch/i386/xen/smp.c
@@ -303,9 +303,7 @@ static void stop_self(void *v)
 
 void xen_smp_send_stop(void)
 {
-   cpumask_t mask = cpu_online_map;
-   cpu_clear(smp_processor_id(), mask);
-   xen_smp_call_function_mask(mask, stop_self, NULL, 0);
+   smp_call_function(stop_self, NULL, 0, 0);
 }
 
 void xen_smp_send_reschedule(int cpu)

-- 



[PATCH 21/25] xen: Add the Xen virtual network device driver.

2007-04-23 Thread Jeremy Fitzhardinge
The network device frontend driver allows the kernel to access network
devices exported by a virtual machine containing a physical
network device driver.

Signed-off-by: Ian Pratt [EMAIL PROTECTED]
Signed-off-by: Christian Limpach [EMAIL PROTECTED]
Signed-off-by: Chris Wright [EMAIL PROTECTED]
Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Cc: Jeff Garzik [EMAIL PROTECTED]
Cc: Stephen Hemminger [EMAIL PROTECTED]
---
 drivers/net/Kconfig|   12 
 drivers/net/Makefile   |2 
 drivers/net/xen-netfront.c | 1957 
 3 files changed, 1971 insertions(+)

===
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2508,6 +2508,18 @@ source drivers/atm/Kconfig
 
 source "drivers/s390/net/Kconfig"
 
+config XEN_NETDEV_FRONTEND
+   tristate "Xen network device frontend driver"
+   depends on XEN
+   default y
+   help
+ The network device frontend driver allows the kernel to
+ access network devices exported by a virtual
+ machine containing a physical network device driver. The
+ frontend driver is intended for unprivileged guest domains;
+ if you are compiling a kernel for a Xen guest, you almost
+ certainly want to enable this.
+
 config ISERIES_VETH
tristate "iSeries Virtual Ethernet driver support"
depends on PPC_ISERIES
===
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -218,3 +218,5 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_FS_ENET) += fs_enet/
 
 obj-$(CONFIG_NETXEN_NIC) += netxen/
+
+obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
===
--- /dev/null
+++ b/drivers/net/xen-netfront.c
@@ -0,0 +1,1957 @@
+/**
+ * Virtual network driver for conversing with remote driver backends.
+ *
+ * Copyright (c) 2002-2005, K A Fraser
+ * Copyright (c) 2005, XenSource Ltd
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/skbuff.h>
+#include <linux/ethtool.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/moduleparam.h>
+#include <linux/mm.h>
+#include <xen/xenbus.h>
+#include <xen/interface/io/netif.h>
+#include <xen/interface/memory.h>
+#ifdef CONFIG_XEN_BALLOON
+#include <xen/balloon.h>
+#endif
+#include <xen/interface/grant_table.h>
+
+#include <xen/events.h>
+#include <xen/page.h>
+#include <xen/grant_table.h>
+
+/*
+ * Mutually-exclusive module options to select receive data path:
+ *  rx_copy : Packets are copied by network backend into local memory
+ *  rx_flip : Page containing packet data is transferred to our ownership
+ * For fully-virtualised guests there is no option - copying must be used.
+ * For paravirtualised guests, flipping is the default.
+ */
+static int rx_copy;
+module_param(rx_copy, bool, 0);
+MODULE_PARM_DESC(rx_copy, "Copy packets from network card (rather than flip)");
+static int rx_flip;
+module_param(rx_flip, bool, 0);
+MODULE_PARM_DESC(rx_flip, "Flip packets from network card (rather than copy)");
+
+#define RX_COPY_THRESHOLD 256
+
+#define GRANT_INVALID_REF  0
+
+#define NET_TX_RING_SIZE __RING_SIZE((struct netif_tx_sring *)0, PAGE_SIZE)
+#define NET_RX_RING_SIZE __RING_SIZE((struct netif_rx_sring *)0, PAGE_SIZE)
+
+struct netfront_info {
+   struct list_head list;

[PATCH 07/25] xen: Complete pagetable pinning for Xen

2007-04-23 Thread Jeremy Fitzhardinge
Xen has a notion of pinned pagetables, which are pagetables that
remain read-only to the guest and are validated by the hypervisor.
This makes context switches much cheaper, because the hypervisor
doesn't need to revalidate the pagetable each time.

This patch adds a PG_pinned flag for pagetable pages so we can tell if
it has been pinned or not.  This allows various pagetable update
optimisations.

This also adds a mm parameter to the alloc_pt pv_op, so that Xen can
see if we're adding a page to a pinned pagetable.  This is not
necessary for alloc_pd or release_p[dt], which is fortunate because it
isn't available at all callsites.

This also adds a new paravirt hook which is called during setup once
the zones and memory allocator have been initialized.  When the
init_mm pagetable is first built, the struct page array does not yet
exist, and so there's nowhere to put the init_mm pagetable's PG_pinned
flags.  Once the zones are initialized and the struct page array
exists, we can set the PG_pinned flags for those pages.

This patch also adds the Xen support for pte pages allocated out of
highmem (highpte), principally by implementing xen_kmap_atomic_pte.

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: Zach Amsden [EMAIL PROTECTED]

---
 arch/i386/kernel/setup.c|3 
 arch/i386/kernel/vmi.c  |2 
 arch/i386/mm/init.c |2 
 arch/i386/mm/pageattr.c |2 
 arch/i386/xen/enlighten.c   |  105 +++-
 arch/i386/xen/mmu.c |  280 +++
 arch/i386/xen/mmu.h |2 
 arch/i386/xen/xen-ops.h |2 
 include/asm-i386/paravirt.h |   16 +-
 include/asm-i386/pgalloc.h  |6 
 include/asm-i386/setup.h|4 
 include/linux/page-flags.h  |5 
 12 files changed, 289 insertions(+), 140 deletions(-)

===
--- a/arch/i386/kernel/setup.c
+++ b/arch/i386/kernel/setup.c
@@ -607,9 +607,12 @@ void __init setup_arch(char **cmdline_p)
sparse_init();
zone_sizes_init();
 
+
/*
 * NOTE: at this point the bootmem allocator is fully available.
 */
+
+   paravirt_post_allocator_init();
 
dmi_scan_machine();
 
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -361,7 +361,7 @@ static void *vmi_kmap_atomic_pte(struct 
 }
 #endif
 
-static void vmi_allocate_pt(u32 pfn)
+static void vmi_allocate_pt(struct mm_struct *mm, u32 pfn)
 {
vmi_set_page_type(pfn, VMI_PAGE_L1);
vmi_ops.allocate_page(pfn, VMI_PAGE_L1, 0, 0, 0);
===
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -87,7 +87,7 @@ static pte_t * __init one_page_table_ini
if (pmd_none(*pmd)) {
pte_t *page_table = (pte_t *) 
alloc_bootmem_low_pages(PAGE_SIZE);
 
-   paravirt_alloc_pt(__pa(page_table) >> PAGE_SHIFT);
+   paravirt_alloc_pt(&init_mm, __pa(page_table) >> PAGE_SHIFT);
set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
BUG_ON(page_table != pte_offset_kernel(pmd, 0));
}
===
--- a/arch/i386/mm/pageattr.c
+++ b/arch/i386/mm/pageattr.c
@@ -60,7 +60,7 @@ static struct page *split_large_page(uns
address = __pa(address);
addr = address & LARGE_PAGE_MASK; 
pbase = (pte_t *)page_address(base);
-   paravirt_alloc_pt(page_to_pfn(base));
+   paravirt_alloc_pt(&init_mm, page_to_pfn(base));
for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
set_pte(&pbase[i], pfn_pte(addr >> PAGE_SHIFT,
   addr == address ? prot : ref_prot));
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -8,6 +8,9 @@
 #include <linux/sched.h>
 #include <linux/bootmem.h>
 #include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/highmem.h>
 
 #include <xen/interface/xen.h>
 #include <xen/features.h>
@@ -453,32 +456,59 @@ static void xen_write_cr3(unsigned long 
}
 }
 
-static void xen_alloc_pt(u32 pfn)
-{
-   /* XXX pfn isn't necessarily a lowmem page */
+/* Early in boot, while setting up the initial pagetable, assume
+   everything is pinned. */
+static void xen_alloc_pt_init(struct mm_struct *mm, u32 pfn)
+{
+   BUG_ON(mem_map);/* should only be used early */
make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
 }
 
-static void xen_alloc_pd(u32 pfn)
-{
-   make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
-}
-
-static void xen_release_pd(u32 pfn)
-{
-   make_lowmem_page_readwrite(__va(PFN_PHYS(pfn)));
-}
-
+/* This needs to make sure the new pte page is pinned iff it's being
+   attached to a pinned pagetable. */
+static void xen_alloc_pt(struct mm_struct *mm, 

Re: [PATCH 00/25] xen: Xen implementation for paravirt_ops

2007-04-23 Thread Jeremy Fitzhardinge
Andi Kleen wrote:
 On Monday 23 April 2007 23:56:38 Jeremy Fitzhardinge wrote:
   
 Hi Andi,

 It applies to 2.6.21-rc7 + your patches + the last batch of pv_ops
 patches 
 

 I got most of those except for the broken sched_clock change.
   

Er, we had a bit of back-and-forward with that.  How did that end up?

 I posted. 
 

 How much testing outside Jeremylabs has it gotten? Some beta
 testing before merging would be good, otherwise we'll just have
 a flood of fixes shortly when it is exposed to users.
   

Yes.  I'm just prepping a tree for xen-devel, and I primed people at the
Xen Summit last week.

 This patch generally restricts itself to Xen-specific parts of the tree,
 though it does make a few small changes elsewhere.
 

 The general problem is that it is much more than just an architecture update.

   
 These patches include:
  - some helper routines for allocating address space and walking pagetables
 

 Needs review from mm people.
   

These have been pretty well looked at already.  They have been posted
repeatedly, and I think all the comments have been sorted out. 
alloc_vm_area() will be a bit affected by Andrew's -mm patch to make
vmalloc_sync_all a globally-visible arch export, but they merge nicely.

  - Xen interface header files
  - Core Xen implementation
  - Efficient late-pinning/early-unpinning pagetable handling 
 

 The number of new paravirt hooks makes me think of renaming it to
 everything_ops @|
   

There's only one new op in this series, and I couldn't work out a way to
avoid it, other than putting a #ifdef CONFIG_XEN in kernel/setup.c.  The
last patch posting didn't add any new hooks.  Which ones are you
referring to?

  - Virtualized time, including stolen time
 

 Can you let it be reviewed by the time people? (Thomas, Ingo, John, Roman 
 etc.)
   

Thomas has looked at and generally approves of the Xen clocksource/event
code.  The stolen time code is really only used to generate a few
numbers in /proc, and so has very little direct impact on the rest of
the kernel, and hasn't really attracted much interest as a result.  I've
posted the patch to implement sched_clock in terms of unstolen time to
the various time people repeatedly, and nobody has responded, so I guess
it doesn't irritate anyone too much; it would be nice to have some
definite feedback though.

  - Xen console, based on hvc console
  - Xenbus
 

 That one would need to be reviewed first. It's so much code that I can't
 do it all myself.
   

I put a specific plea for GregKH to look at this.

  - Netfront, the paravirtualized network device
 

 That one should go through the network device maintainer/netdev.
   

Stephen Hemminger has looked at this in the past and we've addressed all
his comments so far.  But it would be nice to get some more net
developers to review this; it was cc:d to netdev.

  - Blockfront, the paravirtualized block device
 

 And that needs a block device review and whoever maintains that (Jens?) 
   

He was cc:d.  I'll ask him specifically.

J
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 18/25] xen: Add Xen grant table support

2007-04-23 Thread Jeremy Fitzhardinge
Add Xen 'grant table' driver which allows granting of access to
selected local memory pages by other virtual machines and,
symmetrically, the mapping of remote memory pages which other virtual
machines have granted access to.

This driver is a prerequisite for many of the Xen virtual device
drivers, which grant the 'device driver domain' restricted and
temporary access to only those memory pages that are currently
involved in I/O operations.
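
The free-entry bookkeeping in this driver (get_free_entries() in the diff
below) threads a free list through an array, so that taking several entries
at once is a single list walk.  The following is only a user-space sketch of
that idea, not the driver code: the names freelist_init, freelist_get,
freelist_put, NR_ENTRIES and LIST_END are made up for illustration, and the
locking and table expansion that the real driver performs are left out.

```c
#include <assert.h>

#define NR_ENTRIES 8
#define LIST_END   (-1)

/* next_free[i] is the entry that follows i on the free list. */
static int next_free[NR_ENTRIES];
static int free_head = LIST_END;
static int free_count;

/* Chain every entry onto the free list: 0 -> 1 -> ... -> NR_ENTRIES-1. */
static void freelist_init(void)
{
	for (int i = 0; i < NR_ENTRIES; i++)
		next_free[i] = (i + 1 < NR_ENTRIES) ? i + 1 : LIST_END;
	free_head = 0;
	free_count = NR_ENTRIES;
}

/*
 * Detach `count` entries from the head of the free list.  Returns the
 * first detached entry (the rest stay chained behind it, terminated by
 * LIST_END), or -1 if fewer than `count` entries are free.
 */
static int freelist_get(int count)
{
	int first, last;

	if (count < 1 || count > free_count)
		return -1;

	first = last = free_head;
	free_count -= count;
	while (count-- > 1)
		last = next_free[last];
	free_head = next_free[last];
	next_free[last] = LIST_END;

	return first;
}

/* Return a single entry to the head of the free list. */
static void freelist_put(int entry)
{
	next_free[entry] = free_head;
	free_head = entry;
	free_count++;
}
```

Keeping the links in a plain array means that an entry's index doubles as
its reference, which is presumably why the driver can hand the index out
directly as the grant reference.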

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Signed-off-by: Ian Pratt [EMAIL PROTECTED]
Signed-off-by: Christian Limpach [EMAIL PROTECTED]
Signed-off-by: Chris Wright [EMAIL PROTECTED]
---
 drivers/xen/Makefile|1 
 drivers/xen/grant-table.c   |  576 +++
 include/xen/grant_table.h   |  107 ++
 include/xen/interface/grant_table.h |  112 +-
 4 files changed, 777 insertions(+), 19 deletions(-)

===
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -1,1 +1,2 @@ obj-y   += hvc-console.o
+obj-y  += grant-table.o
 obj-y  += hvc-console.o
===
--- /dev/null
+++ b/drivers/xen/grant-table.c
@@ -0,0 +1,576 @@
+/**
+ * grant_table.c
+ *
+ * Granting foreign access to our memory reservation.
+ *
+ * Copyright (c) 2005-2006, Christopher Clark
+ * Copyright (c) 2004-2005, K A Fraser
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+
+#include <xen/interface/xen.h>
+#include <xen/page.h>
+#include <xen/grant_table.h>
+
+#include <asm/pgtable.h>
+#include <asm/uaccess.h>
+#include <asm/sync_bitops.h>
+
+
+/* External tools reserve first few grant table entries. */
+#define NR_RESERVED_ENTRIES 8
+#define GNTTAB_LIST_END 0x
+#define GREFS_PER_GRANT_FRAME (PAGE_SIZE / sizeof(struct grant_entry))
+
+static grant_ref_t **gnttab_list;
+static unsigned int nr_grant_frames;
+static unsigned int boot_max_nr_grant_frames;
+static int gnttab_free_count;
+static grant_ref_t gnttab_free_head;
+static DEFINE_SPINLOCK(gnttab_list_lock);
+
+static struct grant_entry *shared;
+
+static struct gnttab_free_callback *gnttab_free_callback_list;
+
+static int gnttab_expand(unsigned int req_entries);
+
+#define RPP (PAGE_SIZE / sizeof(grant_ref_t))
+#define gnttab_entry(entry) (gnttab_list[(entry) / RPP][(entry) % RPP])
+
+static int get_free_entries(int count)
+{
+   unsigned long flags;
+   int ref, rc;
+   grant_ref_t head;
+
+   spin_lock_irqsave(&gnttab_list_lock, flags);
+
+   if ((gnttab_free_count < count) &&
+   ((rc = gnttab_expand(count - gnttab_free_count)) < 0)) {
+   spin_unlock_irqrestore(&gnttab_list_lock, flags);
+   return rc;
+   }
+
+   ref = head = gnttab_free_head;
+   gnttab_free_count -= count;
+   while (count-- > 1)
+   head = gnttab_entry(head);
+   gnttab_free_head = gnttab_entry(head);
+   gnttab_entry(head) = GNTTAB_LIST_END;
+
+   spin_unlock_irqrestore(&gnttab_list_lock, flags);
+
+   return ref;
+}
+
+#define get_free_entry() get_free_entries(1)
+
+static void do_free_callbacks(void)
+{
+   struct gnttab_free_callback *callback, *next;
+
+   callback = gnttab_free_callback_list;
+   gnttab_free_callback_list = NULL;
+
+   while (callback != NULL) {
+   next = callback->next;
+   if (gnttab_free_count >= callback->count) {
+   

[PATCH 20/25] xen: Add Xen virtual block device driver.

2007-04-23 Thread Jeremy Fitzhardinge
The block device frontend driver allows the kernel to access block
devices exported by a virtual machine containing a physical
block device driver.

Signed-off-by: Ian Pratt [EMAIL PROTECTED]
Signed-off-by: Christian Limpach [EMAIL PROTECTED]
Signed-off-by: Chris Wright [EMAIL PROTECTED]
Cc: Arjan van de Ven [EMAIL PROTECTED]
Cc: Greg KH [EMAIL PROTECTED]
Cc: Jens Axboe [EMAIL PROTECTED]
---
 drivers/block/Kconfig|1 
 drivers/block/Makefile   |1 
 drivers/block/xen/Kconfig|   14 
 drivers/block/xen/Makefile   |5 
 drivers/block/xen/blkfront.c |  844 ++
 drivers/block/xen/block.h|  135 ++
 drivers/block/xen/vbd.c  |  229 +++
 include/linux/major.h|2 
 8 files changed, 1231 insertions(+)

===
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -445,6 +445,7 @@ config CDROM_PKTCDVD_WCACHE
  don't do deferred write error handling yet.
 
 source "drivers/s390/block/Kconfig"
+source "drivers/block/xen/Kconfig"
 
 config ATA_OVER_ETH
tristate "ATA over Ethernet support"
===
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_BLK_DEV_SX8) += sx8.o
 obj-$(CONFIG_BLK_DEV_SX8)  += sx8.o
 obj-$(CONFIG_BLK_DEV_UB)   += ub.o
 
+obj-$(CONFIG_XEN)  += xen/
===
--- /dev/null
+++ b/drivers/block/xen/Kconfig
@@ -0,0 +1,14 @@
+menu "Xen block device drivers"
+depends on XEN
+
+config XEN_BLKDEV_FRONTEND
+   tristate "Block device frontend driver"
+   depends on XEN
+   default y
+   help
+ The block device frontend driver allows the kernel to access block
+ devices exported from a device driver virtual machine. Unless you
+ are building a dedicated device driver virtual machine, then you
+ almost certainly want to say Y here.
+
+endmenu
===
--- /dev/null
+++ b/drivers/block/xen/Makefile
@@ -0,0 +1,5 @@
+
+obj-$(CONFIG_XEN_BLKDEV_FRONTEND)  := xenblk.o
+
+xenblk-objs := blkfront.o vbd.o
+
===
--- /dev/null
+++ b/drivers/block/xen/blkfront.c
@@ -0,0 +1,844 @@
+/**
+ * blkfront.c
+ *
+ * XenLinux virtual block device driver.
+ *
+ * Copyright (c) 2003-2004, Keir Fraser & Steve Hand
+ * Modifications by Mark A. Williamson are (c) Intel Research Cambridge
+ * Copyright (c) 2004, Christian Limpach
+ * Copyright (c) 2004, Andrew Warfield
+ * Copyright (c) 2005, Christopher Clark
+ * Copyright (c) 2005, XenSource Ltd
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/version.h>
+#include "block.h"
+#include <linux/cdrom.h>
+#include <linux/sched.h>
+#include <linux/interrupt.h>
+#include <scsi/scsi.h>
+#include <xen/xenbus.h>
+#include <xen/interface/grant_table.h>
+#include <xen/grant_table.h>
+#include <xen/events.h>
+#include <xen/page.h>
+#include <asm/xen/hypervisor.h>
+
+#define BLKIF_STATE_DISCONNECTED 0
+#define BLKIF_STATE_CONNECTED1
+#define BLKIF_STATE_SUSPENDED2
+
+#define MAXIMUM_OUTSTANDING_BLOCK_REQS \
+(BLKIF_MAX_SEGMENTS_PER_REQUEST * BLK_RING_SIZE)
+#define GRANT_INVALID_REF  0
+
+static void connect(struct blkfront_info *);
+static void blkfront_closing(struct xenbus_device *);
+static int blkfront_remove(struct xenbus_device *);
+static int 

[PATCH 11/25] xen: Xen SMP guest support

2007-04-23 Thread Jeremy Fitzhardinge
This is a fairly straightforward Xen implementation of smp_ops.  One
thing this must do is carefully set up all the various sibling and
core maps so that the smp scheduler setup works properly (the setup is
very simple, since vcpus don't have any siblings or multiple cores).

Xen has its own IPI mechanisms, and has no dependency on any
APIC-based IPI.  The smp_ops hooks and the flush_tlb_others pv_op
allow a Xen guest to avoid all APIC code in arch/i386 (the only apic
operation is a single apic_read for the apic version number).

One subtle point which needs to be addressed is unpinning pagetables
when another cpu may have a lazy tlb reference to the pagetable. Xen
will not allow an in-use pagetable to be unpinned, so we must find any
other cpus with a reference to the pagetable and get them to shoot
down their references.

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: Benjamin LaHaise [EMAIL PROTECTED]
Cc: Ingo Molnar [EMAIL PROTECTED]
Cc: Andi Kleen [EMAIL PROTECTED]

---
 arch/i386/kernel/smp.c |   16 
 arch/i386/kernel/smpboot.c |4 
 arch/i386/xen/Makefile |6 
 arch/i386/xen/enlighten.c  |  118 -
 arch/i386/xen/events.c |   78 +++
 arch/i386/xen/mmu.c|   66 ++-
 arch/i386/xen/mmu.h|9 
 arch/i386/xen/setup.c  |9 
 arch/i386/xen/smp.c|  419 
 arch/i386/xen/time.c   |9 
 arch/i386/xen/xen-ops.h|   25 +
 include/asm-i386/mach-default/irq_vectors_limits.h |2 
 include/asm-i386/mmu_context.h |   17 
 include/asm-i386/processor.h   |1 
 include/asm-i386/smp.h |2 
 include/xen/events.h   |   27 +
 16 files changed, 730 insertions(+), 78 deletions(-)

===
--- a/arch/i386/kernel/smp.c
+++ b/arch/i386/kernel/smp.c
@@ -23,6 +23,7 @@
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
+#include <asm/mmu_context.h>
 #include <mach_apic.h>
 
 /*
@@ -256,21 +257,6 @@ static struct mm_struct * flush_mm;
 static struct mm_struct * flush_mm;
 static unsigned long flush_va;
 static DEFINE_SPINLOCK(tlbstate_lock);
-
-/*
- * We cannot call mmdrop() because we are in interrupt context, 
- * instead update mm->cpu_vm_mask.
- *
- * We need to reload %cr3 since the page tables may be going
- * away from under us..
- */
-static inline void leave_mm (unsigned long cpu)
-{
-   if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK)
-   BUG();
-   cpu_clear(cpu, per_cpu(cpu_tlbstate, cpu).active_mm->cpu_vm_mask);
-   load_cr3(swapper_pg_dir);
-}
 
 /*
  *
===
--- a/arch/i386/kernel/smpboot.c
+++ b/arch/i386/kernel/smpboot.c
@@ -151,7 +151,7 @@ void __init smp_alloc_memory(void)
  * a given CPU
  */
 
-static void __cpuinit smp_store_cpu_info(int id)
+void __cpuinit smp_store_cpu_info(int id)
 {
struct cpuinfo_x86 *c = cpu_data + id;
 
@@ -785,7 +785,7 @@ static inline struct task_struct * alloc
 /* Initialize the CPU's GDT.  This is either the boot CPU doing itself
(still using the master per-cpu area), or a CPU doing it for a
secondary which will soon come up. */
-static __cpuinit void init_gdt(int cpu)
+__cpuinit void init_gdt(int cpu)
 {
struct desc_struct *gdt = get_cpu_gdt_table(cpu);
 
===
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1,2 +1,4 @@ obj-y   := enlighten.o setup.o events.o t
-obj-y  := enlighten.o setup.o events.o time.o \
-   features.o mmu.o multicalls.o
+obj-y  := enlighten.o setup.o events.o time.o \
+   features.o mmu.o multicalls.o
+
+obj-$(CONFIG_SMP)  += smp.o
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -13,6 +13,7 @@
 #include <linux/highmem.h>
 
 #include <xen/interface/xen.h>
+#include <xen/interface/vcpu.h>
 #include <xen/features.h>
 #include <xen/page.h>
 
@@ -25,6 +26,8 @@
 #include <asm/setup.h>
 #include <asm/desc.h>
 #include <asm/pgtable.h>
+#include <asm/smp.h>
+#include <asm/tlbflush.h>
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -44,7 +47,7 @@ struct start_info *xen_start_info;
 struct start_info *xen_start_info;
 EXPORT_SYMBOL_GPL(xen_start_info);
 
-static void xen_vcpu_setup(int cpu)
+void xen_vcpu_setup(int cpu)
 {
per_cpu(xen_vcpu, cpu) = &HYPERVISOR_shared_info->vcpu_info[cpu];
 }
@@ -152,10 +155,10 @@ static void xen_safe_halt(void)
 
 static void xen_halt(void)
 {
-#if 0
if (irqs_disabled())
HYPERVISOR_vcpu_op(VCPUOP_down, 

[PATCH 22/25] xen: xen-netfront: use skb.cb for storing private data

2007-04-23 Thread Jeremy Fitzhardinge
Netfront's use of nh.raw and h.raw for storing page+offset is a bit
hinky, and it breaks with upcoming network stack updates which reduce
these fields to sub-pointer sizes.  Fortunately, skb offers the cb
field specifically for stashing this kind of info, so use it.
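
As a rough user-space illustration of the pattern (not the netfront code):
the sketch below assumes a stand-in struct fake_skb with a 48-byte cb
scratch area, and fake_page, fake_cb, fake_cb_store and fake_cb_load are
hypothetical names.  The patch casts skb->cb directly to the private
struct; this sketch copies with memcpy() instead so that it stays within
ISO C aliasing and alignment rules outside the kernel.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
struct fake_page { int id; };

struct fake_skb {
	char cb[48];	/* scratch area owned by whichever layer holds the packet */
};

/* Driver-private per-packet state, shaped like netfront_cb in the patch. */
struct fake_cb {
	struct fake_page *page;
	unsigned offset;
};

/* Fail the build if the private struct ever outgrows the scratch area. */
_Static_assert(sizeof(struct fake_cb) <= sizeof(((struct fake_skb *)0)->cb),
	       "private data must fit in cb[]");

/* Stash the page and offset in the packet's scratch area. */
static void fake_cb_store(struct fake_skb *skb, struct fake_page *page,
			  unsigned offset)
{
	struct fake_cb priv = { .page = page, .offset = offset };

	assert(skb != NULL);
	memcpy(skb->cb, &priv, sizeof(priv));
}

/* Read the stashed state back out. */
static struct fake_cb fake_cb_load(const struct fake_skb *skb)
{
	struct fake_cb priv;

	assert(skb != NULL);
	memcpy(&priv, skb->cb, sizeof(priv));
	return priv;
}
```

The compile-time size check is the part worth keeping in any variant: the
scratch area is fixed-size, so a private struct that grows past it would
silently overwrite neighbouring fields.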

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: Herbert Xu [EMAIL PROTECTED]
Cc: Chris Wright [EMAIL PROTECTED]
Cc: Christian Limpach [EMAIL PROTECTED]

---
 drivers/net/xen-netfront.c |   18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

===
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -52,6 +52,13 @@
 #include <xen/page.h>
 #include <xen/grant_table.h>
 
+struct netfront_cb {
+   struct page *page;
+   unsigned offset;
+};
+
+#define NETFRONT_SKB_CB(skb)   ((struct netfront_cb *)((skb)->cb))
+
 /*
  * Mutually-exclusive module options to select receive data path:
  *  rx_copy : Packets are copied by network backend into local memory
@@ -944,10 +951,11 @@ static void handle_incoming_queue(struct
struct sk_buff *skb;
 
while ((skb = __skb_dequeue(rxq)) != NULL) {
-   struct page *page = (struct page *)skb->nh.raw;
+   struct page *page = NETFRONT_SKB_CB(skb)->page;
void *vaddr = page_address(page);
-
-   memcpy(skb->data, vaddr + (skb->h.raw - skb->nh.raw),
+   unsigned offset = NETFRONT_SKB_CB(skb)->offset;
+
+   memcpy(skb->data, vaddr + offset,
   skb_headlen(skb));
 
if (page != skb_shinfo(skb)->frags[0].page)
@@ -1251,8 +1259,8 @@ err:
}
}
 
-   skb->nh.raw = (void *)skb_shinfo(skb)->frags[0].page;
-   skb->h.raw = skb->nh.raw + rx->offset;
+   NETFRONT_SKB_CB(skb)->page = skb_shinfo(skb)->frags[0].page;
+   NETFRONT_SKB_CB(skb)->offset = rx->offset;
 
len = rx->status;
if (len > RX_COPY_THRESHOLD)



Re: [PATCH 04/25] xen: Add XEN config options

2007-04-23 Thread Jeremy Fitzhardinge
Andi Kleen wrote:
 On Monday 23 April 2007 23:56:42 Jeremy Fitzhardinge wrote:
   
 The XEN config option enables the Xen paravirt_ops interface, which is
 installed when the kernel finds itself running under Xen.

 Xen is no longer a sub-architecture, so the X86_XEN subarch config
 option has gone.

 Xen is currently incompatible with PREEMPT, but this is fixed up later
 in the series.
 

 Shouldn't this be after the change that adds arch/i386/xen/Kconfig?

 Otherwise you break bisects
   

It should be OK.  The series should build and run at each patch (though
I have to admit I haven't tested this).  In general I've been adding
config options for each feature as the feature itself is added.

J


[PATCH 16/25] xen: Use the hvc console infrastructure for Xen console

2007-04-23 Thread Jeremy Fitzhardinge
Implement a Xen back-end for hvc console.
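
The console ring used by write_console() and read_console() in the diff
below relies on free-running producer and consumer indices that are only
masked when the buffer is indexed.  A small user-space sketch of that
technique follows; struct ring, RING_SIZE, RING_MASK, ring_write and
ring_read are illustrative names, and the memory barriers and event-channel
notification that the real driver needs for a ring shared with another
domain are deliberately omitted.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u	/* must be a power of two for the mask to work */
#define RING_MASK(idx) ((idx) & (RING_SIZE - 1))

struct ring {
	char buf[RING_SIZE];
	uint32_t prod;	/* free-running producer index, never reset */
	uint32_t cons;	/* free-running consumer index, never reset */
};

/* Copy in as much of data[] as fits; returns the number of bytes queued. */
static int ring_write(struct ring *r, const char *data, int len)
{
	int sent = 0;

	/* prod - cons is the fill level, even after the indices wrap. */
	while (sent < len && (r->prod - r->cons) < RING_SIZE)
		r->buf[RING_MASK(r->prod++)] = data[sent++];

	return sent;
}

/* Drain up to len bytes into out[]; returns the number of bytes read. */
static int ring_read(struct ring *r, char *out, int len)
{
	int recv = 0;

	while (r->cons != r->prod && recv < len)
		out[recv++] = r->buf[RING_MASK(r->cons++)];

	return recv;
}
```

Because the indices are unsigned and only ever incremented, a full ring
(prod - cons == RING_SIZE) and an empty ring (prod == cons) are
distinguishable without sacrificing a slot.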

From: Gerd Hoffmann [EMAIL PROTECTED]
Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]

---
 arch/i386/xen/Kconfig |1 
 arch/i386/xen/events.c|3 -
 drivers/Makefile  |3 +
 drivers/xen/Makefile  |1 
 drivers/xen/hvc-console.c |  134 +
 include/xen/events.h  |1 
 6 files changed, 142 insertions(+), 1 deletion(-)

===
--- a/arch/i386/xen/Kconfig
+++ b/arch/i386/xen/Kconfig
@@ -5,6 +5,7 @@ config XEN
 config XEN
bool "Enable support for Xen hypervisor"
depends on PARAVIRT
+   select HVC_DRIVER
default y
help
  This is the Linux Xen port.
===
--- a/arch/i386/xen/events.c
+++ b/arch/i386/xen/events.c
@@ -219,7 +219,7 @@ static int find_unbound_irq(void)
return irq;
 }
 
-static int bind_evtchn_to_irq(unsigned int evtchn)
+int bind_evtchn_to_irq(unsigned int evtchn)
 {
int irq;
 
@@ -244,6 +244,7 @@ static int bind_evtchn_to_irq(unsigned i
 
return irq;
 }
+EXPORT_SYMBOL_GPL(bind_evtchn_to_irq);
 
 static int bind_ipi_to_irq(unsigned int ipi, unsigned int cpu)
 {
===
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -14,6 +14,9 @@ obj-$(CONFIG_ACPI)+= acpi/
 # was used and do nothing if so
 obj-$(CONFIG_PNP)  += pnp/
 obj-$(CONFIG_ARM_AMBA) += amba/
+
+# Xen is the default console when running as a guest
+obj-$(CONFIG_XEN)  += xen/
 
 # char/ comes before serial/ etc so that the VT console is the boot-time
 # default.
===
--- /dev/null
+++ b/drivers/xen/Makefile
@@ -0,0 +1,1 @@
+obj-y  += hvc-console.o
===
--- /dev/null
+++ b/drivers/xen/hvc-console.c
@@ -0,0 +1,134 @@
+/*
+ * xen console driver interface to hvc_console.c
+ *
+ * (c) 2007 Gerd Hoffmann [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
+ */
+
+#include <linux/console.h>
+#include <linux/delay.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/types.h>
+
+#include <asm/xen/hypervisor.h>
+#include <xen/page.h>
+#include <xen/events.h>
+#include <xen/interface/io/console.h>
+
+#include "../char/hvc_console.h"
+
+#define HVC_COOKIE   0x58656e /* "Xen" in hex */
+
+static struct hvc_struct *hvc;
+static int xencons_irq;
+
+/* -- */
+
+static inline struct xencons_interface *xencons_interface(void)
+{
+   return mfn_to_virt(xen_start_info->console.domU.mfn);
+}
+
+static inline void notify_daemon(void)
+{
+   /* Use evtchn: this is called early, before irq is set up. */
+   notify_remote_via_evtchn(xen_start_info->console.domU.evtchn);
+}
+
+static int write_console(uint32_t vtermno, const char *data, int len)
+{
+   struct xencons_interface *intf = xencons_interface();
+   XENCONS_RING_IDX cons, prod;
+   int sent = 0;
+
+   cons = intf->out_cons;
+   prod = intf->out_prod;
+   mb();
+   BUG_ON((prod - cons) > sizeof(intf->out));
+
+   while ((sent < len) && ((prod - cons) < sizeof(intf->out)))
+   intf->out[MASK_XENCONS_IDX(prod++, intf->out)] = data[sent++];
+
+   wmb();
+   intf->out_prod = prod;
+
+   notify_daemon();
+   return sent;
+}
+
+static int read_console(uint32_t vtermno, char *buf, int len)
+{
+   struct xencons_interface *intf = xencons_interface();
+   XENCONS_RING_IDX cons, prod;
+   int recv = 0;
+
+   cons = intf->in_cons;
+   prod = intf->in_prod;
+   mb();
+   BUG_ON((prod - cons) > sizeof(intf->in));
+
+   while (cons != prod && recv < len)
+   buf[recv++] = intf->in[MASK_XENCONS_IDX(cons++,intf->in)];
+
+   mb();
+   intf->in_cons = cons;
+
+   notify_daemon();
+   return recv;
+}
+
+static struct hv_ops hvc_ops = {
+   .get_chars = read_console,
+   .put_chars = write_console,
+};
+
+static int __init xen_init(void)
+{
+   struct hvc_struct *hp;
+
+   if (!is_running_on_xen())
+   return 0;
+
+ 

[PATCH 23/25] xen: Lockdep fixes for xen-netfront

2007-04-23 Thread Jeremy Fitzhardinge
netfront contains two locking problems found by lockdep:

1. rx_lock is a normal spinlock, and tx_lock is an irq spinlock.  This
   means that in normal use, tx_lock may be taken by an interrupt routine
   while rx_lock is held.  However, netif_disconnect_backend takes them
   in the order tx_lock->rx_lock, which could lead to a deadlock.  Reverse
   them.
2. rx_lock can also be taken in softirq context, so it should be taken/released
   with spin_(un)lock_bh.

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: Chris Wright [EMAIL PROTECTED]
Cc: Christian Limpach [EMAIL PROTECTED]

---
 drivers/net/xen-netfront.c |   30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

===
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -515,14 +515,14 @@ static int network_open(struct net_devic
 
memset(&np->stats, 0, sizeof(np->stats));
 
-   spin_lock(&np->rx_lock);
+   spin_lock_bh(&np->rx_lock);
if (netfront_carrier_ok(np)) {
network_alloc_rx_buffers(dev);
np->rx.sring->rsp_event = np->rx.rsp_cons + 1;
if (RING_HAS_UNCONSUMED_RESPONSES(&np->rx))
netif_rx_schedule(dev);
}
-   spin_unlock(&np->rx_lock);
+   spin_unlock_bh(&np->rx_lock);
 
network_maybe_wake_tx(dev);
 
@@ -1212,10 +1212,10 @@ static int netif_poll(struct net_device 
int pages_flipped = 0;
int err;
 
-   spin_lock(&np->rx_lock);
+   spin_lock_bh(&np->rx_lock);
 
if (unlikely(!netfront_carrier_ok(np))) {
-   spin_unlock(&np->rx_lock);
+   spin_unlock_bh(&np->rx_lock);
return 0;
}
 
@@ -1356,7 +1356,7 @@ err:
local_irq_restore(flags);
}
 
-   spin_unlock(&np->rx_lock);
+   spin_unlock_bh(&np->rx_lock);
 
return more_to_do;
 }
@@ -1399,7 +1399,7 @@ static void netif_release_rx_bufs(struct
 
skb_queue_head_init(&free_list);
 
-   spin_lock(&np->rx_lock);
+   spin_lock_bh(&np->rx_lock);
 
for (id = 0; id < NET_RX_RING_SIZE; id++) {
if ((ref = np->grant_rx_ref[id]) == GRANT_INVALID_REF) {
@@ -1469,7 +1469,7 @@ static void netif_release_rx_bufs(struct
while ((skb = __skb_dequeue(&free_list)) != NULL)
dev_kfree_skb(skb);
 
-   spin_unlock(&np->rx_lock);
+   spin_unlock_bh(&np->rx_lock);
 }
 
 static int network_close(struct net_device *dev)
@@ -1579,8 +1579,8 @@ static int network_connect(struct net_de
dev_info(&dev->dev, "has %sing receive path.\n",
 np->copying_receiver ? "copy" : "flipp");
 
+   spin_lock_bh(&np->rx_lock);
spin_lock_irq(&np->tx_lock);
-   spin_lock(&np->rx_lock);
 
/*
 * Recovery procedure:
@@ -1632,8 +1632,8 @@ static int network_connect(struct net_de
network_tx_buf_gc(dev);
network_alloc_rx_buffers(dev);
 
-   spin_unlock(&np->rx_lock);
spin_unlock_irq(&np->tx_lock);
+   spin_unlock_bh(&np->rx_lock);
 
return 0;
 }
@@ -1689,7 +1689,7 @@ static ssize_t store_rxbuf_min(struct de
if (target > RX_MAX_TARGET)
target = RX_MAX_TARGET;
 
-   spin_lock(&np->rx_lock);
+   spin_lock_bh(&np->rx_lock);
if (target > np->rx_max_target)
np->rx_max_target = target;
np->rx_min_target = target;
@@ -1698,7 +1698,7 @@ static ssize_t store_rxbuf_min(struct de
 
network_alloc_rx_buffers(netdev);
 
-   spin_unlock(&np->rx_lock);
+   spin_unlock_bh(&np->rx_lock);
return len;
 }
 
@@ -1732,7 +1732,7 @@ static ssize_t store_rxbuf_max(struct de
if (target > RX_MAX_TARGET)
target = RX_MAX_TARGET;
 
-   spin_lock(&np->rx_lock);
+   spin_lock_bh(&np->rx_lock);
if (target < np->rx_min_target)
np->rx_min_target = target;
np->rx_max_target = target;
@@ -1741,7 +1741,7 @@ static ssize_t store_rxbuf_max(struct de
 
network_alloc_rx_buffers(netdev);
 
-   spin_unlock(&np->rx_lock);
+   spin_unlock_bh(&np->rx_lock);
return len;
 }
 
@@ -1885,11 +1885,11 @@ static void netif_disconnect_backend(str
 static void netif_disconnect_backend(struct netfront_info *info)
 {
/* Stop old i/f to prevent errors whilst we rebuild the state. */
+   spin_lock_bh(&info->rx_lock);
spin_lock_irq(&info->tx_lock);
-   spin_lock(&info->rx_lock);
netfront_carrier_off(info);
-   spin_unlock(&info->rx_lock);
spin_unlock_irq(&info->tx_lock);
+   spin_unlock_bh(&info->rx_lock);
 
if (info->irq)
unbind_from_irqhandler(info->irq, info->netdev);



[PATCH 13/25] xen: xen: lazy-mmu operations

2007-04-23 Thread Jeremy Fitzhardinge
This patch uses the lazy-mmu hooks to batch mmu operations where
possible.  This is primarily useful for batching operations applied to
active pagetables, which happens during mprotect, munmap, mremap and
the like (mmap does not do bulk pagetable operations, so it isn't
helped).
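
The batching idea can be sketched in user space as follows.  This is an
illustration under stated assumptions rather than the Xen multicall code:
struct batch, batch_add, batch_flush, batch_enter_lazy and batch_leave_lazy
are hypothetical names, and batch_flush() stands in for the one expensive
call that submits every queued operation at once.

```c
#include <assert.h>
#include <stdbool.h>

#define BATCH_MAX 8

struct batch {
	int ops[BATCH_MAX];	/* queued operation codes */
	int count;		/* number of ops currently queued */
	bool lazy;		/* true while inside a lazy section */
	int flushes;		/* how many times the queue was submitted */
	int submitted;		/* total operations submitted so far */
};

/* Stand-in for the single expensive call that submits all queued ops. */
static void batch_flush(struct batch *b)
{
	if (b->count == 0)
		return;
	b->submitted += b->count;
	b->flushes++;
	b->count = 0;
}

/* Queue one op; submit immediately unless we are in lazy mode. */
static void batch_add(struct batch *b, int op)
{
	if (b->count == BATCH_MAX)
		batch_flush(b);		/* make room before queueing */
	b->ops[b->count++] = op;
	if (!b->lazy)
		batch_flush(b);
}

static void batch_enter_lazy(struct batch *b)
{
	b->lazy = true;
}

/* Leaving lazy mode submits whatever accumulated in the meantime. */
static void batch_leave_lazy(struct batch *b)
{
	b->lazy = false;
	batch_flush(b);
}
```

Outside a lazy section every operation costs one submission; inside one,
any number of operations up to the queue size share a single submission,
which is where the saving for bulk pagetable updates comes from.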

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]

---
 arch/i386/xen/enlighten.c  |   56 +++-
 arch/i386/xen/mmu.c|   56 
 arch/i386/xen/multicalls.c |4 +--
 3 files changed, 78 insertions(+), 38 deletions(-)

===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -451,28 +451,38 @@ static void xen_apic_write(unsigned long
 
 static void xen_flush_tlb(void)
 {
-	struct mmuext_op op;
-
-	op.cmd = MMUEXT_TLB_FLUSH_LOCAL;
-	if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
-		BUG();
+	struct mmuext_op *op;
+	struct multicall_space mcs = xen_mc_entry(sizeof(*op));
+
+	op = mcs.args;
+	op->cmd = MMUEXT_TLB_FLUSH_LOCAL;
+	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+
+	xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
 
 static void xen_flush_tlb_single(unsigned long addr)
 {
-	struct mmuext_op op;
-
-	op.cmd = MMUEXT_INVLPG_LOCAL;
-	op.arg1.linear_addr = addr & PAGE_MASK;
-	if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
-		BUG();
+	struct mmuext_op *op;
+	struct multicall_space mcs = xen_mc_entry(sizeof(*op));
+
+	op = mcs.args;
+	op->cmd = MMUEXT_INVLPG_LOCAL;
+	op->arg1.linear_addr = addr & PAGE_MASK;
+	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+
+	xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
 
 static void xen_flush_tlb_others(const cpumask_t *cpus, struct mm_struct *mm,
 				 unsigned long va)
 {
-	struct mmuext_op op;
+	struct {
+		struct mmuext_op op;
+		cpumask_t mask;
+	} *args;
 	cpumask_t cpumask = *cpus;
+	struct multicall_space mcs;
 
 	/*
 	 * A couple of (to be removed) sanity checks:
@@ -489,17 +499,21 @@ static void xen_flush_tlb_others(const c
 	if (cpus_empty(cpumask))
 		return;
 
+	mcs = xen_mc_entry(sizeof(*args));
+	args = mcs.args;
+	args->mask = cpumask;
+	args->op.arg2.vcpumask = &args->mask;
+
 	if (va == TLB_FLUSH_ALL) {
-		op.cmd = MMUEXT_TLB_FLUSH_MULTI;
-		op.arg2.vcpumask = (void *)cpus;
+		args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
 	} else {
-		op.cmd = MMUEXT_INVLPG_MULTI;
-		op.arg1.linear_addr = va;
-		op.arg2.vcpumask = (void *)cpus;
-	}
-
-	if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
-		BUG();
+		args->op.cmd = MMUEXT_INVLPG_MULTI;
+		args->op.arg1.linear_addr = va;
+	}
+
+	MULTI_mmuext_op(mcs.mc, &args->op, 1, NULL, DOMID_SELF);
+
+	xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
 
 
 static unsigned long xen_read_cr2(void)
===
--- a/arch/i386/xen/mmu.c
+++ b/arch/i386/xen/mmu.c
@@ -56,12 +56,20 @@ void make_lowmem_page_readwrite(void *va
 
 void xen_set_pmd(pmd_t *ptr, pmd_t val)
 {
-	struct mmu_update u;
-
-	u.ptr = virt_to_machine(ptr).maddr;
-	u.val = pmd_val_ma(val);
-	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
-		BUG();
+	struct multicall_space mcs;
+	struct mmu_update *u;
+
+	preempt_disable();
+
+	mcs = xen_mc_entry(sizeof(*u));
+	u = mcs.args;
+	u->ptr = virt_to_machine(ptr).maddr;
+	u->val = pmd_val_ma(val);
+	MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF);
+
+	xen_mc_issue(PARAVIRT_LAZY_MMU);
+
+	preempt_enable();
 }
 
 /*
@@ -104,20 +112,38 @@ void xen_set_pte_at(struct mm_struct *mm
 void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
 		    pte_t *ptep, pte_t pteval)
 {
-	if ((mm != current->mm && mm != &init_mm) ||
-	    HYPERVISOR_update_va_mapping(addr, pteval, 0) != 0)
-		xen_set_pte(ptep, pteval);
+	if (mm == current->mm || mm == &init_mm) {
+		if (xen_get_lazy_mode() == PARAVIRT_LAZY_MMU) {
+			struct multicall_space mcs;
+			mcs = xen_mc_entry(0);
+
+			MULTI_update_va_mapping(mcs.mc, addr, pteval, 0);
+			xen_mc_issue(PARAVIRT_LAZY_MMU);
+			return;
+		} else
+			if (HYPERVISOR_update_va_mapping(addr, pteval, 0) == 0)
+				return;
+	}
+	xen_set_pte(ptep, pteval);
 }
 
 #ifdef CONFIG_X86_PAE
 void xen_set_pud(pud_t *ptr, pud_t val)
 {
-   struct mmu_update u;
-
-   u.ptr = 

[PATCH 17/25] xen: Add early printk support via hvc console

2007-04-23 Thread Jeremy Fitzhardinge
Add early printk support via hvc console, enable using
earlyprintk=xen on the kernel command line.

From: Gerd Hoffmann [EMAIL PROTECTED]
Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Acked-by: Ingo Molnar [EMAIL PROTECTED]

---
 arch/x86_64/kernel/early_printk.c |5 +
 drivers/xen/hvc-console.c |   25 +
 include/xen/hvc-console.h |6 ++
 3 files changed, 36 insertions(+)

===
--- a/arch/x86_64/kernel/early_printk.c
+++ b/arch/x86_64/kernel/early_printk.c
@@ -6,6 +6,7 @@
 #include <asm/io.h>
 #include <asm/processor.h>
 #include <asm/fcntl.h>
+#include <xen/hvc-console.h>
 
 /* Simple VGA output */
 
@@ -243,6 +244,10 @@ static int __init setup_early_printk(cha
 		simnow_init(buf + 6);
 		early_console = &simnow_console;
 		keep_early = 1;
+#ifdef CONFIG_XEN
+	} else if (!strncmp(buf, "xen", 3)) {
+		early_console = &xenboot_console;
+#endif
 	}
 	register_console(early_console);
 	return 0;
===
--- a/drivers/xen/hvc-console.c
+++ b/drivers/xen/hvc-console.c
@@ -28,6 +28,7 @@
 #include <xen/page.h>
 #include <xen/events.h>
 #include <xen/interface/io/console.h>
+#include <xen/hvc-console.h>
 
 #include "../char/hvc_console.h"
 
@@ -132,3 +133,27 @@ module_init(xen_init);
 module_init(xen_init);
 module_exit(xen_fini);
 console_initcall(xen_cons_init);
+
+static void xenboot_write_console(struct console *console, const char *string,
+				  unsigned len)
+{
+	unsigned int linelen, off = 0;
+	const char *pos;
+
+	while (off < len && NULL != (pos = strchr(string+off, '\n'))) {
+		linelen = pos-string+off;
+		if (off + linelen > len)
+			break;
+		write_console(0, string+off, linelen);
+		write_console(0, "\r\n", 2);
+		off += linelen + 1;
+	}
+	if (off < len)
+		write_console(0, string+off, len-off);
+}
+
+struct console xenboot_console = {
+	.name		= "xenboot",
+	.write		= xenboot_write_console,
+	.flags		= CON_PRINTBUFFER | CON_BOOT,
+};
===
--- /dev/null
+++ b/include/xen/hvc-console.h
@@ -0,0 +1,6 @@
+#ifndef XEN_HVC_CONSOLE_H
+#define XEN_HVC_CONSOLE_H
+
+extern struct console xenboot_console;
+
+#endif /* XEN_HVC_CONSOLE_H */

-- 



[PATCH 03/25] xen: Add nosegneg capability to the vsyscall page notes

2007-04-23 Thread Jeremy Fitzhardinge
Add the nosegneg fake capability to the vsyscall page notes. This is
used by the runtime linker to select a glibc version which then
disables negative-offset accesses to the thread-local segment via
%gs. These accesses require emulation in Xen (because segments are
truncated to protect the hypervisor address space) and avoiding them
provides a measurable performance boost.

Signed-off-by: Ian Pratt [EMAIL PROTECTED]
Signed-off-by: Christian Limpach [EMAIL PROTECTED]
Signed-off-by: Chris Wright [EMAIL PROTECTED]
Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Acked-by: Zachary Amsden [EMAIL PROTECTED]
Cc: Roland McGrath [EMAIL PROTECTED]
Cc: Ulrich Drepper [EMAIL PROTECTED]

---
 arch/i386/kernel/vsyscall-note.S |   28 
 1 file changed, 28 insertions(+)

===
--- a/arch/i386/kernel/vsyscall-note.S
+++ b/arch/i386/kernel/vsyscall-note.S
@@ -23,3 +24,31 @@ 3:   .balign 4;  /* pad out section */   
 
 	ASM_ELF_NOTE_BEGIN(".note.kernel-version", "a", UTS_SYSNAME, 0)
 	.long LINUX_VERSION_CODE
 	ASM_ELF_NOTE_END
+
+#ifdef CONFIG_XEN
+/*
+ * Add a special note telling glibc's dynamic linker a fake hardware
+ * flavor that it will use to choose the search path for libraries in the
+ * same way it uses real hardware capabilities like "mmx".
+ * We supply "nosegneg" as the fake capability, to indicate that we
+ * do not like negative offsets in instructions using segment overrides,
+ * since we implement those inefficiently.  This makes it possible to
+ * install libraries optimized to avoid those access patterns in someplace
+ * like /lib/i686/tls/nosegneg.  Note that an /etc/ld.so.conf.d/file
+ * corresponding to the bits here is needed to make ldconfig work right.
+ * It should contain:
+ *	hwcap 0 nosegneg
+ * to match the mapping of bit to name that we give here.
+ */
+#define NOTE_KERNELCAP_BEGIN(ncaps, mask) \
+	ASM_ELF_NOTE_BEGIN(".note.kernelcap", "a", "GNU", 2) \
+	.long ncaps, mask
+#define NOTE_KERNELCAP(bit, name) \
+	.byte bit; .asciz name
+#define NOTE_KERNELCAP_END ASM_ELF_NOTE_END
+
+NOTE_KERNELCAP_BEGIN(1, 2)
+NOTE_KERNELCAP(1, "nosegneg")
+NOTE_KERNELCAP_END
+#endif
+

-- 



[PATCH 15/25] xen: xen time fixups

2007-04-23 Thread Jeremy Fitzhardinge
1. make sure timer state is set up before bringing up CPU
2. make sure snapshot of 64-bit time values is atomic

Be sure, however, that the clockevent source is registered on its home
CPU.
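
The second point can be illustrated outside the kernel.  The sketch below
is only an approximation of the idea, with a hypothetical split64 type and
read_split64() helper that are not part of the patch: on a 32-bit machine
a 64-bit value is read as two halves, so the reader re-checks the high
half and retries if a carry happened in between.  Naming the halves
explicitly sidesteps the endianness question.

```c
#include <stdint.h>

/* A 64-bit counter stored as two 32-bit halves (hypothetical layout). */
struct split64 {
	volatile uint32_t lo;
	volatile uint32_t hi;
};

/*
 * Read high, then low, then check that high has not changed.  If the
 * writer carried from low into high in between, go round again.
 */
static uint64_t read_split64(const struct split64 *p)
{
	uint32_t h, l;

	do {
		h = p->hi;
		l = p->lo;
	} while (p->hi != h);

	return ((uint64_t)h << 32) | l;
}
```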

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]

---
 arch/i386/xen/smp.c |4 +-
 arch/i386/xen/time.c|   93 +++
 arch/i386/xen/xen-ops.h |3 +
 3 files changed, 67 insertions(+), 33 deletions(-)

===
--- a/arch/i386/xen/smp.c
+++ b/arch/i386/xen/smp.c
@@ -78,10 +78,11 @@ static __cpuinit void cpu_bringup_and_id
 	int cpu = smp_processor_id();
 
 	cpu_init();
-	xen_setup_timer();
 
 	preempt_disable();
 	per_cpu(cpu_state, cpu) = CPU_ONLINE;
+
+	xen_setup_cpu_clockevents();
 
 	/* We can take interrupts now: we're officially up. */
 	local_irq_enable();
@@ -275,6 +276,7 @@ int __cpuinit xen_cpu_up(unsigned int cp
 	per_cpu(current_task, cpu) = idle;
 	xen_vcpu_setup(cpu);
 	irq_ctx_init(cpu);
+	xen_setup_timer(cpu);
 
 	/* make sure interrupts start blocked */
 	per_cpu(xen_vcpu, cpu)->evtchn_upcall_mask = 1;
===
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -40,6 +40,35 @@ static DEFINE_PER_CPU(u64, residual_stol
 static DEFINE_PER_CPU(u64, residual_stolen);
 static DEFINE_PER_CPU(u64, residual_blocked);
 
+/* return an consistent snapshot of 64-bit time/counter value */
+static u64 get64(const u64 *p)
+{
+	u64 ret;
+
+	if (BITS_PER_LONG < 64) {
+		u32 *p32 = (u32 *)p;
+		u32 h, l;
+
+		/*
+		 * Read high then low, and then make sure high is
+		 * still the same; this will only loop if low wraps
+		 * and carries into high.
+		 * XXX some clean way to make this endian-proof?
+		 */
+		do {
+			h = p32[1];
+			barrier();
+			l = p32[0];
+			barrier();
+		} while (p32[1] != h);
+
+		ret = (((u64)h) << 32) | l;
+	} else
+		ret = *p;
+
+	return ret;
+}
+
 /*
  * Runstate accounting
  */
@@ -53,24 +82,22 @@ static void get_runstate_snapshot(struct
 	state = &__get_cpu_var(runstate);
 
 	do {
-		state_time = state->state_entry_time;
+		state_time = get64(&state->state_entry_time);
 		barrier();
 		*res = *state;
 		barrier();
-	} while(state->state_entry_time != state_time);
-}
-
-static void setup_runstate_info(void)
+	} while(get64(&state->state_entry_time) != state_time);
+}
+
+static void setup_runstate_info(int cpu)
 {
 	struct vcpu_register_runstate_memory_area area;
 
-	area.addr.v = &__get_cpu_var(runstate);
+	area.addr.v = &per_cpu(runstate, cpu);
 
 	if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
-			       smp_processor_id(), &area))
+			       cpu, &area))
 		BUG();
-
-	get_runstate_snapshot(&__get_cpu_var(runstate_snapshot));
 }
 
 static void do_stolen_accounting(void)
@@ -185,12 +212,10 @@ unsigned long xen_cpu_khz(void)
  * Reads a consistent set of time-base values from Xen, into a shadow data
  * area.
  */
-static void get_time_values_from_xen(void)
+static unsigned get_time_values_from_xen(void)
 {
 	struct vcpu_time_info	*src;
 	struct shadow_time_info	*dst;
-
-	preempt_disable();
 
 	src = &__get_cpu_var(xen_vcpu)->time;
 	dst = &__get_cpu_var(shadow_time);
@@ -205,7 +230,7 @@ static void get_time_values_from_xen(voi
 		rmb();
 	} while ((src->version & 1) | (dst->version ^ src->version));
 
-	preempt_enable();
+	return dst->version;
 }
 
 /*
@@ -249,7 +274,7 @@ static u64 get_nsec_offset(struct shadow
 static u64 get_nsec_offset(struct shadow_time_info *shadow)
 {
 	u64 now, delta;
-	rdtscll(now);
+	now = native_read_tsc();
 	delta = now - shadow->tsc_timestamp;
 	return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
 }
@@ -258,10 +283,14 @@ static cycle_t xen_clocksource_read(void
 {
 	struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
 	cycle_t ret;
-
-	get_time_values_from_xen();
-
-	ret = shadow->system_timestamp + get_nsec_offset(shadow);
+	unsigned version;
+
+	do {
+		version = get_time_values_from_xen();
+		barrier();
+		ret = shadow->system_timestamp + get_nsec_offset(shadow);
+		barrier();
+	} while(version != __get_cpu_var(xen_vcpu)->time.version);
 
 	put_cpu_var(shadow_time);
 
@@ -483,9 +512,8 @@ static irqreturn_t xen_timer_interrupt(i
return ret;
 }
 
-void 

[PATCH 12/25] xen: Add support for preemption

2007-04-23 Thread Jeremy Fitzhardinge
Add Xen support for preemption.  This is mostly a cleanup of existing
preempt_enable/disable calls, or just comments to explain the current
usage.

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]

---
 arch/i386/xen/Kconfig  |2 
 arch/i386/xen/enlighten.c  |   93 
 arch/i386/xen/mmu.c|4 +
 arch/i386/xen/multicalls.c |   11 ++---
 arch/i386/xen/time.c   |   22 --
 5 files changed, 88 insertions(+), 44 deletions(-)

===
--- a/arch/i386/xen/Kconfig
+++ b/arch/i386/xen/Kconfig
@@ -4,7 +4,7 @@
 
 config XEN
 	bool "Enable support for Xen hypervisor"
-	depends on PARAVIRT && !PREEMPT
+	depends on PARAVIRT
 	default y
 	help
 	  This is the Linux Xen port.
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -2,6 +2,7 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <linux/preempt.h>
+#include <linux/hardirq.h>
 #include <linux/percpu.h>
 #include <linux/delay.h>
 #include <linux/start_kernel.h>
@@ -92,11 +93,10 @@ static unsigned long xen_save_fl(void)
 	struct vcpu_info *vcpu;
 	unsigned long flags;
 
-	preempt_disable();
 	vcpu = x86_read_percpu(xen_vcpu);
+
 	/* flag has opposite sense of mask */
 	flags = !vcpu->evtchn_upcall_mask;
-	preempt_enable();
 
/* convert to IF type flag
   -0 - 0x
@@ -109,41 +109,56 @@ static void xen_restore_fl(unsigned long
 {
 	struct vcpu_info *vcpu;
 
-	preempt_disable();
-
 	/* convert from IF type flag */
 	flags = !(flags & X86_EFLAGS_IF);
+
+	/* There's a one instruction preempt window here.  We need to
+	   make sure we're don't switch CPUs between getting the vcpu
+	   pointer and updating the mask. */
+	preempt_disable();
 	vcpu = x86_read_percpu(xen_vcpu);
 	vcpu->evtchn_upcall_mask = flags;
+	preempt_enable_no_resched();
+
+	/* Doesn't matter if we get preempted here, because any
+	   pending event will get dealt with anyway. */
+
 	if (flags == 0) {
+		preempt_check_resched();
 		barrier(); /* unmask then check (avoid races) */
 		if (unlikely(vcpu->evtchn_upcall_pending))
 			force_evtchn_callback();
-		preempt_enable();
-	} else
-		preempt_enable_no_resched();
+	}
 }
 
 static void xen_irq_disable(void)
 {
+	/* There's a one instruction preempt window here.  We need to
+	   make sure we're don't switch CPUs between getting the vcpu
+	   pointer and updating the mask. */
+	preempt_disable();
+	x86_read_percpu(xen_vcpu)->evtchn_upcall_mask = 1;
+	preempt_enable_no_resched();
+}
+
+static void xen_irq_enable(void)
+{
 	struct vcpu_info *vcpu;
-	preempt_disable();
-	vcpu = x86_read_percpu(xen_vcpu);
-	vcpu->evtchn_upcall_mask = 1;
-	preempt_enable_no_resched();
-}
-
-static void xen_irq_enable(void)
-{
-	struct vcpu_info *vcpu;
-
+
+	/* There's a one instruction preempt window here.  We need to
+	   make sure we're don't switch CPUs between getting the vcpu
+	   pointer and updating the mask. */
 	preempt_disable();
 	vcpu = x86_read_percpu(xen_vcpu);
 	vcpu->evtchn_upcall_mask = 0;
+	preempt_enable_no_resched();
+
+	/* Doesn't matter if we get preempted here, because any
+	   pending event will get dealt with anyway. */
+
 	barrier(); /* unmask then check (avoid races) */
 	if (unlikely(vcpu->evtchn_upcall_pending))
 		force_evtchn_callback();
-	preempt_enable();
 }
 
 static void xen_safe_halt(void)
@@ -163,6 +178,8 @@ static void xen_halt(void)
 
 static void xen_set_lazy_mode(enum paravirt_lazy_mode mode)
 {
+   BUG_ON(preemptible());
+
switch(mode) {
case PARAVIRT_LAZY_NONE:
BUG_ON(x86_read_percpu(xen_lazy_mode) == PARAVIRT_LAZY_NONE);
@@ -262,12 +279,17 @@ static void xen_write_ldt_entry(struct d
xmaddr_t mach_lp = virt_to_machine(lp);
u64 entry = (u64)high  32 | low;
 
+   preempt_disable();
+
xen_mc_flush();
if (HYPERVISOR_update_descriptor(mach_lp.maddr, entry))
BUG();
-}
-
-static int cvt_gate_to_trap(int vector, u32 low, u32 high, struct trap_info 
*info)
+
+   preempt_enable();
+}
+
+static int cvt_gate_to_trap(int vector, u32 low, u32 high,
+   struct trap_info *info)
 {
u8 type, dpl;
 
@@ -295,11 +317,13 @@ static DEFINE_PER_CPU(struct Xgt_desc_st
also update Xen. */
 static void xen_write_idt_entry(struct desc_struct *dt, int entrynum, u32 low, 
u32 high)
 {
-
-   int cpu = smp_processor_id();
unsigned long p = (unsigned long)dt[entrynum];
-   unsigned long start = 

[PATCH 14/25] xen: xen: deal with negative stolen time

2007-04-23 Thread Jeremy Fitzhardinge
Stolen time should never be negative; if it ever is, it probably
indicates some other bug.  However, if it does happen, then it's better
to just clamp it at zero, rather than trying to account for it as a
huge positive number.
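
A minimal userspace sketch of the failure mode and the clamp follows.
The helper name clamped_delta() is hypothetical, and it compares before
subtracting rather than switching to a signed type as the patch does; the
effect on the result is the same.

```c
#include <stdint.h>

/*
 * Difference of two unsigned counters, clamped at zero.  A plain
 * "now - then" would wrap to a value near 2^64 when then > now.
 */
static uint64_t clamped_delta(uint64_t now, uint64_t then)
{
	return now < then ? 0 : now - then;
}
```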

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]

---
 arch/i386/xen/time.c |   19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

===
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -77,7 +77,7 @@ static void do_stolen_accounting(void)
 {
 	struct vcpu_runstate_info state;
 	struct vcpu_runstate_info *snap;
-	u64 blocked, runnable, offline, stolen;
+	s64 blocked, runnable, offline, stolen;
 	cputime_t ticks;
 
 	get_runstate_snapshot(&state);
@@ -97,6 +97,10 @@ static void do_stolen_accounting(void)
 	   including any left-overs from last time.  Passing NULL to
 	   account_steal_time accounts the time as stolen. */
 	stolen = runnable + offline + __get_cpu_var(residual_stolen);
+
+	if (stolen < 0)
+		stolen = 0;
+
 	ticks = 0;
 	while(stolen >= NS_PER_TICK) {
 		ticks++;
@@ -109,6 +113,10 @@ static void do_stolen_accounting(void)
 	   including any left-overs from last time.  Passing idle to
 	   account_steal_time accounts the time as idle/wait. */
 	blocked += __get_cpu_var(residual_blocked);
+
+	if (blocked < 0)
+		blocked = 0;
+
 	ticks = 0;
 	while(blocked >= NS_PER_TICK) {
 		ticks++;
@@ -127,7 +135,8 @@ unsigned long long xen_sched_clock(void)
 {
 	struct vcpu_runstate_info state;
 	cycle_t now;
-	unsigned long long ret;
+	u64 ret;
+	s64 offset;
 
 	/*
 	 * Ideally sched_clock should be called on a per-cpu basis
@@ -142,9 +151,13 @@ unsigned long long xen_sched_clock(void)
 
 	WARN_ON(state.state != RUNSTATE_running);
 
+	offset = now - state.state_entry_time;
+	if (offset < 0)
+		offset = 0;
+
 	ret = state.time[RUNSTATE_blocked] +
 		state.time[RUNSTATE_running] +
-		(now - state.state_entry_time);
+		offset;
 
 	preempt_enable();
 

-- 



[PATCH 09/25] xen: Account for time stolen by Xen

2007-04-23 Thread Jeremy Fitzhardinge
This accounts for the time Xen steals from our VCPUs.  This accounting
gets run on each timer interrupt, just as a way to get it run
relatively often, and when interesting things are going on.

Stolen time is not really used by much in the kernel; it is reported
in /proc/stat, and that's about it.
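
The accounting loop in the patch turns a nanosecond interval into whole
ticks and keeps the left-over nanoseconds for the next run.  A small
sketch of that arithmetic, using division instead of the patch's
subtraction loop, might look like this.  The names ns_to_ticks() and
NS_PER_TICK_EXAMPLE are hypothetical, and the constant assumes HZ = 100.

```c
#include <stdint.h>

#define NS_PER_TICK_EXAMPLE 10000000ULL	/* assumes HZ = 100 */

/*
 * Convert an interval in ns to whole ticks, carrying the remainder in
 * *residual so it is counted next time instead of being lost.
 */
static uint64_t ns_to_ticks(uint64_t ns, uint64_t *residual)
{
	uint64_t total = ns + *residual;
	uint64_t ticks = total / NS_PER_TICK_EXAMPLE;

	*residual = total % NS_PER_TICK_EXAMPLE;
	return ticks;
}
```

Two 15 ms intervals in a row therefore yield one tick and then two,
rather than one and one with 10 ms silently dropped.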

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: john stultz [EMAIL PROTECTED]

---
 arch/i386/xen/time.c |  101 +-
 1 file changed, 100 insertions(+), 1 deletion(-)

===
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -2,6 +2,7 @@
 #include <linux/interrupt.h>
 #include <linux/clocksource.h>
 #include <linux/clockchips.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/xen/hypervisor.h>
 #include <asm/xen/hypercall.h>
@@ -14,6 +15,7 @@
 
 #define XEN_SHIFT 22
 #define TIMER_SLOP 10  /* Xen may fire a timer up to this many ns 
early */
+#define NS_PER_TICK	(1000000000ll / HZ)
 
 /* These are perodically updated in shared_info, and then copied here. */
 struct shadow_time_info {
@@ -26,6 +28,99 @@ struct shadow_time_info {
 
 static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
 
+/* runstate info updated by Xen */
+static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate);
+
+/* snapshots of runstate info */
+static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate_snapshot);
+
+/* unused ns of stolen and blocked time */
+static DEFINE_PER_CPU(u64, residual_stolen);
+static DEFINE_PER_CPU(u64, residual_blocked);
+
+/*
+ * Runstate accounting
+ */
+static void get_runstate_snapshot(struct vcpu_runstate_info *res)
+{
+	u64 state_time;
+	struct vcpu_runstate_info *state;
+
+	preempt_disable();
+
+	state = &__get_cpu_var(runstate);
+
+	do {
+		state_time = state->state_entry_time;
+		barrier();
+		*res = *state;
+		barrier();
+	} while(state->state_entry_time != state_time);
+
+	preempt_enable();
+}
+
+static void setup_runstate_info(void)
+{
+	struct vcpu_register_runstate_memory_area area;
+
+	area.addr.v = &__get_cpu_var(runstate);
+
+	if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
+			       smp_processor_id(), &area))
+		BUG();
+
+	get_runstate_snapshot(&__get_cpu_var(runstate_snapshot));
+}
+
+static void do_stolen_accounting(void)
+{
+	struct vcpu_runstate_info state;
+	struct vcpu_runstate_info *snap;
+	u64 blocked, runnable, offline, stolen;
+	cputime_t ticks;
+
+	get_runstate_snapshot(&state);
+
+	WARN_ON(state.state != RUNSTATE_running);
+
+	snap = &__get_cpu_var(runstate_snapshot);
+
+	/* work out how much time the VCPU has not been runn*ing*  */
+	blocked = state.time[RUNSTATE_blocked] - snap->time[RUNSTATE_blocked];
+	runnable = state.time[RUNSTATE_runnable] - snap->time[RUNSTATE_runnable];
+	offline = state.time[RUNSTATE_offline] - snap->time[RUNSTATE_offline];
+
+	*snap = state;
+
+	/* Add the appropriate number of ticks of stolen time,
+	   including any left-overs from last time.  Passing NULL to
+	   account_steal_time accounts the time as stolen. */
+	stolen = runnable + offline + __get_cpu_var(residual_stolen);
+	ticks = 0;
+	while(stolen >= NS_PER_TICK) {
+		ticks++;
+		stolen -= NS_PER_TICK;
+	}
+	__get_cpu_var(residual_stolen) = stolen;
+	account_steal_time(NULL, ticks);
+
+	/* Add the appropriate number of ticks of blocked time,
+	   including any left-overs from last time.  Passing idle to
+	   account_steal_time accounts the time as idle/wait. */
+	blocked += __get_cpu_var(residual_blocked);
+	ticks = 0;
+	while(blocked >= NS_PER_TICK) {
+		ticks++;
+		blocked -= NS_PER_TICK;
+	}
+	__get_cpu_var(residual_blocked) = blocked;
+	account_steal_time(idle_task(smp_processor_id()), ticks);
+}
+
+
+
+/* Get the CPU speed from Xen */
 unsigned long xen_cpu_khz(void)
 {
u64 cpu_khz = 100ULL  32;
@@ -338,6 +433,8 @@ static irqreturn_t xen_timer_interrupt(i
ret = IRQ_HANDLED;
}
 
+   do_stolen_accounting();
+
return ret;
 }
 
@@ -363,6 +460,8 @@ static void xen_setup_timer(int cpu)
evt-irq = irq;
clockevents_register_device(evt);
 
+   setup_runstate_info();
+
put_cpu_var(xen_clock_events);
 }
 
@@ -375,7 +474,7 @@ __init void xen_time_init(void)
clocksource_register(xen_clocksource);
 
if (HYPERVISOR_vcpu_op(VCPUOP_stop_periodic_timer, cpu, NULL) == 0) {
-   /* Successfully turned off 100hz tick, so we have the
+   /* Successfully turned off 100Hz tick, so we have the
   vcpuop-based timer interface */
printk(KERN_DEBUG Xen: using vcpuop 

[PATCH 00/25] xen: Xen implementation for paravirt_ops

2007-04-23 Thread Jeremy Fitzhardinge
Hi Andi,

This series of patches implements the Xen paravirt-ops interface.
It applies to 2.6.21-rc7 + your patches + the last batch of pv_ops
patches I posted.

This patch generally restricts itself to Xen-specific parts of the tree,
though it does make a few small changes elsewhere.

These patches include:
 - some helper routines for allocating address space and walking pagetables
 - Xen interface header files
 - Core Xen implementation
 - Efficient late-pinning/early-unpinning pagetable handling
 - Virtualized time, including stolen time
 - SMP support
 - Preemption support
 - Batched pagetable updates
 - Xen console, based on hvc console
 - Xenbus
 - Netfront, the paravirtualized network device
 - Blockfront, the paravirtualized block device

Thanks,
J
-- 



[PATCH 02/25] xen: Allocate and free vmalloc areas

2007-04-23 Thread Jeremy Fitzhardinge
Allocate/destroy a 'vmalloc' VM area: alloc_vm_area and free_vm_area.
The alloc function ensures that page tables are constructed for the
region of kernel virtual address space and mapped into init_mm.

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Signed-off-by: Ian Pratt [EMAIL PROTECTED]
Signed-off-by: Christian Limpach [EMAIL PROTECTED]
Signed-off-by: Chris Wright [EMAIL PROTECTED]
Cc: Jan Beulich [EMAIL PROTECTED]
Cc: Andi Kleen [EMAIL PROTECTED]

---
 include/linux/vmalloc.h |4 +++
 mm/vmalloc.c|   51 +++
 2 files changed, 55 insertions(+)

===
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -68,6 +68,10 @@ extern int map_vm_area(struct vm_struct 
struct page ***pages);
 extern void unmap_vm_area(struct vm_struct *area);
 
+/* Allocate/destroy a 'vmalloc' VM area. */
+extern struct vm_struct *alloc_vm_area(unsigned long size);
+extern void free_vm_area(struct vm_struct *area);
+
 /*
  * Internals.  Dont't use..
  */
===
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -757,3 +757,54 @@ out_einval_locked:
 }
 EXPORT_SYMBOL(remap_vmalloc_range);
 
+static int f(pte_t *pte, struct page *pmd_page, unsigned long addr, void *data)
+{
+	/* apply_to_page_range() does all the hard work. */
+	return 0;
+}
+
+/**
+ * alloc_vm_area - allocate a range of kernel address space
+ * @size:	size of the area
+ * @returns:	NULL on failure, vm_struct on success
+ *
+ * This function reserves a range of kernel address space, and
+ * allocates pagetables to map that range.  No actual mappings
+ * are created.  If the kernel address space is not shared
+ * between processes, it syncs the pagetable across all
+ * processes.
+ */
+struct vm_struct *alloc_vm_area(unsigned long size)
+{
+	struct vm_struct *area;
+
+	area = get_vm_area(size, VM_IOREMAP);
+	if (area == NULL)
+		return NULL;
+
+	/*
+	 * This ensures that page tables are constructed for this region
+	 * of kernel virtual address space and mapped into init_mm.
+	 */
+	if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
+				area->size, f, NULL)) {
+		free_vm_area(area);
+		return NULL;
+	}
+
+	/* Make sure the pagetables are constructed in process kernel
+	   mappings */
+	vmalloc_sync_all();
+
+	return area;
+}
+EXPORT_SYMBOL_GPL(alloc_vm_area);
+
+void free_vm_area(struct vm_struct *area)
+{
+	struct vm_struct *ret;
+	ret = remove_vm_area(area->addr);
+	BUG_ON(ret != area);
+	kfree(area);
+}
+EXPORT_SYMBOL_GPL(free_vm_area);

-- 



Re: [RFC][PATCH -mm 2/3] freezer: Introduce freezer_flags

2007-04-23 Thread Oleg Nesterov
On 04/24, Rafael J. Wysocki wrote:

> On Tuesday, 24 April 2007 00:55, Oleg Nesterov wrote:
> > On 04/24, Rafael J. Wysocki wrote:
> > >
> > > Should I clear it in dup_task_struct() or is there a better place?
> >
> > I personally think we should do this in dup_task_struct(). In fact, I
> > believe it is better to replace the
> >
> > 	*tsk = *orig;
> >
> > with some helper (like setup_thread_stack() below), and that helper clears
> > ->freezer_flags. Say, copy_task_struct().
>
> Hmm, wouldn't that be overkill?  copy_task_struct() would have to do
> *tsk = *orig anyway, and we only need to clear one field apart from this.
>
> Some other fields are cleared towards the end of dup_task_struct(), so perhaps
> we could clear freezer_flags in there too?

Yes. And I strongly believe it is bad we don't have the helper which does some
random stuff like p->did_exec = 0.

The same for thread_info. Could you answer quickly where do we clear TIF_FREEZE
currently? We don't.
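
To make the suggestion concrete, a helper of the kind discussed above
could look roughly like the sketch below.  This is a userspace
illustration only: struct task is a hypothetical, cut-down stand-in for
struct task_struct, and copy_task() is not an existing kernel function.

```c
/* Hypothetical, cut-down stand-in for struct task_struct. */
struct task {
	int pid;
	unsigned int freezer_flags;
	int did_exec;
};

/* Copy the parent, then reset the fields a new task must not inherit. */
static void copy_task(struct task *tsk, const struct task *orig)
{
	*tsk = *orig;
	tsk->freezer_flags = 0;
	tsk->did_exec = 0;
}
```

Keeping the resets next to the structure copy means a newly added
per-task flag has one obvious place where it gets cleared.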

Oleg.



Re: [PATCH 03/25] xen: Add nosegneg capability to the vsyscall page notes

2007-04-23 Thread Roland McGrath
> + * It should contain:
> + *	hwcap 0 nosegneg
> + * to match the mapping of bit to name that we give here.

This needs to be "hwcap 1 nosegneg" to match:

> +NOTE_KERNELCAP_BEGIN(1, 2)
> +NOTE_KERNELCAP(1, "nosegneg")
> +NOTE_KERNELCAP_END

The actual bits you are using should be fine.  (You're intentionally
skipping bit 0 to work around old glibc bugs, which you might want to add
to the comments.  Also a comment or perhaps using "1<<1" syntax would make
it more clear that 2 is the bit mask containing bit 1 and that's why it has
to be 2, and not because of some other magical property of 2.)  But if
kernel packagers don't write the matching bit number in their ld.so.conf.d
files, then ld.so.cache lookups won't work right.
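
The relationship between the bit number and the mask can be written out
as in the short sketch below.  The macro and function names are made up
for illustration and are not taken from glibc or the kernel.

```c
/* Hypothetical names, for illustration only. */
#define NOSEGNEG_BIT	1			/* bit 0 skipped on purpose */
#define NOSEGNEG_MASK	(1u << NOSEGNEG_BIT)	/* == 2 */

/* Nonzero if the capability numbered 'bit' is set in 'mask'. */
static int hwcap_bit_set(unsigned int mask, unsigned int bit)
{
	return (mask >> bit) & 1u;
}
```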


Thanks,
Roland


Re: Loud pop coming from hard drive on reboot

2007-04-23 Thread Chuck Ebbert
Peter Zijlstra wrote:
 
> but I have an increasing seek error rate as well. I got the ST disk
> because thinkwiki suggested it.
 

Apparently Seagate has their own definition of seek error rate.
Large numbers are normal, or at least very common.

Now I wonder if they have their own way of doing retract count...


Re: [ANNOUNCE][PATCH] Kcli - Kernel command line interface.

2007-04-23 Thread Andrew Morton
On Mon, 23 Apr 2007 14:31:39 -0700 (PDT)
Matt Ranon [EMAIL PROTECTED] wrote:


(text reformatted to less than 80 cols.  Please, we'll get along a lot
better if you don't send 1000-column emails)

> The Jem team is pleased to announce the release of Kcli, an in-kernel
> command line interface.  Kcli is intended for a special class of embedded
> Linux applications.  The Linux kernel has become the de facto standard OS
> for embedded applications.  This means that Linux is getting bent in some
> ways that may appear strange to some.  One of these ways is embedded
> applications that do not use user space.  User space consists of a
> statically linked, one line program, that simply sleeps forever,
> transforming Linux into a classical embedded RTOS.  VxWorks developers will
> understand what we are talking about, and they may recall how much they
> depend on the VxWorks shell.  Kcli attempts to meet the need for a shell
> for this class of embedded Linux applications.

Alas, we are not vxworks developers, and probably most of us know zero
about the use-cases for this feature, why embedded systems find it
valuable, etc.

So it's up to you to tell us all this.

> Kcli provides a command line environment that runs in the kernel, and that
> can be extended with custom commands registered by other kernel modules.
> We have found Kcli invaluable for our development, and we are releasing the
> patch, in case others find it useful.
>
> Kcli is directly derived from libcli written by David Parrish and Brendan
> O'Dea, and the regular expression support is directly derived from diet
> libc written by Felix von Leitner.
>
> The Jem team fully understands that this kind of patch may not be
> appropriate for inclusion in the mainline kernel code.  We have no
> expectation that it will be, and we leave that decision fully in the hands
> of those responsible.

We don't have enough information to make that call.

  Nonetheless, we feel that others may find it useful,
 and we will also appreciate any appropriate feedback from the community.
 
 Kcli is standalone, and modifies no kernel files, except for the Kconfig
 and Makefile modifications required to wire it into the configuration and
 build.

The obvious question is: what's _wrong_ with doing all this in some
cut-down userspace environment like busybox?  Why is this stuff better?

Obviously some embedded developers have considered that option and
have rejected it.  But we do need to be told, at length, why that
decision was made.



Re: SATA SB600 works in 2.6.20.4 but not in 2.6.21-rc5 with irqpoll parameter

2007-04-23 Thread Jeff Garzik

Karsten Vieth wrote:

I can't report this problem from a new kernel, but I have the same
problem with the kernel 2.6.20.1-33x from f7-test3.

I managed to boot with these options:
linux noapic acpi=off pci=nomsi irqpoll


Can you narrow down the options?

Hopefully pci=nomsi or similar should do it.

irqpoll in particular is heavyweight and to be avoided if possible.

Jeff





Re: [PATCH] Return EPERM not ECHILD on security_task_wait failure

2007-04-23 Thread James Morris
On Thu, 15 Mar 2007, Roland McGrath wrote:

 This patch makes do_wait return -EPERM instead of -ECHILD if some
 children were ruled out solely because security_task_wait failed.

What about using the return value from the security_task_wait hook (which 
should be -EACCES) ?


- James
-- 
James Morris
[EMAIL PROTECTED]


Re: [REPORT] First glitch1 results, 2.6.21-rc7-git6-CFSv5 + SD 0.46

2007-04-23 Thread Ed Tomlinson
On Monday 23 April 2007 17:57, Bill Davidsen wrote:
 I am not sure a binary attachment will go through; I will move to the web
 site if not.

I did a quick try of this script here.

With SD 0.46 and X at nice 0 I was getting 1-2 frames per second.  I decided
to try cfs v5.  The option to disable auto renicing did not work, so many
threads other than X are now at -19...

SD 0.46            1-2 FPS
cfs v5 nice -19    219-233 FPS
cfs v5 nice 0      1000-1996 FPS

Looks like, in this case, nice -19 for X is NOT a good idea.

Kernel is 2.6.20.7 (gentoo) UP amd64 with HZ 300 and voluntary preempt (a
fully preemptible kernel eventually locks up switching between 32 and 64
apps)

Thanks,

Ed Tomlinson


Re: [REPORT] First glitch1 results, 2.6.21-rc7-git6-CFSv5 + SD 0.46

2007-04-23 Thread Ed Tomlinson
On Monday 23 April 2007 19:45, Ed Tomlinson wrote:
 On Monday 23 April 2007 17:57, Bill Davidsen wrote:
  I am not sure a binary attachment will go through; I will move to the web
  site if not.
 
 I did a quick try of this script here.
 
 With SD 0.46 and X at nice 0 I was getting 1-2 frames per second.  I decided
 to try cfs v5.  The option to disable auto renicing did not work, so many
 threads other than X are now at -19...
 
 SD 0.46            1-2 FPS
 cfs v5 nice -19    219-233 FPS
 cfs v5 nice 0      1000-1996 FPS
cfs v5 nice -10     60-65 FPS
 
 Looks like, in this case, nice -19 for X is NOT a good idea.
 
 Kernel is 2.6.20.7 (gentoo) UP amd64 with HZ 300 and voluntary preempt (a
 fully preemptible kernel eventually locks up switching between 32 and 64
 apps)

Thanks
Ed 


Re: MODULE_MAINTAINER

2007-04-23 Thread Rusty Russell
On Mon, 2007-04-23 at 07:52 -0400, Robert P. J. Day wrote:
 On Mon, 23 Apr 2007, Rusty Russell wrote:
 
  On Mon, 2007-04-23 at 11:33 +0200, Rene Herman wrote:
   On 04/04/2007 06:38 PM, Rene Herman wrote:
  
   Rusty?
 
  Valid points have been made on both sides.  I suggest:
 
  #define MODULE_MAINTAINER(_maintainer) \
  MODULE_AUTHOR("(Maintained by) " _maintainer)
 
 why bring MODULE_AUTHOR into it?  just define it in terms of
 MODULE_INFO:

Because author is an established field.  People might well search for
it.  This is fairly clear, and assuming that the maintainer has actually
done any maintenance, they're an author too.

Rusty.



Re: Question about Reiser4

2007-04-23 Thread H. Peter Anvin

Theodore Tso wrote:


One of the big problems of using a filesystem as a DB is the system
call overheads.  If you use huge numbers of tiny files, then each
attempt to read an atom of information from the DB takes three system
calls --- an open(), read(), and close(), with all of the overheads in
terms of dentry and inode cache.



Now, to be fair, there are probably a number of cases where 
open/lseek/readv/close and open/lseek/writev/close would be worth doing 
as a single system call.  The big problem as far as I can see involves 
EINTR handling; such a system call has serious restartability implications.


Of course, there are Ingo's syslets...

-hpa
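
For concreteness, here is a minimal user-space sketch of the three-call
pattern under discussion.  read_small_file() is a made-up helper name, not
an existing interface.  It retries only the read() step on EINTR; open()
can in principle be interrupted too, which this sketch simply reports as an
error.  A combined system call would have to make that restart decision
internally for every step.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes of a tiny file in one go.
 * Uninterrupted, this costs three system calls: open, read, close.
 * Returns the number of bytes read, or -1 with errno set.
 * (Hypothetical helper for illustration only.)
 */
ssize_t read_small_file(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd, saved;

	assert(path != NULL && buf != NULL);

	fd = open(path, O_RDONLY);		/* system call 1 */
	if (fd < 0)
		return -1;

	do {
		n = read(fd, buf, len);		/* system call 2 */
	} while (n < 0 && errno == EINTR);	/* restart only this step */

	saved = errno;
	close(fd);				/* system call 3 */
	errno = saved;				/* report read()'s error, not close()'s */
	return n;
}
```

Because each step is a separate call, user space decides what to redo after
a signal.  Folding the steps into one call moves that decision into the
kernel, which is where the restartability question comes from.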


Re: [PATCH] Remove obsolete label from ISDN4Linux (v3)

2007-04-23 Thread Tilman Schmidt
Am 22.04.2007 17:17 schrieb Alan Cox:
 Well once it ends up BROKEN perhaps patches will appear, or before
 that. If not, well, the pain factor will resolve the problem.

 No risk of deadlock. It'll progress to BROKEN, which will either cause
 sufficient pain for someone to get off their arse and fix it, for enough
 of a vendor's users to get the vendor to do the work, or for someone who
 cares to pay a third party to do the work.

 Do I sense some hidden agenda there?
 
 No I'm speaking from experience - if a subsystem maintainer is too
 busy/working on other projects and the subsystem stops working it
 produces a rapid and sudden supply of new maintainers, unless nobody
 cares in which case it can go in the bitbucket.
 
  The isdn4linux subsystem will not progress to BROKEN unless
  somebody pushes it there. 
 
 It has drivers using functions that will soon be deleted. That isn't so
 much pushing as getting fed up with pulling someone else's cart
 along.

Do I understand you correctly? You deliberately want to move it
to BROKEN to cause pain in the hope of forcing somebody other than
the person who did the kernel change in the first place (quote
stable_api_nonsense.txt) to do the fixing up?

Am 22.04.2007 18:20 schrieb Alan Cox:
 Why, or
 rather how, were the writers of newer APIs _allowed_ to push *their*
 stuff into the kernel _without_ even bothering to convert the
 *existing* users of the older APIs in the kernel? This goes against
 
 Because to convert the existing ISDN4Linux heap into the new APIs would
 require someone with all the cards involved and a lot of time (as the
 card drivers need a *lot* of work by now to bring them up to today's work)

Not true. None of the past kernel API changes were done by someone
who had all the hardware for the affected drivers. I have personally
acked changes to the driver I maintain from people who don't have
the hardware, and the changes were fine. The one inventing a new
kernel API to replace an old one is in the best position for actually
replacing it in the existing users of the old API, and that's also
what stable_api_nonsense.txt stipulates.

 Precedent, that implies it is a new behaviour - which it isn't. We
 regularly break old driver code when it is necessary in order to make
 general progress. Grep for BROKEN in the kernel tree.

I did grep for BROKEN in the 2.6.21-rc7 sources and couldn't find
an instance of a driver that was still in active use being broken in
order to make general progress. OTOH I remember several cases of
drivers being kept alive even though they were in the way of
progress, because there were still users relying on them.

 You, and anyone else who wants to, are free to work on I4L and fix it,
 improve it and make it better. 

You are turning the situation on its head. I4L works. Somebody wants
to push through a kernel API change that would break it. In every
other case I know, it was the responsibility of those doing the
kernel API change to fix the in-tree users of that API. As long as
they didn't finish that job, the old API would stay. Nobody advocates
moving reiserfs to BROKEN for still using lock_kernel(), to cite a
recent issue. So why isdn4linux?

-- 
Tilman Schmidt  E-Mail: [EMAIL PROTECTED]
Bonn, Germany
- Undetected errors are handled as if no error occurred. (IBM) -





[patch 1/7] libata: check for AN support

2007-04-23 Thread Kristen Carlson Accardi
Check to see if an ATAPI device supports Asynchronous Notification.
If so, enable it.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/drivers/ata/libata-core.c
===
--- 2.6-git.orig/drivers/ata/libata-core.c
+++ 2.6-git/drivers/ata/libata-core.c
@@ -70,6 +70,7 @@ const unsigned long sata_deb_timing_long
 static unsigned int ata_dev_init_params(struct ata_device *dev,
u16 heads, u16 sectors);
 static unsigned int ata_dev_set_xfermode(struct ata_device *dev);
+static unsigned int ata_dev_set_AN(struct ata_device *dev);
 static void ata_dev_xfermask(struct ata_device *dev);
 
 static unsigned int ata_print_id = 1;
@@ -1744,6 +1745,23 @@ int ata_dev_configure(struct ata_device 
}
dev->cdb_len = (unsigned int) rc;
 
+   /*
+* check to see if this ATAPI device supports
+* Asynchronous Notification
+*/
+   if ((ap->flags & ATA_FLAG_AN) && ata_id_has_AN(id))
+   {
+   /* issue SET feature command to turn this on */
+   rc = ata_dev_set_AN(dev);
+   if (rc) {
+   ata_dev_printk(dev, KERN_ERR,
+   "unable to set AN\n");
+   rc = -EINVAL;
+   goto err_out_nosup;
+   }
+   dev->flags |= ATA_DFLAG_AN;
+   }
+
if (ata_id_cdb_intr(dev->id)) {
dev->flags |= ATA_DFLAG_CDB_INTR;
cdb_intr_string = ", CDB intr";
@@ -3525,6 +3543,42 @@ static unsigned int ata_dev_set_xfermode
 }
 
 /**
+ * ata_dev_set_AN - Issue SET FEATURES - SATA FEATURES
+ *   with sector count set to indicate
+ *   Asynchronous Notification feature
+ * @dev: Device to which command will be sent
+ *
+ * Issue SET FEATURES - SATA FEATURES command to device @dev
+ * on port @ap.
+ *
+ * LOCKING:
+ * PCI/etc. bus probe sem.
+ *
+ * RETURNS:
+ * 0 on success, AC_ERR_* mask otherwise.
+ */
+static unsigned int ata_dev_set_AN(struct ata_device *dev)
+{
+   struct ata_taskfile tf;
+   unsigned int err_mask;
+
+   /* set up set-features taskfile */
+   DPRINTK("set features - SATA features\n");
+
+   ata_tf_init(dev, &tf);
+   tf.command = ATA_CMD_SET_FEATURES;
+   tf.feature = SETFEATURES_SATA_ENABLE;
+   tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE;
+   tf.protocol = ATA_PROT_NODATA;
+   tf.nsect = SATA_AN;
+
+   err_mask = ata_exec_internal(dev, &tf, NULL, DMA_NONE, NULL, 0);
+
+   DPRINTK("EXIT, err_mask=%x\n", err_mask);
+   return err_mask;
+}
+
+/**
  * ata_dev_init_params - Issue INIT DEV PARAMS command
  * @dev: Device to which command will be sent
  * @heads: Number of heads (taskfile parameter)
Index: 2.6-git/include/linux/ata.h
===
--- 2.6-git.orig/include/linux/ata.h
+++ 2.6-git/include/linux/ata.h
@@ -194,6 +194,12 @@ enum {
SETFEATURES_WC_ON   = 0x02, /* Enable write cache */
SETFEATURES_WC_OFF  = 0x82, /* Disable write cache */
 
+   SETFEATURES_SATA_ENABLE = 0x10, /* Enable use of SATA feature */
+   SETFEATURES_SATA_DISABLE = 0x90, /* Disable use of SATA feature */
+
+   /* SETFEATURE Sector counts for SATA features */
+   SATA_AN = 0x05,  /* Asynchronous Notification */
+
/* ATAPI stuff */
ATAPI_PKT_DMA   = (1 << 0),
ATAPI_DMADIR    = (1 << 2), /* ATAPI data dir:
@@ -299,6 +305,8 @@ struct ata_taskfile {
 #define ata_id_queue_depth(id) (((id)[75] & 0x1f) + 1)
 #define ata_id_removeable(id)  ((id)[0] & (1 << 7))
 #define ata_id_has_dword_io(id) ((id)[50] & (1 << 0))
+#define ata_id_has_AN(id)  \
+   ((id[76] && (~id[76])) & ((id)[78] & (1 << 5)))
 #define ata_id_iordy_disable(id) ((id)[49] & (1 << 10))
 #define ata_id_has_iordy(id) ((id)[49] & (1 << 9))
 #define ata_id_u32(id,n)   \
Index: 2.6-git/include/linux/libata.h
===
--- 2.6-git.orig/include/linux/libata.h
+++ 2.6-git/include/linux/libata.h
@@ -136,6 +136,7 @@ enum {
ATA_DFLAG_CDB_INTR  = (1 << 2), /* device asserts INTRQ when ready for CDB */
ATA_DFLAG_NCQ   = (1 << 3), /* device supports NCQ */
ATA_DFLAG_FLUSH_EXT = (1 << 4), /* do FLUSH_EXT instead of FLUSH */
+   ATA_DFLAG_AN    = (1 << 5), /* device supports Async notification */
ATA_DFLAG_CFG_MASK  = (1 << 8) - 1,
 
ATA_DFLAG_PIO   = (1 << 8), /* device limited to PIO mode */
@@ -174,6 +175,7 @@ enum {
ATA_FLAG_SETXFER_POLLING = (1 << 14), /* use polling for SETXFER */

[patch 5/7] genhd: send async notification on media change

2007-04-23 Thread Kristen Carlson Accardi
Send an uevent to user space to indicate that a media change event has occurred.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/block/genhd.c
===
--- 2.6-git.orig/block/genhd.c
+++ 2.6-git/block/genhd.c
@@ -643,6 +643,25 @@ struct seq_operations diskstats_op = {
.show   = diskstats_show
 };
 
+static void media_change_notify_thread(struct work_struct *work)
+{
+   struct gendisk *gd = container_of(work, struct gendisk, async_notify);
+   char event[] = "MEDIA_CHANGE=1";
+   char *envp[] = { event, NULL };
+
+   /*
+* set environment vars to indicate which event this is for
+* so that user space will know to go check the media status.
+*/
+   kobject_uevent_env(&gd->kobj, KOBJ_CHANGE, envp);
+}
+
+void genhd_media_change_notify(struct gendisk *disk)
+{
+   schedule_work(&disk->async_notify);
+}
+EXPORT_SYMBOL_GPL(genhd_media_change_notify);
+
 struct gendisk *alloc_disk(int minors)
 {
return alloc_disk_node(minors, -1);
@@ -672,6 +691,8 @@ struct gendisk *alloc_disk_node(int mino
kobj_set_kset_s(disk,block_subsys);
kobject_init(&disk->kobj);
rand_initialize_disk(disk);
+   INIT_WORK(&disk->async_notify,
+   media_change_notify_thread);
}
return disk;
 }
Index: 2.6-git/include/linux/genhd.h
===
--- 2.6-git.orig/include/linux/genhd.h
+++ 2.6-git/include/linux/genhd.h
@@ -66,6 +66,7 @@ struct partition {
 #include <linux/smp.h>
 #include <linux/string.h>
 #include <linux/fs.h>
+#include <linux/workqueue.h>
 
 struct partition {
unsigned char boot_ind; /* 0x80 - active */
@@ -139,6 +140,7 @@ struct gendisk {
 #else
struct disk_stats dkstats;
 #endif
+   struct work_struct async_notify;
 };
 
 /* Structure for sysfs attributes on block devices */
@@ -419,7 +421,7 @@ extern struct gendisk *alloc_disk_node(i
 extern struct gendisk *alloc_disk(int minors);
 extern struct kobject *get_disk(struct gendisk *disk);
 extern void put_disk(struct gendisk *disk);
-
+extern void genhd_media_change_notify(struct gendisk *disk);
 extern void blk_register_region(dev_t dev, unsigned long range,
struct module *module,
struct kobject *(*probe)(dev_t, int *, void *),



[patch 0/7] Asynchronous Notification for ATAPI devices (v2)

2007-04-23 Thread Kristen Carlson Accardi
This patch series implements Asynchronous Notification (AN) for SATA
ATAPI devices as defined in SATA 2.5 and AHCI 1.1 and higher.  Drives
which support this feature will send a notification when media is
inserted or removed, removing the need for user space to poll for
new media.  This support is exposed to user space via a flag that will
be set in /sys/block/sr*/capability_flags.  If the flag is set, user
space can disable polling for new media, and the genhd driver will
send a KOBJ_CHANGE event with the envp set to MEDIA_CHANGE_EVENT=1.

Note that this patch only implements support for directly attached
drives - AN with drives attached to a port multiplier requires 
additional changes.

Thanks!
Kristen
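
A user-space consumer of the flag described above might look roughly like
the sketch below.  It assumes the sysfs file prints the flags word in hex
("%x\n", as in patch 2/7) and that GENHD_FL_MEDIA_CHANGE_NOTIFY is 4, also
from patch 2/7.  has_media_change_notify() is a hypothetical helper name,
and reading the file itself is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Bit value taken from patch 2/7 of this series (an assumption here). */
#define GENHD_FL_MEDIA_CHANGE_NOTIFY 4

/*
 * text is the content of a capability_flags file, e.g. "14\n".
 * Returns 1 if the AN bit is set, 0 if it is clear,
 * and -1 if text does not start with a hex number.
 */
int has_media_change_notify(const char *text)
{
	char *end;
	unsigned long flags;

	assert(text != NULL);

	errno = 0;
	flags = strtoul(text, &end, 16);
	if (end == text || errno != 0)
		return -1;

	return (flags & GENHD_FL_MEDIA_CHANGE_NOTIFY) ? 1 : 0;
}
```

A poller could call this once per drive and skip periodic media checks for
drives where it returns 1, falling back to polling in the other two cases.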



[patch 3/7] scsi: expose AN to user space

2007-04-23 Thread Kristen Carlson Accardi
Get the media change notification capability from the disk and pass this
information to genhd by setting the appropriate flag.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/drivers/scsi/sr.c
===
--- 2.6-git.orig/drivers/scsi/sr.c
+++ 2.6-git/drivers/scsi/sr.c
@@ -601,6 +601,8 @@ static int sr_probe(struct device *dev)
 
dev_set_drvdata(dev, cd);
disk->flags |= GENHD_FL_REMOVABLE;
+   if (sdev->media_change_notify)
+   disk->flags |= GENHD_FL_MEDIA_CHANGE_NOTIFY;
add_disk(disk);
 
sdev_printk(KERN_DEBUG, sdev,
Index: 2.6-git/include/scsi/scsi_device.h
===
--- 2.6-git.orig/include/scsi/scsi_device.h
+++ 2.6-git/include/scsi/scsi_device.h
@@ -124,7 +124,7 @@ struct scsi_device {
unsigned fix_capacity:1;/* READ_CAPACITY is too high by 1 */
unsigned guess_capacity:1;  /* READ_CAPACITY might be too high by 1 */
unsigned retry_hwerror:1;   /* Retry HARDWARE_ERROR */
-
+   unsigned media_change_notify:1; /* dev supports async media notify */
unsigned int device_blocked;/* Device returned QUEUE_FULL. */
 
unsigned int max_device_blocked; /* what device_blocked counts down from  */
Index: 2.6-git/drivers/scsi/sd.c
===
--- 2.6-git.orig/drivers/scsi/sd.c
+++ 2.6-git/drivers/scsi/sd.c
@@ -1706,6 +1706,9 @@ static int sd_probe(struct device *dev)
if (sdp->removable)
gd->flags |= GENHD_FL_REMOVABLE;
 
+   if (sdp->media_change_notify)
+   gd->flags |= GENHD_FL_MEDIA_CHANGE_NOTIFY;
+
dev_set_drvdata(dev, sdkp);
add_disk(gd);
 



[patch 4/7] libata: expose AN to user space

2007-04-23 Thread Kristen Carlson Accardi
If Asynchronous Notification of media change events is supported,
pass that information up to the SCSI layer.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/drivers/ata/libata-scsi.c
===
--- 2.6-git.orig/drivers/ata/libata-scsi.c
+++ 2.6-git/drivers/ata/libata-scsi.c
@@ -899,6 +899,9 @@ static void ata_scsi_dev_config(struct s
blk_queue_max_hw_segments(q, q->max_hw_segments - 1);
}
 
+   if (dev->flags & ATA_DFLAG_AN)
+   sdev->media_change_notify = 1;
+
if (dev->flags & ATA_DFLAG_NCQ) {
int depth;
 



[patch 6/7] SCSI: save disk in scsi_device

2007-04-23 Thread Kristen Carlson Accardi
Give anyone who has access to scsi_device access to the genhd struct as well.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]
Index: 2.6-git/drivers/scsi/sd.c
===
--- 2.6-git.orig/drivers/scsi/sd.c
+++ 2.6-git/drivers/scsi/sd.c
@@ -1711,6 +1711,7 @@ static int sd_probe(struct device *dev)
 
dev_set_drvdata(dev, sdkp);
add_disk(gd);
+   sdp->disk = gd;
 
sdev_printk(KERN_NOTICE, sdp, "Attached scsi %sdisk %s\n",
sdp->removable ? "removable " : "", gd->disk_name);
Index: 2.6-git/drivers/scsi/sr.c
===
--- 2.6-git.orig/drivers/scsi/sr.c
+++ 2.6-git/drivers/scsi/sr.c
@@ -604,6 +604,7 @@ static int sr_probe(struct device *dev)
if (sdev->media_change_notify)
disk->flags |= GENHD_FL_MEDIA_CHANGE_NOTIFY;
add_disk(disk);
+   sdev->disk = disk;
 
sdev_printk(KERN_DEBUG, sdev,
"Attached scsi CD-ROM %s\n", cd->cdi.name);
Index: 2.6-git/include/scsi/scsi_device.h
===
--- 2.6-git.orig/include/scsi/scsi_device.h
+++ 2.6-git/include/scsi/scsi_device.h
@@ -138,7 +138,7 @@ struct scsi_device {
 
struct device   sdev_gendev;
struct class_device sdev_classdev;
-
+   struct gendisk  *disk;
struct execute_work ew; /* used to get process context on put */
 
enum scsi_device_state sdev_state;



[patch 7/7] libata: send event when AN received

2007-04-23 Thread Kristen Carlson Accardi
When we get an SDB FIS with the 'N' bit set, we should send
an event to user space to indicate that there has been a
media change.  This will be done via the block device. 

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]
Index: 2.6-git/drivers/ata/ahci.c
===
--- 2.6-git.orig/drivers/ata/ahci.c
+++ 2.6-git/drivers/ata/ahci.c
@@ -1147,6 +1147,25 @@ static void ahci_host_intr(struct ata_po
return;
}
 
+   if (status & PORT_IRQ_SDB_FIS) {
+   /*
+* if this is an ATAPI device with AN turned on,
+* then we should interrogate the device to
+* determine the cause of the interrupt
+*
+* for AN - this we should check the SDB FIS
+* and find the I and N bits set
+*/
+   const u32 *f = pp->rx_fis + RX_FIS_SDB;
+
+   /* check the 'N' bit in word 0 of the FIS */
+   if (f[0] & (1 << 15)) {
+   int port_addr = ((f[0] & 0x0f00) >> 8);
+   struct ata_device *adev = &ap->device[port_addr];
+   if (adev->flags & ATA_DFLAG_AN)
+   ata_scsi_media_change_notify(adev);
+   }
+   }
if (ap->sactive)
qc_active = readl(port_mmio + PORT_SCR_ACT);
else
Index: 2.6-git/include/linux/libata.h
===
--- 2.6-git.orig/include/linux/libata.h
+++ 2.6-git/include/linux/libata.h
@@ -737,6 +737,7 @@ extern void ata_host_init(struct ata_hos
 extern int ata_scsi_detect(struct scsi_host_template *sht);
 extern int ata_scsi_ioctl(struct scsi_device *dev, int cmd, void __user *arg);
 extern int ata_scsi_queuecmd(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *));
+extern void ata_scsi_media_change_notify(struct ata_device *atadev);
 extern void ata_sas_port_destroy(struct ata_port *);
 extern struct ata_port *ata_sas_port_alloc(struct ata_host *,
   struct ata_port_info *, struct 
Scsi_Host *);
Index: 2.6-git/drivers/ata/libata-scsi.c
===
--- 2.6-git.orig/drivers/ata/libata-scsi.c
+++ 2.6-git/drivers/ata/libata-scsi.c
@@ -3057,6 +3057,22 @@ static void ata_scsi_remove_dev(struct a
 }
 
 /**
+ * ata_scsi_media_change_notify - send media change event
+ * @atadev: Pointer to the disk device with media change event
+ *
+ * Tell the block layer to send a media change notification
+ * event.
+ *
+ * LOCKING:
+ * interrupt context, may not sleep.
+ */
+void ata_scsi_media_change_notify(struct ata_device *atadev)
+{
+   genhd_media_change_notify(atadev->sdev->disk);
+}
+EXPORT_SYMBOL_GPL(ata_scsi_media_change_notify);
+
+/**
  * ata_scsi_hotplug - SCSI part of hotplug
  * @work: Pointer to ATA port to perform SCSI hotplug on
  *
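
The bit test in the ahci hunk above can be tried out in user space with a
sketch like the following.  The bit positions (bit 15 for 'N', bits 11:8
for the port address) are copied from that hunk; the macro and function
names are invented for illustration and are not part of libata.

```c
#include <assert.h>
#include <stdint.h>

/* Positions as used in the ahci hunk; names are hypothetical. */
#define SDB_FIS_N_BIT		(UINT32_C(1) << 15)
#define SDB_FIS_PORT_MASK	UINT32_C(0x0f00)

/*
 * Decode word 0 of a Set Device Bits FIS.
 * Returns the port address (0..15) if the 'N' bit is set,
 * or -1 if this FIS is not an asynchronous notification.
 */
int sdb_fis_notify_port(uint32_t word0)
{
	int port;

	if (!(word0 & SDB_FIS_N_BIT))
		return -1;		/* 'N' bit clear: nothing to report */

	port = (int)((word0 & SDB_FIS_PORT_MASK) >> 8);
	assert(port >= 0 && port <= 15);	/* four bits, so 0..15 */
	return port;
}
```

Since the field is four bits wide, a caller would presumably want to
range-check the result before using it as an array index.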



[patch 2/7] genhd: expose AN to user space

2007-04-23 Thread Kristen Carlson Accardi
Allow user space to determine if a disk supports Asynchronous Notification
of media changes.  This is done by adding a new sysfs file, capability_flags,
which is documented in Documentation/block/capability_flags.txt.  This sysfs
file will export all disk capability flags to user space.  We also define a
new flag to indicate the media change notification capability.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/block/genhd.c
===
--- 2.6-git.orig/block/genhd.c
+++ 2.6-git/block/genhd.c
@@ -370,7 +370,10 @@ static ssize_t disk_size_read(struct gen
 {
return sprintf(page, "%llu\n", (unsigned long long)get_capacity(disk));
 }
-
+static ssize_t disk_capability_read(struct gendisk *disk, char *page)
+{
+   return sprintf(page, "%x\n", disk->flags);
+}
 static ssize_t disk_stats_read(struct gendisk * disk, char *page)
 {
preempt_disable();
@@ -413,6 +416,10 @@ static struct disk_attribute disk_attr_s
.attr = {.name = "size", .mode = S_IRUGO },
.show   = disk_size_read
 };
+static struct disk_attribute disk_attr_capability = {
+   .attr = {.name = "capability_flags", .mode = S_IRUGO },
+   .show   = disk_capability_read
+};
 static struct disk_attribute disk_attr_stat = {
.attr = {.name = "stat", .mode = S_IRUGO },
.show   = disk_stats_read
@@ -453,6 +460,7 @@ static struct attribute * default_attrs[
&disk_attr_removable.attr,
&disk_attr_size.attr,
&disk_attr_stat.attr,
+   &disk_attr_capability.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
&disk_attr_fail.attr,
 #endif
Index: 2.6-git/include/linux/genhd.h
===
--- 2.6-git.orig/include/linux/genhd.h
+++ 2.6-git/include/linux/genhd.h
@@ -94,6 +94,7 @@ struct hd_struct {
 
 #define GENHD_FL_REMOVABLE 1
 #define GENHD_FL_DRIVERFS  2
+#define GENHD_FL_MEDIA_CHANGE_NOTIFY   4
 #define GENHD_FL_CD    8
 #define GENHD_FL_UP    16
 #define GENHD_FL_SUPPRESS_PARTITION_INFO   32
Index: 2.6-git/Documentation/block/capability_flags.txt
===
--- /dev/null
+++ 2.6-git/Documentation/block/capability_flags.txt
@@ -0,0 +1,15 @@
+Generic Block Device Capability Flags
+===
+This file documents the sysfs file block/disk/capability_flags
+
+capability_flags is a hex word indicating which capabilities a specific
+disk supports.  For more information on bits not listed here, see
+include/linux/genhd.h
+
+Capability Value
+---
+GENHD_FL_MEDIA_CHANGE_NOTIFY   4
+   When this bit is set, the disk supports Asynchronous Notification
+   of media change events.  These events will be broadcast to user
+   space via kernel uevent.
+



Re: Question about Reiser4

2007-04-23 Thread Neil Brown
On Monday April 23, [EMAIL PROTECTED] wrote:
 Theodore Tso wrote:
  
  One of the big problems of using a filesystem as a DB is the system
  call overheads.  If you use huge numbers of tiny files, then each
  attempt to read an atom of information from the DB takes three system
  calls --- an open(), read(), and close(), with all of the overheads in
  terms of dentry and inode cache.
  
 
 Now, to be fair, there are probably a number of cases where 
 open/lseek/readv/close and open/lseek/writev/close would be worth doing 
 as a single system call.  The big problem as far as I can see involves 
 EINTR handling; such a system call has serious restartability implications.
 
 Of course, there are Ingo's syslets...

Or you could think outside the circle:
Store all your small files as symlinks, then use symlink to create
them and readlink to read them. (You would probably end up using
symlinkat and readlinkat).
Only one system call instead of three.
I guess you don't get meaningful permission bits then... I wonder if
that really matters.

NeilBrown
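
A rough sketch of the symlink trick, for anyone who wants to play with it.
put_value() and get_value() are names made up here; the only real
interfaces used are symlink(2) and readlink(2).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Store a small value as the target of a symlink: one system call.
 * Returns 0 on success, -1 with errno set (e.g. EEXIST) on failure.
 */
int put_value(const char *key_path, const char *value)
{
	assert(key_path != NULL && value != NULL);
	return symlink(value, key_path);
}

/*
 * Fetch the value back: one system call.
 * readlink() does not NUL-terminate, so that is done here.
 * Returns the value length, or -1 with errno set.
 */
ssize_t get_value(const char *key_path, char *buf, size_t len)
{
	ssize_t n;

	assert(key_path != NULL && buf != NULL && len > 0);

	n = readlink(key_path, buf, len - 1);
	if (n >= 0)
		buf[n] = '\0';
	return n;
}
```

Besides the missing permission bits, a few limits show up quickly: the value
cannot be empty or contain a NUL byte, readlink() silently truncates to the
buffer size, and symlink() refuses to overwrite an existing key, so an
update needs an unlink first or a rename of a freshly created link.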


Re: Question about Reiser4

2007-04-23 Thread Theodore Tso
On Mon, Apr 23, 2007 at 04:53:03PM -0700, H. Peter Anvin wrote:
 Theodore Tso wrote:
 
 One of the big problems of using a filesystem as a DB is the system
 call overheads.  If you use huge numbers of tiny files, then each
 attempt to read an atom of information from the DB takes three system
 calls --- an open(), read(), and close(), with all of the overheads in
 terms of dentry and inode cache.
 
 
 Now, to be fair, there are probably a number of cases where 
 open/lseek/readv/close and open/lseek/writev/close would be worth doing 
 as a single system call.  The big problem as far as I can see involves 
 EINTR handling; such a system call has serious restartability implications.

Sure, but Hans wants to change /etc/inetd.conf into /etc/inetd.conf.d,
where you have: /etc/inetd.conf.d/telnet/port,
/etc/inetd.conf.d/telnet/protocol, /etc/inetd.conf.d/telnet/wait,
/etc/inetd.conf.d/telnet/userid, /etc/inetd.conf.d/telnet/daemon,
etc. for each individual line in /etc/inetd.conf.  (And where each
file might only contain 2-4 characters: i.e., 23, tcp,
root, etc.)

So it's not enough just to collapse open/pread/close into a single
system call.  In order to gain back the performance squandered by all
of these itsy-bitsy tiny little files, you want to collapse the
open/pread/close for many of these little files into a single system
call, hence Hans's insistence on sys_reiser4(); otherwise his scheme
doesn't work all that well at all.

- Ted
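
To put a number on the per-field layout, here is a small user-space sketch
that reads one-value-per-file fields and counts the system calls it issues.
read_fields() and FIELD_MAX are made-up names for illustration, and the
count covers only the calls visible in this code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define FIELD_MAX 16	/* assumed upper bound on a field's size, plus NUL */

/*
 * Read n one-value files from directory dir into out[i].
 * Returns how many system calls were issued, or -1 on error.
 */
long read_fields(const char *dir, const char *const names[], size_t n,
		 char out[][FIELD_MAX])
{
	long calls = 0;
	size_t i;
	int dfd;

	assert(dir != NULL && names != NULL && out != NULL);

	dfd = open(dir, O_RDONLY | O_DIRECTORY);	/* once per directory */
	calls++;
	if (dfd < 0)
		return -1;

	for (i = 0; i < n; i++) {
		ssize_t len;
		int fd = openat(dfd, names[i], O_RDONLY);

		calls++;
		if (fd < 0) {
			close(dfd);
			return -1;
		}
		len = pread(fd, out[i], FIELD_MAX - 1, 0);
		calls++;
		close(fd);
		calls++;
		if (len < 0) {
			close(dfd);
			return -1;
		}
		out[i][len] = '\0';
	}

	close(dfd);
	calls++;
	return calls;		/* 3 per field, plus 2 for the directory */
}
```

For the five telnet fields listed above that comes to 3 * 5 + 2 = 17 calls
for one service, against three calls to read the whole of a flat
/etc/inetd.conf.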



Re: Question about Reiser4

2007-04-23 Thread H. Peter Anvin

Neil Brown wrote:


Or you could think outside the circle:
Store all your small files as symlinks, then use symlink to create
them and readlink to read them. (You would probably end up using
symlinkat and readlinkat).
Only one system call instead of three.
I guess you don't get meaningful permission bits then... I wonder if
that really matters.



For some applications, oh yes it does.

-hpa


Re: [PATCH]Fix parsing kernelcore boot option for ia64

2007-04-23 Thread KAMEZAWA Hiroyuki
On Mon, 23 Apr 2007 19:32:46 +0100
[EMAIL PROTECTED] (Mel Gorman) wrote:

   I wasn't even aware of this kernelcore thing.  It's pretty nasty-looking. 
   yet another reminder that this code hasn't been properly reviewed in the
   past year or three.
  
  Just now, I'm making memory-unplug patches with current MOVABLE_ZONE
  code. So, I might be the first user of it on ia64.
  
  Anyway, I'll try to fix it.
  
 
 Can you review this patch and see does it fix the problem please? There
 was a second problem that showed up while testing this in relation to the
 bootmem allocator assumptions about zone boundary alignment. I'll follow up
 this mail with the patch in case you are seeing that problem.
 
 Subject: Fix parsing kernelcore boot option V2
 cmdline_parse_kernelcore() should return a pointer to the rest of the
 boot option string, as memparse() does. If it does not, the result is an
 endless loop on ia64 boxes. This patch is for 2.6.21-rc6-mm1. It changes
 the kernelcore command line parsing so that it is compatible with both
 the early_param() way of doing things and IA64.
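
Why the returned pointer matters can be shown with a user-space stand-in.
The sketch below is not the kernel's memparse(); parse_size() is a
hypothetical name, modeled on the behaviour described above: parse a number
with an optional K/M/G suffix and hand back a pointer to the first unparsed
character.

```c
#include <stdlib.h>     /* strtoull */

/* Parse "<number>[KMG]" and report where parsing stopped via *retptr.
 * User-space sketch only; parse_size is a hypothetical name. */
unsigned long long parse_size(const char *ptr, char **retptr)
{
	char *end;
	unsigned long long ret = strtoull(ptr, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		ret <<= 10;
		/* fall through */
	case 'M': case 'm':
		ret <<= 10;
		/* fall through */
	case 'K': case 'k':
		ret <<= 10;
		end++;          /* consume the suffix character */
		break;
	default:
		break;
	}
	if (retptr)
		*retptr = end;  /* the caller resumes scanning from here */
	return ret;
}
```

A caller that walks the command line option by option advances with the
returned pointer. If the parser left it pointing at the start of the option,
the caller would see the same option forever.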
 
As I understand it, the reason ia64 doesn't use the early_param() macro
for mem= et al. is that it has to use the mem= option during EFI
handling, which is called before parse_early_param().

The current ia64 boot path is
 setup_arch()
   -> efi handling -> parse_early_param() -> numa handling -> pgdat/zone init

The kernelcore= option is only used at pgdat/zone initialization (no
arch-dependent part).

So I think just adding
==
early_param("kernelcore", cmdline_parse_kernelcore);
==
to ia64 is OK.

-Kame



Re: [PATCH] Return EPERM not ECHILD on security_task_wait failure

2007-04-23 Thread Roland McGrath
 On Thu, 15 Mar 2007, Roland McGrath wrote:
 
  This patch makes do_wait return -EPERM instead of -ECHILD if some
  children were ruled out solely because security_task_wait failed.
 
 What about using the return value from the security_task_wait hook (which 
 should be -EACCES) ?

As I said in some earlier discussion following my original patch, that
would be fine with me.  I haven't coded up that variant, but it's simple
enough.  Would you like to do it?


Thanks,
Roland


Re: Question about Reiser4

2007-04-23 Thread H. Peter Anvin

Theodore Tso wrote:


Now, to be fair, there are probably a number of cases where 
open/lseek/readv/close and open/lseek/writev/close would be worth doing 
as a single system call.  The big problem as far as I can see involves 
EINTR handling; such a system call has serious restartability implications.


Sure, but Hans wants to change /etc/inetd.conf into /etc/inetd.conf.d,
where you have: /etc/inetd.conf.d/telnet/port,
/etc/inetd.conf.d/telnet/protocol, /etc/inetd.conf.d/telnet/wait,
/etc/inetd.conf.d/telnet/userid, /etc/inetd.conf.d/telnet/daemon,
etc. for each individual line in /etc/inetd.conf.  (And where each
file might contain only 2-4 characters: e.g., "23", "tcp",
"root", etc.)

So it's not enough just to collapse open/pread/close into a single
system call in order to gain back the performance squandered by all
of these itsy-bitsy tiny little files.  You want to collapse the
open/pread/close for many of these little files into a single system
call, hence Hans's insistence on sys_reiser4(); otherwise his scheme
doesn't work all that well at all.
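
To make the per-file cost concrete, here is a sketch of what reading one
such tiny file takes today: three system calls, each with its own EINTR
consideration. read_tiny_file() is a hypothetical helper name, and the
choice not to retry close() is an assumption based on Linux behaviour.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>      /* open */
#include <unistd.h>     /* pread, close */

/* Read a tiny file into buf and NUL-terminate it.
 * Returns the number of bytes read, or -1 with errno set. */
ssize_t read_tiny_file(const char *path, char *buf, size_t bufsiz)
{
	int fd, saved;
	ssize_t n;

	if (bufsiz == 0) {
		errno = EINVAL;
		return -1;
	}

	do {                            /* system call 1 */
		fd = open(path, O_RDONLY);
	} while (fd == -1 && errno == EINTR);
	if (fd == -1)
		return -1;

	do {                            /* system call 2 */
		n = pread(fd, buf, bufsiz - 1, 0);
	} while (n == -1 && errno == EINTR);

	saved = errno;
	close(fd);                      /* system call 3; not retried, since on
	                                 * Linux the descriptor is released even
	                                 * if close() reports EINTR */
	errno = saved;

	if (n == -1)
		return -1;
	buf[n] = '\0';
	return n;
}
```

A combined call would have to decide which of these steps may be restarted
after a signal, which is the restartability problem mentioned above.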



Heh.  sys_read_tree() -- walk a directory tree and return it as a data 
structure in memory :)


-hpa


Re: SLUB: kmem_cache_destroy doesn't - version 2.

2007-04-23 Thread Neil Brown
On Monday April 23, [EMAIL PROTECTED] wrote:
 Would this work? Contains a solution somewhat along the lines of your 
 thoughts on the subject.
 

Concept seems sound.
Code needs a kfree of the name returned by create_unique_id, and I
think ID_STR_LENGTH needs to be at least 34.
Maybe that should be allocated on the stack in sysfs_slab_add, rather
than using kmalloc/free.

Thanks,
NeilBrown


Re: [PATCH] Return EPERM not ECHILD on security_task_wait failure

2007-04-23 Thread James Morris
On Mon, 23 Apr 2007, Roland McGrath wrote:

 As I said in some earlier discussion following my original patch, that
 would be fine with me.  I haven't coded up that variant, but it's simple
 enough.  Would you like to do it?

Sure.


-- 
James Morris
[EMAIL PROTECTED]


Re: SLUB: kmem_cache_destroy doesn't - version 2.

2007-04-23 Thread Christoph Lameter
On Tue, 24 Apr 2007, Neil Brown wrote:

 On Monday April 23, [EMAIL PROTECTED] wrote:
  Would this work? Contains a solution somewhat along the lines of your 
  thoughts on the subject.
  
 
 Concept seems sound.
 Code needs a kfree of the name returned by create_unique_id, and I
 think ID_STR_LENGTH needs to be at least 34.

Sysfs copies the string?



Re: SLUB: kmem_cache_destroy doesn't - version 2.

2007-04-23 Thread Neil Brown
On Monday April 23, [EMAIL PROTECTED] wrote:
 On Tue, 24 Apr 2007, Neil Brown wrote:
 
  On Monday April 23, [EMAIL PROTECTED] wrote:
   Would this work? Contains a solution somewhat along the lines of your 
   thoughts on the subject.
   
  
  Concept seems sound.
  Code needs a kfree of the name returned by create_unique_id, and I
  think ID_STR_LENGTH needs to be at least 34.
 
 Sysfs copies the string?

kobject_set_name copies the string, either into a small char array in
the kobject, or into kmalloced space.
kobject_set_name actually takes a format and arbitrary args and uses
vsnprintf, so it has to make its own copy.
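
A user-space sketch of that pattern, for illustration only: this is not the
kobject code, and struct named_obj, obj_set_name() and INLINE_NAME_LEN are
hypothetical names. The function formats into storage it owns, either a
small array inside the object or heap space, so the caller's buffer can be
freed or reused right afterwards.

```c
#include <stdarg.h>
#include <stdio.h>      /* vsnprintf */
#include <stdlib.h>     /* malloc, free */

#define INLINE_NAME_LEN 20      /* hypothetical size of the in-object array */

struct named_obj {
	char *name;                         /* inline_name or heap memory */
	char inline_name[INLINE_NAME_LEN];
};

/* Format a name into storage owned by obj.  obj must start zeroed.
 * Passing obj's current name as an argument is not supported.
 * Returns 0 on success, -1 on failure. */
int obj_set_name(struct named_obj *obj, const char *fmt, ...)
{
	va_list args;
	int need;
	char *dst;

	va_start(args, fmt);
	need = vsnprintf(NULL, 0, fmt, args);   /* length without the NUL */
	va_end(args);
	if (need < 0)
		return -1;

	if ((size_t)need < sizeof(obj->inline_name)) {
		dst = obj->inline_name;
	} else {
		dst = malloc((size_t)need + 1);
		if (!dst)
			return -1;
	}

	va_start(args, fmt);
	vsnprintf(dst, (size_t)need + 1, fmt, args);
	va_end(args);

	if (obj->name && obj->name != obj->inline_name)
		free(obj->name);                /* drop an older heap copy */
	obj->name = dst;
	return 0;
}
```

With a setter like this, the temporary string can live on the caller's
stack or be freed as soon as the call returns.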

NeilBrown


Re: Update the list information for kexec and kdump

2007-04-23 Thread Simon Horman
On Mon, Apr 23, 2007 at 12:04:01PM -0600, Eric W. Biederman wrote:
 Simon Horman [EMAIL PROTECTED] writes:
 
  Update the list information for kexec and kdump
 
  Signed-off-by: Simon Horman [EMAIL PROTECTED]
 
  --- 
  Is it too early for this change?
 
 It looks like the new list is working, and isn't likely to get overwhelmed
 with spam.  I don't know if everyone has switched over yet but we can
 certainly update MAINTAINERS. 

Last time I checked there were 28 people in the kexec@ list.
This isn't everyone, but it is getting there.

May I add an Acked-by from you?

 Eric
 
  Index: linux-2.6/MAINTAINERS
  ===
  --- linux-2.6.orig/MAINTAINERS  2007-04-23 17:34:30.0 +0900
  +++ linux-2.6/MAINTAINERS   2007-04-23 17:34:47.0 +0900
  @@ -1951,7 +1951,7 @@ P:Vivek Goyal
   M: [EMAIL PROTECTED]
   P: Haren Myneni
   M: [EMAIL PROTECTED]
  -L: [EMAIL PROTECTED]
  +L: [EMAIL PROTECTED]
   L: linux-kernel@vger.kernel.org
   W: http://lse.sourceforge.net/kdump/
   S: Maintained
  @@ -2001,7 +2001,7 @@ P:Eric Biederman
   M: [EMAIL PROTECTED]
   W: http://www.xmission.com/~ebiederm/files/kexec/
   L: linux-kernel@vger.kernel.org
  -L: [EMAIL PROTECTED]
  +L: [EMAIL PROTECTED]
   S: Maintained
   
   KPROBES

-- 
Horms
  H: http://www.vergenet.net/~horms/
  W: http://www.valinux.co.jp/en/



Re: AppArmor FAQ

2007-04-23 Thread Crispin Cowan
David Wagner wrote:
 James Morris  wrote:
   
 [...] you can change the behavior of the application and then bypass 
 policy entirely by utilizing any mechanism other than direct filesystem 
 access: IPC, shared memory, Unix domain sockets, local IP networking, 
 remote networking etc.
 
 [...]
   
 Just look at their code and their own description of AppArmor.
 
 My gosh, you're right.  What the heck?  With all due respect to the
 developers of AppArmor, I can't help thinking that that's pretty lame.
 I think this raises substantial questions about the value of AppArmor.
 What is the point of having a jail if it leaves gaping holes that
 malicious code could use to escape?

 And why isn't this documented clearly, with the implications fully
 explained?

 I would like to hear the AppArmor developers defend this design decision.
   
It was a simplicity trade-off at the time, when AppArmor was mostly
aimed at servers, and there was no HAL or DBUS. Now it is definitely a
limitation that we are addressing. We are working on a mediation system
for what kind of IPC a confined process can do:
http://forge.novell.com/pipermail/apparmor-dev/2007-April/000503.html

When our IPC mediation system is code instead of vapor, it will also
appear here for review. Meanwhile, AppArmor does not make IPC security
any worse; confined processes are still subject to the usual Linux IPC
restrictions. AppArmor actually makes the IPC situation somewhat more
secure than stock Linux, e.g. normal DBUS deployment can be controlled
through file access permissions. But we are not claiming AppArmor to be
an IPC security enhancement, yet.

The proposed set of patches is a self-contained access control system
for file system access, and we would like it reviewed as such. Current
AppArmor docs are quite explicit that AppArmor only mediates file access
and POSIX.1e capabilities.

Crispin

-- 
Crispin Cowan, Ph.D.   http://crispincowan.com/~crispin/
Director of Software Engineering   http://novell.com



RE: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Li, Tong N
I don't know if we've discussed this or not. Since both CFS and SD claim
to be fair, I'd like to hear more opinions on the fairness aspect of
these designs. In areas such as OS, networking, and real-time, fairness,
and its more general form, proportional fairness, are well-defined
terms. In fact, perfect fairness is not feasible since it requires all
runnable threads to be running simultaneously and scheduled with
infinitesimally small quanta (like a fluid system). So to evaluate if a
new scheduling algorithm is fair, the common approach is to take the
ideal fair algorithm (often referred to as Generalized Processor
Sharing or GPS) as a reference model and analyze if the new algorithm
can achieve a constant error bound (different error metrics also exist).
I understand that via experiments we can show a design is reasonably
fair in the common case, but IMHO, to claim that a design is fair, there
needs to be some kind of formal analysis on the fairness bound, and this
bound should be proven to be constant. Even if the bound is not
constant, at least this analysis can help us better understand and
predict the degree of fairness that users would experience (e.g., would
the system be less fair if the number of threads increases? What happens
if a large number of threads dynamically join and leave the system?).
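
As a rough illustration of one such error metric (others exist, as noted
above), the sketch below computes the worst-case lag over one interval: the
absolute difference between the CPU time each thread would have received
under the ideal proportional-share model and the time it actually received.
It assumes every thread was runnable for the whole interval; max_lag() and
its parameters are hypothetical names.

```c
#include <stddef.h>

/* weight[i]:  share weight of thread i
 * service[i]: CPU time thread i actually received in the interval
 * Returns max over i of |ideal_i - service[i]|, where
 * ideal_i = total_service * weight[i] / total_weight. */
double max_lag(const double *weight, const double *service, size_t n)
{
	double total_w = 0.0, total_s = 0.0, worst = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		total_w += weight[i];
		total_s += service[i];
	}
	if (total_w <= 0.0)
		return 0.0;

	for (i = 0; i < n; i++) {
		double ideal = total_s * weight[i] / total_w;
		double lag = ideal - service[i];

		if (lag < 0.0)
			lag = -lag;
		if (lag > worst)
			worst = lag;
	}
	return worst;
}
```

A scheduler would have a constant bound in this sense if the value stayed
below some fixed constant no matter how long the interval is or how many
threads there are.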

  tong


Re: SLUB: kmem_cache_destroy doesn't - version 2.

2007-04-23 Thread Christoph Lameter
On Tue, 24 Apr 2007, Neil Brown wrote:

 kobject_set_name actually takes a format and arbitrary args and uses
 vsnprintf, so it has to make its own copy.

Ok then this should be fine...

SLUB: Fix sysfs directory handling

This fixes the problem that SLUB does not track the names of aliased
slabs by changing the way that SLUB manages the files in /sys/slab.

If the slab that is being operated on is not mergeable (usually the
case if we are debugging) then do not create any aliases. If an alias
exists that we conflict with then remove it before creating the
directory for the unmergeable slab. If there is a true slab cache there
and not an alias then we fail since there is a true duplication of
slab cache names. So debugging allows the detection of slab name
duplication as usual.

If the slab is mergeable then we create a directory with a unique name
created from the slab size, slab options and the pointer to the kmem_cache
structure (disambiguation). All names referring to the slabs will
then be created as symlinks to that unique name. These symlinks are
not going to be removed on kmem_cache_destroy() since we only carry
a counter for the number of aliases. If a new symlink is created
then it may just replace an existing one. This means that one can create
a gazillion slabs with the same name (if they all refer to mergeable
caches). It will only increase the alias count. So we have the potential
of not detecting duplicate slab names (there is actually no harm
done by doing that). We will detect the duplications as
soon as debugging is enabled because we will then no longer
generate symlinks and special unique names.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
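
For illustration, the naming scheme described above can be sketched in user
space with a bounded snprintf() into a caller-supplied buffer, which could
live on the stack. This is not the patch's code: format_unique_id() and the
ID_FLAG_* values are hypothetical stand-ins for the slab flags, and the
address is printed through uintptr_t so the output does not depend on how
%p is rendered.

```c
#include <inttypes.h>   /* PRIxPTR, uintptr_t */
#include <stdio.h>      /* snprintf */

/* Hypothetical stand-ins for the slab flags that end up in the id. */
enum {
	ID_FLAG_DMA        = 1 << 0,    /* 'd' */
	ID_FLAG_RECLAIM    = 1 << 1,    /* 'a' */
	ID_FLAG_RCU        = 1 << 2,    /* 'r' */
	ID_FLAG_RED_ZONE   = 1 << 3,    /* 'Z' */
	ID_FLAG_POISON     = 1 << 4,    /* 'P' */
	ID_FLAG_STORE_USER = 1 << 5     /* 'U' */
};

/* Build ":[flags-]size:0xaddress" in buf.
 * Returns the string length, or -1 if it does not fit in len bytes. */
int format_unique_id(char *buf, size_t len, unsigned flags, int size,
		     const void *cache)
{
	char opts[8];           /* up to 6 flag chars, '-', NUL */
	size_t n = 0;
	int w;

	if (flags & ID_FLAG_DMA)        opts[n++] = 'd';
	if (flags & ID_FLAG_RECLAIM)    opts[n++] = 'a';
	if (flags & ID_FLAG_RCU)        opts[n++] = 'r';
	if (flags & ID_FLAG_RED_ZONE)   opts[n++] = 'Z';
	if (flags & ID_FLAG_POISON)     opts[n++] = 'P';
	if (flags & ID_FLAG_STORE_USER) opts[n++] = 'U';
	if (n)
		opts[n++] = '-';
	opts[n] = '\0';

	w = snprintf(buf, len, ":%s%07d:0x%" PRIxPTR, opts, size,
		     (uintptr_t)cache);
	return (w < 0 || (size_t)w >= len) ? -1 : w;
}
```

With all six flag characters, a seven-digit size and a 16-digit pointer the
string is 34 characters long before the terminating NUL, so checking the
snprintf() result is safer than relying on the buffer being large enough.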

Index: linux-2.6.21-rc6/mm/slub.c
===
--- linux-2.6.21-rc6.orig/mm/slub.c 2007-04-23 13:08:41.0 -0700
+++ linux-2.6.21-rc6/mm/slub.c  2007-04-23 18:05:16.0 -0700
@@ -3307,16 +3307,68 @@ static struct kset_uevent_ops slab_ueven
 
 decl_subsys(slab, &slab_ktype, &slab_uevent_ops);
 
+#define ID_STR_LENGTH 64
+
+/* Create a unique string id for a slab cache:
+ * format
+ * :[flags-]size:[memory address of kmemcache]
+ */
+static char *create_unique_id(struct kmem_cache *s)
+{
+	char *name = kmalloc(ID_STR_LENGTH, GFP_KERNEL);
+	char *p = name;
+
+	BUG_ON(!name);
+
+	*p++ = ':';
+	/*
+	 * First flags affecting slabcache operations */
+	if (s->flags & SLAB_CACHE_DMA)
+		*p++ = 'd';
+	if (s->flags & SLAB_RECLAIM_ACCOUNT)
+		*p++ = 'a';
+	if (s->flags & SLAB_DESTROY_BY_RCU)
+		*p++ = 'r';
+	/* Debug flags */
+	if (s->flags & SLAB_RED_ZONE)
+		*p++ = 'Z';
+	if (s->flags & SLAB_POISON)
+		*p++ = 'P';
+	if (s->flags & SLAB_STORE_USER)
+		*p++ = 'U';
+	if (p != name + 1)
+		*p++ = '-';
+	p += sprintf(p, "%07d:0x%p", s->size, s);
+	BUG_ON(p > name + ID_STR_LENGTH - 1);
+	return name;
+}
+
 static int sysfs_slab_add(struct kmem_cache *s)
 {
 	int err;
+	const char *name;
 
 	if (slab_state < SYSFS)
 		/* Defer until later */
 		return 0;
 
+	if (s->flags & SLUB_NEVER_MERGE) {
+		/*
+		 * Slabcache can never be merged so we can use the name proper.
+		 * This is typically the case for debug situations. In that
+		 * case we can catch duplicate names easily.
+		 */
+		sysfs_remove_link(&slab_subsys.kset.kobj, s->name);
+		name = s->name;
+	} else
+		/*
+		 * Create a unique name for the slab as a target
+		 * for the symlinks.
+		 */
+		name = create_unique_id(s);
+
 	kobj_set_kset_s(s, slab_subsys);
-	kobject_set_name(&s->kobj, s->name);
+	kobject_set_name(&s->kobj, name);
 	kobject_init(&s->kobj);
 	err = kobject_add(&s->kobj);
 	if (err)
@@ -3326,6 +3378,10 @@ static int sysfs_slab_add(struct kmem_ca
 	if (err)
 		return err;
 	kobject_uevent(&s->kobj, KOBJ_ADD);
+	if (!(s->flags & SLUB_NEVER_MERGE)) {
+		sysfs_slab_alias(s, s->name);
+		kfree(name);
+	}
 	return 0;
 }
 
@@ -3351,9 +3407,14 @@ static int sysfs_slab_alias(struct kmem_
 {
 	struct saved_alias *al;
 
-	if (slab_state == SYSFS)
+	if (slab_state == SYSFS) {
+		/*
+		 * If we have a leftover link then remove it.
+		 */
+		sysfs_remove_link(&slab_subsys.kset.kobj, name);
 		return sysfs_create_link(&slab_subsys.kset.kobj,
 						&s->kobj, name);
+	}
 
 	al = kmalloc(sizeof(struct saved_alias), GFP_KERNEL);
 	if (!al)

Re: [ANNOUNCE][PATCH] Kcli - Kernel command line interface.

2007-04-23 Thread Matt Ranon
 (text reformatted to less than 80 cols.  Please, we'll get along a lot
 better if you don't send 1000-column emails)

Sorry. I am afraid we are from a different background, and so very
poorly versed in these things. My email client does not seem
to have an option to tell it to format in 80 cols. So, hopefully,
using CR, I am achieving the same effect. Let me know if
it doesn't work, and I will have to switch to a different email
client for conversing with the lkml.


 The obvious question is: what's _wrong_ with doing all this in some
 cut-down userspace environment like busybox?  Why is this stuff better?
 
 Obviously some embedded developers have considered that option and
 have rejected it.  But we do need to be told, at length, why that
 decision was made.

There is nothing _wrong_ with doing it all in a cut-down userspace. It
is a matter of personal preference, culture, and the application. That
is what makes Linux so great, it is all about choice.

We are developing devices that don't have a user space, and we don't
see the point in including one just for debug purposes. We will not be 
offended if Kcli is not included into the kernel mainline, nor if Kcli compels
people to call us stupid (as it already has) just because we are different 
and some people don't understand us. We are firm believers that the 
world, including the Linux kernel world,  would be a nasty place if there 
was only _one_ way to do any given task. Additionally, we  are almost 
certain that there will be others who think like we do, so we are reaching 
out to them. We also feel compelled to give _something_ back to the 
community that has given so much to us, and, for now, this is all we have.

However, our reasons for Kcli are:
1) Our devices ship with no user space, and we want the development
environment to be as close as possible to the final product.
2) Getting debug information with user space calls requires context
switches and data copies, which change the real-time profile and can mask
bugs.
3) To use user space, we would need cross compiled libc's, special
builds of gcc, root file systems, flash storage to store it all, and all 
sorts of things which make life a lot more complicated than it needs 
to be for us. We are quite capable of producing all these things, but,
we just don't see the point in it. Our way, we just have a gcc capable 
of cross compiling the kernel and it is so simple.
4) For us, it is the opposite argument. We would need to be convinced
that having user space is worth all the overhead. Not just CPU
overhead, but all the overheads.
5) We like it in the kernel, we find it to be warm and fuzzy. Whereas,
user space is a cold, dark, and rainy place, and we just don't want to
go there. :)

We do not claim to have come up with a _better_ way. We have just
created something that we feel would be useful to others.

MRanon.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Rik van Riel wrote:

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---

Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them



Yes, but I'm wondering if it is legal in all architectures.



It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.


The question is whether the architecture specific tlb
flushing code will break or not.



4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte



We don't when the ptl is split.



Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.


What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.

--
SUSE Labs, Novell Inc.


Re: Question about Reiser4

2007-04-23 Thread Theodore Tso
On Mon, Apr 23, 2007 at 05:31:29PM -0700, H. Peter Anvin wrote:
 Heh.  sys_read_tree() -- walk a directory tree and return it as a data 
 structure in memory :)

But maybe you don't want every single file in the directory, but some
subset of the files in the directory tree.  So before you know it:

sys_fs_sql("SELECT port,userid,daemon FROM /etc/inetd.conf.d "
           "WHERE protocol=='tcp'", buf, sizeof(buf));

The question is where do you stop on the slippery slope, and is it
really all that much harder to simply parse a /etc/gitconfig or
/etc/e2fsck.conf file?  There are plenty of parsers and database
libraries already written, and many of them are quite efficient.  And
personally, I'd much rather edit a single /etc/gitconfig or
/etc/e2fsck.conf file using emacs than have to cd through 3 or 4
levels of directories to edit each 2-3 byte file one at a time.  But
to each their own.
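
As a point of comparison, pulling the same fields out of one line of a
single flat file is not much code. The sketch below assumes the classic
whitespace-separated inetd.conf layout (service, socket type, protocol,
wait flag, user, server program) and ignores the trailing argument list;
struct inetd_entry and parse_inetd_line() are hypothetical names.

```c
#include <stdio.h>      /* sscanf */

struct inetd_entry {
	char service[32];
	char socktype[16];
	char protocol[16];
	char wait[16];
	char user[32];
	char program[128];
};

/* Parse one line of an inetd.conf-style file.
 * Returns 0 on success, -1 for blank, comment or malformed lines. */
int parse_inetd_line(const char *line, struct inetd_entry *e)
{
	int n;

	while (*line == ' ' || *line == '\t')
		line++;
	if (*line == '#' || *line == '\n' || *line == '\0')
		return -1;

	/* Field widths are one less than the array sizes to leave room
	 * for the terminating NUL. */
	n = sscanf(line, "%31s %15s %15s %15s %31s %127s",
		   e->service, e->socktype, e->protocol,
		   e->wait, e->user, e->program);
	return n == 6 ? 0 : -1;
}
```

One open(), one read() of the whole file and a loop over its lines then
replaces an open/read/close triple per field.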

- Ted


[PATCH try #2] Return access error not ECHILD on security_task_wait failure

2007-04-23 Thread James Morris
From: Roland McGrath [EMAIL PROTECTED]

wait* syscalls return -ECHILD even when an individual PID of a live child
was requested explicitly, when security_task_wait denies the operation.
This means that something like a broken SELinux policy can produce an
unexpected failure that looks just like a bug with wait or ptrace or
something.

This patch makes do_wait return -EACCES (or another appropriate
error returned from security_task_wait()) instead of -ECHILD if some
children were ruled out solely because security_task_wait failed.

Signed-off-by: James Morris [EMAIL PROTECTED]
---

Updated version, returns value from security_task_wait().
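
The error selection can be modeled in a few lines of user-space C. This is
only a sketch of the rule stated above, not the kernel code:
pick_wait_error() is a hypothetical name, and it covers just the case where
the scan found nothing that could be waited for.

```c
#include <errno.h>
#include <stddef.h>

/* verdicts[i] is the per-child eligibility result:
 *    0  the child does not match the wait criteria,
 *   >0  the child is eligible,
 *   <0  the security hook denied access (a negative errno value).
 * Returns the error to report when the scan found nothing to wait for. */
int pick_wait_error(const int *verdicts, size_t n)
{
	int allowed = 0, denied = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (verdicts[i] == 0)
			continue;
		if (verdicts[i] < 0) {
			denied = verdicts[i];   /* remember the hook's error */
			continue;
		}
		allowed = 1;
	}

	/* Only report the hook's error if denial was the sole reason
	 * that no child qualified. */
	if (denied && !allowed)
		return denied;
	return -ECHILD;
}
```

So a caller with one live child that the policy hides sees the hook's error
rather than a misleading "no children".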


 kernel/exit.c |   17 +++--
 1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index b55ed4c..9236924 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1033,6 +1033,8 @@ asmlinkage void sys_exit_group(int error_code)
 
 static int eligible_child(pid_t pid, int options, struct task_struct *p)
 {
+	int err;
+
 	if (pid > 0) {
 		if (p->pid != pid)
 			return 0;
@@ -1066,8 +1068,9 @@ static int eligible_child(pid_t pid, int options, struct task_struct *p)
 	if (delay_group_leader(p))
 		return 2;
 
-	if (security_task_wait(p))
-		return 0;
+	err = security_task_wait(p);
+	if (err)
+		return err;
 
 	return 1;
 }
@@ -1449,6 +1452,7 @@ static long do_wait(pid_t pid, int options, struct siginfo __user *infop,
 	DECLARE_WAITQUEUE(wait, current);
 	struct task_struct *tsk;
 	int flag, retval;
+	int allowed, denied;
 
 	add_wait_queue(&current->signal->wait_chldexit,&wait);
 repeat:
@@ -1457,6 +1461,7 @@ repeat:
 	 * match our criteria, even if we are not able to reap it yet.
 	 */
 	flag = 0;
+	allowed = denied = 0;
 	current->state = TASK_INTERRUPTIBLE;
 	read_lock(&tasklist_lock);
 	tsk = current;
@@ -1472,6 +1477,12 @@ repeat:
 			if (!ret)
 				continue;
 
+			if (unlikely(ret < 0)) {
+				denied = ret;
+				continue;
+			}
+			allowed = 1;
+
 			switch (p->state) {
 			case TASK_TRACED:
 				/*
@@ -1570,6 +1581,8 @@ check_continued:
 		goto repeat;
 	}
 	retval = -ECHILD;
+	if (unlikely(denied) && !allowed)
+		retval = denied;
 end:
 	current->state = TASK_RUNNING;
 	remove_wait_queue(&current->signal->wait_chldexit,&wait);
-- 
1.5.0.6



Re: [PATCH 16/25] xen: Use the hvc console infrastructure for Xen console

2007-04-23 Thread Olof Johansson
On Mon, Apr 23, 2007 at 02:56:54PM -0700, Jeremy Fitzhardinge wrote:
 Implement a Xen back-end for hvc console.
 
 From: Gerd Hoffmann [EMAIL PROTECTED]
 Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
 
 ---
  arch/i386/xen/Kconfig |1 
  arch/i386/xen/events.c|3 -
  drivers/Makefile  |3 +
  drivers/xen/Makefile  |1 
  drivers/xen/hvc-console.c |  134 
 +
  include/xen/events.h  |1 
  6 files changed, 142 insertions(+), 1 deletion(-)

If you move the driver to drivers/char/hvc_xen.c instead, you won't have to 
do...

 +#include "../char/hvc_console.h"

...this.

Other single-platform backend hvc drivers are under drivers/char already.


-Olof
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 03/25] xen: Add nosegneg capability to the vsyscall page notes

2007-04-23 Thread Jeremy Fitzhardinge
Roland McGrath wrote:
 + * It should contain:
 + *  hwcap 0 nosegneg
 + * to match the mapping of bit to name that we give here.
 

 This needs to be "hwcap 1 nosegneg" to match:

   
 +NOTE_KERNELCAP_BEGIN(1, 2)
 +NOTE_KERNELCAP(1, nosegneg)
 +NOTE_KERNELCAP_END
 

 The actual bits you are using should be fine.  (You're intentionally
 skipping bit 0 to work around old glibc bugs, which you might want to add
 to the comments.  Also a comment or perhaps using 1<<1 syntax would make it
 more clear that 2 is the bit mask containing bit 1 and that's why it has
 to be 2, and not because of some other magical property of 2.)  But if
 kernel packagers don't write the matching bit number in their ld.so.conf.d
 files, then ld.so.cache lookups won't work right.
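
The bit-number versus bit-mask distinction made above, spelled out as a
sketch. NOSEGNEG_BIT and hwcap_mask() are hypothetical names used only to
show the arithmetic; they are not taken from the patch or from glibc.

```c
/* Bit index used for the capability in this illustration. */
#define NOSEGNEG_BIT 1

/* Mask with only the given bit set: bit 0 -> 1, bit 1 -> 2, bit 2 -> 4. */
unsigned long hwcap_mask(unsigned int bit)
{
	return 1UL << bit;
}
```

Writing the second argument as 1 << NOSEGNEG_BIT instead of a bare 2 makes
it visible that the two numbers in the note describe the same bit.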

I have to admit I still don't really understand all this.  Is it
documented somewhere?

What does "hwcap 0 nosegneg" actually mean?  What does the 0 mean here?

In the ELF note, what does the "nosegneg" string mean?  How is it used?
Is it compared to the "nosegneg" in ld.so.conf?  How does this relate to
the bitfields?

Thanks,
J


Re: Update the list information for kexec and kdump

2007-04-23 Thread Eric W. Biederman
Simon Horman [EMAIL PROTECTED] writes:

 On Mon, Apr 23, 2007 at 12:04:01PM -0600, Eric W. Biederman wrote:
 Simon Horman [EMAIL PROTECTED] writes:
 
  Update the list information for kexec and kdump
 
  Signed-off-by: Simon Horman [EMAIL PROTECTED]
 
  --- 
  Is it too early for this change?
 
 It looks like the new list is working, and isn't likely to get overwhelmed
 with spam.  I don't know if everyone has switched over yet but we can
 certainly update MAINTAINERS. 

 Last time I checked there were 28 people in the kexec@ list.
 This isn't everyone, but it is getting there.

 May I add an Acked-by from you?

Sure.

Eric


Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Amit Gud

On Mon, 23 Apr 2007, Amit Gud wrote:


On Mon, 23 Apr 2007, Arjan van de Ven wrote:



  The other thing which we should consider is that chunkfs really
  requires a 64-bit inode number space, which means either we only allow

 does it?
 I'd think it needs a chunk space number and a 32 bit local inode
 number ;) (same for blocks)



For inodes, yes: either a 64-bit inode number or some field for the id of
the chunk in which the inode lives. But for block numbers, you don't.
Individual chunks each manage their part of the whole file system
independently, so their block bitmaps simply start at an offset. Inode
bitmaps, however, remain the same.




In that sense, we can also do without encoding the chunk identifier
into the inode number, and chunkfs would still be fine with it. But we
would then lose the inode uniqueness property, which could well be OK, as
it is with other file systems in which the inode number is not sufficient
to uniquely identify an inode.
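
A sketch of the encoding under discussion, assuming the 32/32 split
suggested earlier in the thread (a chunk number plus a 32-bit local inode
number). The function names are hypothetical and nothing here is taken from
the chunkfs code.

```c
#include <stdint.h>

/* Combine a chunk id and a chunk-local inode number into one
 * file-system-wide 64-bit inode number. */
uint64_t chunk_ino_pack(uint32_t chunk_id, uint32_t local_ino)
{
	return ((uint64_t)chunk_id << 32) | local_ino;
}

/* Recover the chunk id from a packed inode number. */
uint32_t chunk_ino_chunk(uint64_t ino)
{
	return (uint32_t)(ino >> 32);
}

/* Recover the chunk-local inode number from a packed inode number. */
uint32_t chunk_ino_local(uint64_t ino)
{
	return (uint32_t)(ino & 0xffffffffu);
}
```

Without the chunk id in the upper half, local inode 42 in chunk 0 and local
inode 42 in chunk 1 would share a number, which is the loss of uniqueness
mentioned above.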



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud


Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-23 Thread Benjamin Herrenschmidt

 The only reason for using threads here is to get the error recovery
 out of an interrupt context (where errors may be detected), and then,
 an hour later, decrement a counter (which is how we limit these to 
 6 per hour). Thread reaping is trivial, the thread just exits
 after an hour.

In addition, it should be a thread and not done from within keventd
because :

 - It can take a long time (well, relatively but still too long for a
work queue)

 - The driver callbacks might need to use keventd or do flush_workqueue
to synchronize with their own workqueues when doing an internal
recovery.

 Since these events are rare, I've no particular concern about
 performance or resource consumption. The current code seems 
 to work just fine. :-)

I think moving to kthreads is cleaner (it is just a wrapper around kernel
threads that mostly simplifies reaping them) and I agree
with Christoph that it would be nice to be able to fire off kthreads
from interrupt context. In many cases, we abuse work queues for things
that should really be done from kthreads instead (basically anything that
takes more than a couple hundred microsecs or so).

Ben.




Re: [PATCH 22/25] xen: xen-netfront: use skb.cb for storing private data

2007-04-23 Thread Herbert Xu
On Mon, Apr 23, 2007 at 02:57:00PM -0700, Jeremy Fitzhardinge wrote:
 Netfront's use of nh.raw and h.raw for storing page+offset is a bit
 hinky, and it breaks with upcoming network stack updates which reduce
 these fields to sub-pointer sizes.  Fortunately, skb offers the cb
 field specifically for stashing this kind of info, so use it.
 
 Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
 Cc: Herbert Xu [EMAIL PROTECTED]
 Cc: Chris Wright [EMAIL PROTECTED]
 Cc: Christian Limpach [EMAIL PROTECTED]

Thanks Jeremy.  The patch looks good.
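
For readers unfamiliar with the idiom, here is a user-space sketch of
stashing private data in a fixed-size scratch area such as cb. struct
packet, struct page_ref and CB_SIZE are hypothetical and the size is an
assumption. Kernel code typically casts the cb array to a struct pointer;
memcpy() is used here only to keep the user-space example free of alignment
and aliasing concerns.

```c
#include <string.h>     /* memcpy */

#define CB_SIZE 48      /* assumed size of the scratch area */

/* Stand-in for a packet buffer with a per-layer scratch area. */
struct packet {
	unsigned int len;
	char cb[CB_SIZE];
};

/* The private data a driver wants to carry along with the packet. */
struct page_ref {
	void *page;
	unsigned int offset;
};

/* Fail the build if the private data no longer fits. */
_Static_assert(sizeof(struct page_ref) <= CB_SIZE,
	       "private data does not fit in cb");

void packet_set_priv(struct packet *p, const struct page_ref *ref)
{
	memcpy(p->cb, ref, sizeof(*ref));
}

struct page_ref packet_get_priv(const struct packet *p)
{
	struct page_ref ref;

	memcpy(&ref, p->cb, sizeof(ref));
	return ref;
}
```

The compile-time size check is the part worth copying: it catches the kind
of breakage that field-size changes elsewhere would otherwise cause
silently.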

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


PROBLEM: Oops: 0002 [1] SMP

2007-04-23 Thread Thiago M.

[1] Summary:

Kernel Reports Oops: 0002 [1] SMP and the system becomes unstable

[2] Full Description:

Sometimes, at random, I get this Oops message and the system becomes
unstable. By unstable I mean that every application I run after the Oops
dies with a segmentation fault. Sometimes X crashes, sometimes the machine
just reboots (the reboot might be a separate problem, though).

This happens with kernel 2.6.20 and with 2.6.21-rc7. It was happening with
2.6.20, so I tried 2.6.21-rc7, and it happens there too.

[EMAIL PROTECTED]:/var/log$ uname -a
Linux sayao-desktop 2.6.21-rc7-sayao #2 SMP Mon Apr 16 22:11:36 BRT 2007
x86_64 GNU/Linux

Here is the log:

Apr 22 21:44:33 sayao-desktop kernel: [18641.553890] Unable to handle
kernel paging request at 3e82 RIP:
Apr 22 21:44:33 sayao-desktop kernel: [18641.553899]  [__alloc_skb
+188/321] __alloc_skb+0xbc/0x141
Apr 22 21:44:33 sayao-desktop kernel: [18641.553911] PGD 203027 PUD 0
Apr 22 21:44:33 sayao-desktop kernel: [18641.553915] Oops: 0002 [1] SMP
Apr 22 21:44:33 sayao-desktop kernel: [18641.553919] CPU 0
Apr 22 21:44:33 sayao-desktop kernel: [18641.553922] Modules linked in:
binfmt_misc rfcomm l2cap bluetooth i915 drm ppdev capability commoncap
acpi_cpufreq cpufreq_userspace cpufreq_stats cpufreq_conservative
cpufreq_ondemand cpufreq_powersave freq_table asus_acpi container sbs
i2c_ec i2c_core battery video dock ac button ipv6 lp fuse snd_hda_intel
snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy
snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq psmouse
snd_timer snd_seq_device snd parport_pc parport shpchp serio_raw pcspkr
soundcore snd_page_alloc iTCO_wdt iTCO_vendor_support pci_hotplug
intel_agp af_packet evdev tsdev ext3 jbd mbcache sg ide_cd cdrom sd_mod
ata_generic usbhid hid ata_piix libata scsi_mod e100 mii ehci_hcd
generic piix uhci_hcd usbcore thermal processor fan
Apr 22 21:44:33 sayao-desktop kernel: [18641.553997] Pid: 13805, comm:
evolution Not tainted 2.6.21-rc7-sayao #2
Apr 22 21:44:33 sayao-desktop kernel: [18641.554001] RIP:
0010:[__alloc_skb+188/321]  [__alloc_skb+188/321] __alloc_skb+0xbc/0x141
Apr 22 21:44:33 sayao-desktop kernel: [18641.554007] RSP:
0018:810033169bd8  EFLAGS: 00010246
Apr 22 21:44:33 sayao-desktop kernel: [18641.554011] RAX:
3e82 RBX: 0002 RCX: 
Apr 22 21:44:33 sayao-desktop kernel: [18641.554014] RDX:
 RSI:  RDI: 81003b1bfa50
Apr 22 21:44:33 sayao-desktop kernel: [18641.554017] RBP:
3e80 R08: 0002 R09: 
Apr 22 21:44:33 sayao-desktop kernel: [18641.554021] R10:
81003b1bf980 R11: 00d0 R12: 81003b1bf980
Apr 22 21:44:33 sayao-desktop kernel: [18641.554024] R13:
81003f2109c0 R14: 04d0 R15: 3e80
Apr 22 21:44:33 sayao-desktop kernel: [18641.554028] FS:
2ad334669ea0() GS:8052f000() knlGS:
Apr 22 21:44:33 sayao-desktop kernel: [18641.554032] CS:  0010 DS: 
ES:  CR0: 80050033
Apr 22 21:44:33 sayao-desktop kernel: [18641.554035] CR2:
3e82 CR3: 27bb3000 CR4: 06e0
Apr 22 21:44:33 sayao-desktop kernel: [18641.554039] Process evolution
(pid: 13805, threadinfo 810033168000, task 810009665000)
Apr 22 21:44:33 sayao-desktop kernel: [18641.554042] Stack:
09665000 81002f9f5080 3e80 
Apr 22 21:44:33 sayao-desktop kernel: [18641.554049]  04d0
810033169ce4 3e80 803a6d82
Apr 22 21:44:33 sayao-desktop kernel: [18641.554055]  
0206 80507110 81dadc50
Apr 22 21:44:33 sayao-desktop kernel: [18641.554061] Call Trace:
Apr 22 21:44:33 sayao-desktop kernel: [18641.554086]
[sock_alloc_send_skb+130/478] sock_alloc_send_skb+0x82/0x1de
Apr 22 21:44:33 sayao-desktop kernel: [18641.554126]
[unix_stream_sendmsg+392/880] unix_stream_sendmsg+0x188/0x370
Apr 22 21:44:33 sayao-desktop kernel: [18641.554181]  [sock_aio_write
+293/313] sock_aio_write+0x125/0x139
Apr 22 21:44:33 sayao-desktop kernel: [18641.554247]  [do_sync_write
+207/277] do_sync_write+0xcf/0x115
Apr 22 21:44:33 sayao-desktop kernel: [18641.554287]
[autoremove_wake_function+0/48] autoremove_wake_function+0x0/0x30
Apr 22 21:44:33 sayao-desktop kernel: [18641.554352]  [vfs_write
+228/348] vfs_write+0xe4/0x15c
Apr 22 21:44:33 sayao-desktop kernel: [18641.554369]  [sys_write+69/121]
sys_write+0x45/0x79
Apr 22 21:44:33 sayao-desktop kernel: [18641.554393]  [system_call
+126/131] system_call+0x7e/0x83
Apr 22 21:44:33 sayao-desktop kernel: [18641.554434] 
Apr 22 21:44:33 sayao-desktop kernel: [18641.554436] 
Apr 22 21:44:33 sayao-desktop kernel: [18641.554437] Code: c7 00 01 00
00 00 66 c7 40 04 00 00 66 c7 40 06 00 00 66 c7 
Apr 22 21:44:33 sayao-desktop kernel: [18641.554453] RIP  [__alloc_skb
+188/321] __alloc_skb+0xbc/0x141
Apr 22 21:44:33 sayao-desktop kernel: [18641.554458]  RSP
810033169bd8
Apr 22 

Re: [report] renicing X, cfs-v5 vs sd-0.46

2007-04-23 Thread Gene Heskett
On Monday 23 April 2007, Niel Lambrechts wrote:
Gene Heskett wrote:
 This message prompted me to do some checking in re context switches
 myself, and I've come to the conclusion that there could be a bug in
 vmstat itself.

Perhaps. perhaps not. :)

 Run singly the context switching is reasonable even for a -19 niceness of
 x, its only showing about 200 or so on the first loop of vmstat.  But
 throw in the -n 1 arguments and it goes crazy on the second and subsequent
 loops.

man vmstat:
The first report produced gives averages since the last reboot.
Additional reports  give information on a sampling period of length delay.

I missed that, concentrating on finding the method of telling it the delay I 
guess.

So then the next question is, over what period is that obviously lower figure
being averaged?  Certainly not over a 1 second period else it would then
be much higher, as seen by the figures after the initial delay.  The time 
slice spec'd in /proc/sys/kernel/sched_granularity_ns, which here is 
currently 500 or 5 milliseconds?  If that was the case, the first answer 
would be in the area of 15, not 200.

So educate me, off list if you would like and have the time.

Thanks Niel.

-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
Sweet sixteen is beautiful Bess,
And her voice is changing -- from No to Yes.
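
The vmstat man page excerpt quoted in this message distinguishes two kinds of figures, and the difference is just which time span the counter is divided by. A minimal sketch, assuming the context switch count comes from a cumulative counter such as the "ctxt" line of /proc/stat; the function names are made up for illustration and are not taken from the vmstat source:

```c
#include <assert.h>

/* First report: average context switches per second since boot.
 * total_ctxt is a cumulative counter, uptime_s is seconds since boot. */
double rate_since_boot(unsigned long long total_ctxt, double uptime_s)
{
    assert(uptime_s > 0.0);
    return (double)total_ctxt / uptime_s;
}

/* Later reports: context switches per second over one sampling period,
 * computed from two readings of the same counter taken delay_s apart. */
double rate_over_interval(unsigned long long prev_ctxt,
                          unsigned long long cur_ctxt,
                          double delay_s)
{
    assert(delay_s > 0.0 && cur_ctxt >= prev_ctxt);
    return (double)(cur_ctxt - prev_ctxt) / delay_s;
}
```

With 3,600,000 switches over 18,000 seconds of uptime the first function gives 200 per second, while 3,000 switches during a one second sample gives 3,000 per second. A long idle uptime pulls the first figure down, which would explain a low first row followed by much higher ones.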


Re: [PATCH] mm: PageLRU can be non-atomic bit operation

2007-04-23 Thread Hisashi Hifumi


At 22:42 07/04/23, Hugh Dickins wrote:
On Mon, 23 Apr 2007, Hisashi Hifumi wrote:
 No.  The PG_lru flag bit is just one bit amongst many others:
 what of concurrent operations changing other bits in that same
 unsigned long e.g. trying to lock the page by setting PG_locked?
 There are some places where such micro-optimizations can be made
 (typically while first allocating the page); but in general, no.

 In i386 and x86_64, btsl is used to change page flag. In this case, if btsl
 without lock prefix set PG_locked and PG_lru flag concurrently, does only
 one operation succeed ?

That's right: on an SMP machine, without the lock prefix, the operation
is no longer atomic: what's stored back may be missing the result of
one or the other of the racing operations.


If the same bit is being changed concurrently, a lock prefix or some other
spinlock is needed. But I think that concurrent bit operations on different
bits are just like an OR operation, so the lock prefix is not needed.

The AMD instruction manual says this about bts:

  Copies a bit, specified by bit index in a register or 8-bit immediate
  value (second operand), from a bit string (first operand), also called
  the bit base, to the carry flag (CF) of the rFLAGS register, and then
  sets the bit in the bit string to 1.

The BTS instruction is a read-modify-write instruction on a single bit. So
concurrent bit operations on different bits may be possible.
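
The lost update that Hugh describes can be replayed by hand. The sketch below is a single-threaded simulation, not real SMP code: it walks through one bad interleaving of two unlocked read-modify-write sequences on the same word. The bit numbers and the function name are illustrative assumptions.

```c
#include <assert.h>

enum { DEMO_PG_LOCKED = 0, DEMO_PG_LRU = 5 };   /* illustrative bit numbers */

/* Replays one bad interleaving of two unlocked read-modify-write
 * sequences on the same word and returns what ends up in memory. */
unsigned long racy_set_two_bits(unsigned long flags)
{
    unsigned long cpu0 = flags;         /* CPU 0 reads the word */
    unsigned long cpu1 = flags;         /* CPU 1 reads the same old value */

    cpu0 |= 1UL << DEMO_PG_LOCKED;      /* CPU 0 sets its bit locally */
    cpu1 |= 1UL << DEMO_PG_LRU;         /* CPU 1 sets its bit locally */

    flags = cpu0;                       /* CPU 0 writes back */
    flags = cpu1;                       /* CPU 1 writes back, erasing CPU 0's bit */
    return flags;
}
```

Starting from 0, the result has only the second bit set. Each operation targets a different bit, but the unit that is read and written back is the whole word, so the later store discards the earlier one.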



Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

This should fix the MADV_FREE code for PPC's hashed tlb.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---

Nick Piggin wrote:

Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them



Yes, but I'm wondering if it is legal in all architectures.



It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.


The question is whether the architecture specific tlb
flushing code will break or not.


I guess we'll need to call tlb_remove_tlb_entry() inside the
MADV_FREE code to keep powerpc happy.

Thanks for pointing this one out.


Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.


What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.


Oh dear.  I see it now...

The tlb end things inside zap_pte_range() are actually
noops and the actual tlb flush only happens inside
zap_page_range().

I guess the fact that munmap gets the mmap_sem for
writing should save us, though...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 	}
 	ptep_test_and_clear_dirty(vma, addr, pte);
 	ptep_test_and_clear_young(vma, addr, pte);
+	tlb_remove_tlb_entry(tlb, pte, addr);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread hui
On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
 I don't know if we've discussed this or not. Since both CFS and SD claim
 to be fair, I'd like to hear more opinions on the fairness aspect of
 these designs. In areas such as OS, networking, and real-time, fairness,
 and its more general form, proportional fairness, are well-defined
 terms. In fact, perfect fairness is not feasible since it requires all
 runnable threads to be running simultaneously and scheduled with
 infinitesimally small quanta (like a fluid system). So to evaluate if a

Unfortunately, fairness is rather informal in this context and probably
isn't strictly desirable given how hackish much of Linux userspace is.
Until there's a method of doing directed yields, like what Will has
prescribed (a kind of allotment given to a thread doing work for another),
a completely strict mechanism is probably problematic with regard to
corner cases.

X, for example, is largely not thread-safe. Until they can get their xcb
framework in place, along with additional thread infrastructure to do
hand-offs properly, it's going to be difficult to schedule for it. It's
well known to be problematic.

You announced your scheduler without CCing any of the relevant people here
(and risk being completely ignored in lkml traffic):

http://lkml.org/lkml/2007/4/20/286

What is your opinion of both CFS and SDL? How can your work be useful
to either scheduler mentioned or to the Linux kernel on its own?

 I understand that via experiments we can show a design is reasonably
 fair in the common case, but IMHO, to claim that a design is fair, there
 needs to be some kind of formal analysis on the fairness bound, and this
 bound should be proven to be constant. Even if the bound is not
 constant, at least this analysis can help us better understand and
 predict the degree of fairness that users would experience (e.g., would
 the system be less fair if the number of threads increases? What happens
 if a large number of threads dynamically join and leave the system?).

Will has been thinking about this, but you have to also consider the
practicalities of your approach versus Con's and Ingo's.

I'm all for things like proportional scheduling and the extensions
needed to do it properly. It would be highly relevant to some version
of the -rt patch if not that patch directly.

bill
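
One simple way to put a number on the fairness question in the quoted text is to compare what each thread actually received with its ideal proportional share, and take the worst deviation. This is only an illustrative sketch: `max_lag`, the weights, and the choice of absolute lag as the metric are assumptions, not anything taken from CFS or SD.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute difference, over all threads, between the CPU time a
 * thread received and its ideal weighted share of the total CPU time. */
double max_lag(const double *received, const double *weight, size_t n)
{
    double total_time = 0.0, total_weight = 0.0, worst = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        total_time += received[i];
        total_weight += weight[i];
    }
    assert(total_weight > 0.0);

    for (i = 0; i < n; i++) {
        double ideal = total_time * weight[i] / total_weight;
        double lag = received[i] - ideal;
        if (lag < 0.0)
            lag = -lag;
        if (lag > worst)
            worst = lag;
    }
    return worst;
}
```

A bound that stays constant as threads are added would mean this value does not grow with n; measuring it across workloads is an experiment, not the formal proof the quoted text asks for.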



Re: AppArmor FAQ

2007-04-23 Thread Joshua Brindle

Crispin Cowan wrote:

David Wagner wrote:
  

James Morris  wrote:
  

[...] you can change the behavior of the application and then bypass 
policy entirely by utilizing any mechanism other than direct filesystem 
access: IPC, shared memory, Unix domain sockets, local IP networking, 
remote networking etc.

  

[...]
  


Just look at their code and their own description of AppArmor.

  

My gosh, you're right.  What the heck?  With all due respect to the
developers of AppArmor, I can't help thinking that that's pretty lame.
I think this raises substantial questions about the value of AppArmor.
What is the point of having a jail if it leaves gaping holes that
malicious code could use to escape?

And why isn't this documented clearly, with the implications fully
explained?

I would like to hear the AppArmor developers defend this design decision.
  


It was a simplicity trade off at the time, when AppArmor was mostly
aimed at servers, and there was no HAL or DBUS. Now it is definitely a
limitation that we are addressing. We are working on a mediation system
for what kind of IPC a confined process can do
http://forge.novell.com/pipermail/apparmor-dev/2007-April/000503.html

  
Except servers use IPC and need this access control as well. Without IPC 
and network restrictions you can't protect database servers, ldap 
servers, print servers, ssh agents, virus scanning servers, spam 
scanning servers, etc from attackers with knowledge of how to abuse the IPC.

When our IPC mediation system is code instead of vapor, it will also
appear here for review. Meanwhile, AppArmor does not make IPC security
any worse, confined processes are still subject to the usual Linux IPC
restrictions. AppArmor actually makes the IPC situation somewhat more
secure than stock Linux, e.g. normal DBUS deployment can be controlled
through file access permissions. But we are not claiming AppArmor to be
an IPC security enhancement, yet.
  
Without a security interface in DBUS similar to SELinux's, AppArmor won't
be able to control who can talk to whom across DBUS, only who can connect
to DBUS directly.



Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-23 Thread Eric W. Biederman
Benjamin Herrenschmidt [EMAIL PROTECTED] writes:

 The only reason for using threads here is to get the error recovery
 out of an interrupt context (where errors may be detected), and then,
 an hour later, decrement a counter (which is how we limit these to 
 6 per hour). Thread reaping is trivial, the thread just exits
 after an hour.

 In addition, it should be a thread and not done from within keventd
 because :

  - It can take a long time (well, relatively but still too long for a
 work queue)

  - The driver callbacks might need to use keventd or do flush_workqueue
 to synchronize with their own workqueues when doing an internal
 recovery.

 Since these events are rare, I've no particular concern about
 performance or resource consumption. The current code seems 
 to work just fine. :-)

 I think moving to kthread's is cleaner (just a wrapper around kernel
 threads that simplify dealing with reaping them out mostly) and I agree
 with Christoph that it would be nice to be able to fire off kthreads
 from interrupt context. In many cases, we abuse work queues for things
 that should really be done from kthreads instead (basically anything that
 takes more than a couple hundred microsecs or so).

On that note, does anyone have a problem if we manage the irq spawning
safe kthreads the same way that we manage the work queue entries.

i.e. by a structure allocated by the caller?

Eric
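
The "structure allocated by the caller" idea can be sketched in plain userspace C. The request node is embedded in a structure the caller already owns, so queueing a request allocates nothing and cannot fail for lack of memory, which is what makes the pattern attractive from interrupt context. All names here (`spawn_req`, `spawn_queue`, `demo_event`) are hypothetical, and the locking a real kernel version would need is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request node; the caller embeds it in its own structure. */
struct spawn_req {
    struct spawn_req *next;
    void (*fn)(struct spawn_req *req);
};

struct spawn_queue {
    struct spawn_req *head, *tail;
};

/* Queue a request. Nothing is allocated here. */
void spawn_queue_add(struct spawn_queue *q, struct spawn_req *req,
                     void (*fn)(struct spawn_req *req))
{
    req->fn = fn;
    req->next = NULL;
    if (q->tail)
        q->tail->next = req;
    else
        q->head = req;
    q->tail = req;
}

/* Run everything queued, in order. Stands in for the process-context
 * side that would create the threads later. Returns the number run. */
int spawn_queue_run(struct spawn_queue *q)
{
    struct spawn_req *req = q->head;
    int n = 0;

    q->head = q->tail = NULL;
    while (req) {
        struct spawn_req *next = req->next;  /* fn may reuse or free req */
        req->fn(req);
        req = next;
        n++;
    }
    return n;
}

/* Example owner structure with the request node embedded in it. */
struct demo_event {
    int recovered;
    struct spawn_req req;
};

/* Recover the owning structure from the embedded node. */
void demo_recover(struct spawn_req *req)
{
    struct demo_event *ev =
        (struct demo_event *)((char *)req - offsetof(struct demo_event, req));
    ev->recovered = 1;
}
```

The cost of this design is that the caller must keep the structure alive until the callback has run, which is the lifetime question that comes up in the replies.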


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Rik van Riel wrote:

This should fix the MADV_FREE code for PPC's hashed tlb.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---

Nick Piggin wrote:


Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them




Yes, but I'm wondering if it is legal in all architectures.




It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.



The question is whether the architecture specific tlb
flushing code will break or not.



I guess we'll need to call tlb_remove_tlb_entry() inside the
MADV_FREE code to keep powerpc happy.

Thanks for pointing this one out.


Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.



What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.



Oh dear.  I see it now...

The tlb end things inside zap_pte_range() are actually
noops and the actual tlb flush only happens inside
zap_page_range().

I guess the fact that munmap gets the mmap_sem for
writing should save us, though...


What about an unmap_mapping_range, or another MADV_FREE or
MADV_DONTNEED?






--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 	}
 	ptep_test_and_clear_dirty(vma, addr, pte);
 	ptep_test_and_clear_young(vma, addr, pte);
+	tlb_remove_tlb_entry(tlb, pte, addr);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);



--
SUSE Labs, Novell Inc.


Re: Remove open coded implementations of memclear_highpage flush

2007-04-23 Thread Satyam Sharma

On 4/24/07, Christoph Lameter [EMAIL PROTECTED] wrote:

There is a series of open-coded reimplementations of memclear_highpage_flush
all over the page cache code. Call memclear_highpage_flush in those locations.
Consolidates code and eases maintenance.


If I remember right, a very similar patchset was recently submitted
that Andrew merged in -mm(?). It also renamed memclear_highpage_flush
to something like zero_user_page (though I wonder how good a name that
is considering it takes an offset and not the whole page) and
deprecated the old name.


Re: Remove open coded implementations of memclear_highpage flush

2007-04-23 Thread Andrew Morton
On Tue, 24 Apr 2007 07:49:45 +0530 Satyam Sharma [EMAIL PROTECTED] wrote:

 On 4/24/07, Christoph Lameter [EMAIL PROTECTED] wrote:
  There is a series of open-coded reimplementations of memclear_highpage_flush
  all over the page cache code. Call memclear_highpage_flush in those 
  locations.
  Consolidates code and eases maintenance.
 
 If I remember right, a very similar patchset was recently submitted
 that Andrew merged in -mm(?).

yup.

 It also renamed memclear_highpage_flush
 to something like zero_user_page (though I wonder how good a name that
 is considering it takes an offset and not the whole page)

It's not a great name, but the fact that you must provide it with `offset'
and `length' arguments rather clears up any confusion ;)
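
As a rough userspace analogue of a helper that takes `offset' and `length', the sketch below zeroes only a sub-range of a page-sized buffer. The name `zero_page_range` and the fixed 4096-byte page size are assumptions made for illustration; the kernel helper discussed here also has to map the page and flush the data cache, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u

/* Zero `length` bytes starting `offset` bytes into a page-sized buffer.
 * Returns 0 on success, -1 if the range does not fit inside the page. */
int zero_page_range(unsigned char *page, size_t offset, size_t length)
{
    if (offset > DEMO_PAGE_SIZE || length > DEMO_PAGE_SIZE - offset)
        return -1;
    memset(page + offset, 0, length);
    return 0;
}
```

Bytes outside the requested range are left untouched, which is the behaviour the two arguments make explicit.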



Re: Remove open coded implementations of memclear_highpage flush

2007-04-23 Thread Christoph Lameter
On Tue, 24 Apr 2007, Satyam Sharma wrote:

 On 4/24/07, Christoph Lameter [EMAIL PROTECTED] wrote:
  There is a series of open-coded reimplementations of memclear_highpage_flush
  all over the page cache code. Call memclear_highpage_flush in those
  locations.
  Consolidates code and eases maintenance.
 
 If I remember right, a very similar patchset was recently submitted
 that Andrew merged in -mm(?). It also renamed memclear_highpage_flush
 to something like zero_user_page (though I wonder how good a name that
 is considering it takes an offset and not the whole page) and
 deprecated the old name.

My latest tree from Andrew does not have any of this. URL of patch?
 


Re: [PATCH] mm: PageLRU can be non-atomic bit operation

2007-04-23 Thread KAMEZAWA Hiroyuki
On Tue, 24 Apr 2007 10:54:27 +0900
Hisashi Hifumi [EMAIL PROTECTED] wrote:
 In the case that changing the same bit concurrently, lock prefix or other
 spinlock is needed. But, I think that concurrent bit operation on different 
 bits
 is just like OR operation , so lock prefix is not needed.
 
 AMD instruction manual says about bts that ,
 
 Copies a bit, specified by bit index in a register or 8-bit immediate 
 value (second operand), from a bit
 string (first operand), also called the bit base, to the carry flag (CF) of 
 the rFLAGS register, and then
 sets the bit in the bit string to 1.
 
 BTS instruction is read-modify-write instruction on bit unit. So concurrent 
 bit operation on different
 bits may be possible.
 
This is ia64's __set_bit() hehe..
==
static __inline__ void
__set_bit (int nr, volatile void *addr)
{
*((__u32 *) addr + (nr >> 5)) |= (1 << (nr & 31));
}
==

Bye.
-Kame
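
The helper quoted in this message is an ordinary C read-modify-write: load a 32-bit word, OR in one bit, store the word back. A standalone version that can be compiled and tested in userspace might look like the sketch below; `demo_set_bit` is a made-up name and `uint32_t` stands in for the kernel's `__u32`.

```c
#include <assert.h>
#include <stdint.h>

/* Non-atomic bit set in an array of 32-bit words:
 * nr >> 5 selects the word (nr / 32), nr & 31 the bit inside it. */
void demo_set_bit(int nr, uint32_t *addr)
{
    addr[nr >> 5] |= (uint32_t)1 << (nr & 31);
}
```

Nothing in this code makes the load and the store a single indivisible step, so two CPUs setting different bits of the same word can still overwrite each other's result.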



Re: [ANNOUNCE][PATCH] Kcli - Kernel command line interface.

2007-04-23 Thread Satyam Sharma

Hi Matt,

On 4/24/07, Matt Ranon [EMAIL PROTECTED] wrote:

 The obvious question is: what's _wrong_ with doing all this in some
 cut-down userspace environment like busybox?  Why is this stuff better?

 Obviously some embedded developers have considered that option and
 have rejected it.  But we do need to be told, at length, why that
 decision was made.

There is nothing _wrong_ with doing it all in a cut-down userspace. It
is a matter of personal preference, culture, and the application. That
is what makes Linux so great, it is all about choice.

We are developing devices that don't have a user space, and we don't
see the point in including one just for debug purposes. We will not be
offended if Kcli is not included into the kernel mainline, nor if Kcli compels
people to call us stupid (as it already has) just because we are different
and some people don't understand us. We are firm believers that the
world, including the Linux kernel world,  would be a nasty place if there
was only _one_ way to do any given task. Additionally, we  are almost
certain that there will be others who think like we do, so we are reaching
out to them. We also feel compelled to give _something_ back to the
community that has given so much to us, and, for now, this is all we have.


I'm afraid you might've misunderstood the (rather caustic, sometimes)
general nature of comments on lkml :-) But I guess you only have
everything to gain if you use features that have been developed (and
are being *maintained* in the current kernel) that already do the kind
of stuff you want done.

You might have your reasons for being so anxious to avoid any
userspace at all, but quoting famous words, continuing to maintain
Kcli out-of-tree could soon turn out to be an act of
self-flagellation for you :-)


Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-23 Thread Benjamin Herrenschmidt
On Mon, 2007-04-23 at 20:08 -0600, Eric W. Biederman wrote:
 Benjamin Herrenschmidt [EMAIL PROTECTED] writes:
 
  The only reason for using threads here is to get the error recovery
  out of an interrupt context (where errors may be detected), and then,
  an hour later, decrement a counter (which is how we limit these to 
  6 per hour). Thread reaping is trivial, the thread just exits
  after an hour.
 
  In addition, it should be a thread and not done from within keventd
  because :
 
   - It can take a long time (well, relatively but still too long for a
  work queue)
 
   - The driver callbacks might need to use keventd or do flush_workqueue
  to synchronize with their own workqueues when doing an internal
  recovery.
 
  Since these events are rare, I've no particular concern about
  performance or resource consumption. The current code seems 
  to work just fine. :-)
 
  I think moving to kthread's is cleaner (just a wrapper around kernel
  threads that simplify dealing with reaping them out mostly) and I agree
  with Christoph that it would be nice to be able to fire off kthreads
  from interrupt context. In many cases, we abuse work queues for things
  that should really be done from kthreads instead (basically anything that
  takes more than a couple hundred microsecs or so).
 
 On that note, does anyone have a problem if we manage the irq spawning
 safe kthreads the same way that we manage the work queue entries.
 
 i.e. by a structure allocated by the caller?

Not sure... I can see places where I might want to spawn an arbitrary
number of these without having to preallocate structures... and if I
allocate on the fly, then I need a way to free that structure when the
kthread is reaped which I don't think we have currently, do we ? (In
fact, I could use that for other things too now that I'm thinking of
it ... I might have a go at providing optional kthread destructors).

Ben.




Re: [PATCH] change kernel threads to ignore signals instead of blocking them

2007-04-23 Thread Andrew Morton
On Fri, 13 Apr 2007 11:31:16 +0400 Oleg Nesterov [EMAIL PROTECTED] wrote:

 On top of Eric's
 
   kthread-dont-depend-on-work-queues-take-2.patch
 
 Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against 
 signals.
 This doesn't prevent the signal delivery, this only blocks signal_wake_up().
 Every killall -33 kthreadd means a struct siginfo leak.
 
 Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking 
 them
 (make a new helper ignore_signals() for that). If the kernel thread needs some
 signal, it should use allow_signal() anyway, and in that case it should not 
 use
 CLONE_SIGHAND.
 
 Note that we can't change daemonize() (should die!) in the same way, because
 it can be used along with CLONE_SIGHAND. This means that allow_signal() still
 should unblock the signal to work correctly with daemonize()ed threads.
 
 However, disallow_signal() doesn't block the signal any longer but ignores it.
 
 NOTE: with or without this patch the kernel threads are not protected from
 handle_stop_signal(), this seems harmless, but not good.

I'm seeing 500 zombied instances of khelper (from udev startup).  It only
happens when the utrace patches are applied.  Presumably an interaction
between utrace and one of these kthread changes.

I'll drop utrace for now.  I don't think it's getting much help from being
in -mm at present and it's getting increasingly painful to keep it merged
against all the other stuff which is happening.

Roland, I'll squirt all the extra utrace patches which I have in your direction.
Please merge them or hang on to them for later on.


Re: [PATCH] mm: PageLRU can be non-atomic bit operation

2007-04-23 Thread Nick Piggin

Hisashi Hifumi wrote:


At 22:42 07/04/23, Hugh Dickins wrote:
 On Mon, 23 Apr 2007, Hisashi Hifumi wrote:
  No.  The PG_lru flag bit is just one bit amongst many others:
  what of concurrent operations changing other bits in that same
  unsigned long e.g. trying to lock the page by setting PG_locked?
  There are some places where such micro-optimizations can be made
  (typically while first allocating the page); but in general, no.
 
  In i386 and x86_64, btsl is used to change page flag. In this case, if
  btsl without lock prefix set PG_locked and PG_lru flag concurrently,
  does only one operation succeed ?
 
 That's right: on an SMP machine, without the lock prefix, the operation
 is no longer atomic: what's stored back may be missing the result of
 one or the other of the racing operations.
 

If the same bit is being changed concurrently, a lock prefix or some other
spinlock is needed. But I think that concurrent bit operations on different
bits are just like an OR operation, so the lock prefix is not needed.

The AMD instruction manual says this about bts:

  Copies a bit, specified by bit index in a register or 8-bit immediate
  value (second operand), from a bit string (first operand), also called
  the bit base, to the carry flag (CF) of the rFLAGS register, and then
  sets the bit in the bit string to 1.

The BTS instruction is a read-modify-write instruction on a single bit. So
concurrent bit operations on different bits may be possible.



No matter what actual instruction is used, the SetPageLRU operation (ie.
without the double underscore prefix) must be atomic, and the __SetPageLRU
operation *can* be non-atomic if that would be faster.

As Hugh points out, we must have atomic ops here, so changing the generic
code to use the __ version is wrong. However if there is a faster way that
i386 can perform the atomic variant, then doing so will speed up the generic
code without breaking other architectures.

--
SUSE Labs, Novell Inc.
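
The naming convention described in this message, an atomic operation without the double underscore and an optional non-atomic one with it, can be mirrored in userspace with C11 atomics. This is a sketch of the convention only, with made-up names; it does not show how the kernel page flag macros are implemented.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomic variant: the whole read-modify-write is one indivisible step,
 * so concurrent setters of other bits in the same word are not lost. */
void demo_set_flag(atomic_ulong *flags, unsigned bit)
{
    atomic_fetch_or(flags, 1UL << bit);
}

/* Non-atomic variant: a separate load and store. Only safe while no one
 * else can touch the word, for example on a freshly allocated object. */
void demo_set_flag_nonatomic(atomic_ulong *flags, unsigned bit)
{
    unsigned long v = atomic_load_explicit(flags, memory_order_relaxed);
    atomic_store_explicit(flags, v | (1UL << bit), memory_order_relaxed);
}
```

Both produce the same result when called from a single thread. The difference only shows up under concurrency, which is why generic code has to use the atomic form unless it can prove exclusive access.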
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Nick Piggin wrote:


What the tlb flush used to be able to assume is that the page
has been removed from the pagetables when they are put in the
tlb flush batch.


I think this is still the case, to a degree.  There should be
no harm in removing the TLB entries after the page table has
been unlocked, right?

Or is something like the attached really needed?

From what I can see, the page table lock should be enough
synchronization between unmap_mapping_range, MADV_FREE and
MADV_DONTNEED.

I don't see why we need the attached, but in case you find
a good reason, here's my signed-off-by line for Andrew :)

Signed-off-by: Rik van Riel [EMAIL PROTECTED]

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/mm/memory.c.flushme	2007-04-23 22:26:06.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 22:42:06.0 -0400
@@ -628,6 +628,7 @@ static unsigned long zap_pte_range(struc
 long *zap_work, struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
+	unsigned long start_addr = addr;
 	pte_t *pte;
 	spinlock_t *ptl;
 	int file_rss = 0;
@@ -726,6 +727,11 @@ static unsigned long zap_pte_range(struc
 
 	add_mm_rss(mm, file_rss, anon_rss);
 	arch_leave_lazy_mmu_mode();
+	if (details && details->madv_free) {
+		/* Protect against MADV_DONTNEED or unmap_mapping_range */
+		tlb_finish_mmu(tlb, start_addr, addr);
+		tlb = tlb_gather_mmu(mm, 0);
+	}
 	pte_unmap_unlock(pte - 1, ptl);
 
 	return addr;


Re: Remove open coded implementations of memclear_highpage flush

2007-04-23 Thread Satyam Sharma

On 4/24/07, Christoph Lameter [EMAIL PROTECTED] wrote:

On Tue, 24 Apr 2007, Satyam Sharma wrote:
 If I remember right, a very similar patchset was recently submitted
 that Andrew merged in -mm(?). It also renamed memclear_highpage_flush
 to something like zero_user_page (though I wonder how good a name that
 is considering it takes an offset and not the whole page) and
 deprecated the old name.

My latest tree from Andrew does not have any of this. URL of patch?


fs-deprecate-memclear_highpage_flush.patch (and friends, search for
zero_user_page) in
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-04-11-02-24.tar.gz


Re: [mmc] alternative TI FM MMC/SD driver for 2.6.21-rc7

2007-04-23 Thread Alex Dubov
 
 I am not in any way arguing that your driver architecture is wrong or that you
 should change anything. My point was simple. [tifm_sd] can only work with
 [tifm_7xx1]. If you add support for, let's say, [tifm_8xx2] in the future, which
 would have port offsets different from [tifm_7xx1], you would also need
 completely new modules for slots (sd, ms, etc).
 

Doesn't this constitute unbounded speculation? And then, what would you
propose to do with adapters that have SD support disabled? There are quite
a few of those in the wild right now (SD support is provided by a bundled
SDHCI on such systems, if at all). A similar argument goes for other media
types as well: many controllers have xD support disabled too (I think you
have one of those - Sony really values its customers). After all, it is not
healthy to have dead code in the kernel.

On the other hand, if TI puts out a controller which is functionally
identical but has a different register map, it wouldn't be hard to refactor
the code.




Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-23 Thread Neil Brown
On Friday April 20, [EMAIL PROTECTED] wrote:
 Scale writeback cache per backing device, proportional to its writeout speed.

So it works like this:

 We account for writeout in full pages.
 When a page has the Writeback flag cleared, we account that as a
 successfully retired write for the relevant bdi.
 By using floating averages we keep track of how many writes each bdi
 has retired 'recently' where the unit of time in which we understand
 'recently' is a single page written.

 We keep a floating average for each bdi, and a floating average for
 the total writeouts (that 'average' is, of course, 1.)

 Using these numbers we can calculate what fraction of 'recently'
 retired writes were retired by each bdi (get_writeout_scale).

 Multiplying this fraction by the system-wide number of pages that are
 allowed to be dirty before write-throttling, we get the number of
 pages that the bdi can have dirty before write-throttling the bdi.
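
 As a rough illustration of the scheme described so far, here is a simplified userspace model in C. It is only a sketch under stated assumptions: the struct, the function names, the two-device setup and the "halve every period" aging are all invented for the example, and the patch itself uses its own arithmetic (get_writeout_scale and friends) rather than anything shown here.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_BDI 2	/* hypothetical: two backing devices */

struct writeout_model {
	uint64_t done[MODEL_NR_BDI];	/* recently retired pages per bdi */
	uint64_t total;			/* sum of done[] */
	uint64_t period;		/* age the counters every 'period' pages */
};

/* Account one retired page for 'bdi'; halve all counters once per period
 * so that old writeout gradually stops counting as 'recent'. */
static void model_account_writeout(struct writeout_model *m, unsigned int bdi)
{
	assert(bdi < MODEL_NR_BDI);
	assert(m->period > 0);

	m->done[bdi]++;
	m->total++;

	if (m->total >= m->period) {
		m->total = 0;
		for (unsigned int i = 0; i < MODEL_NR_BDI; i++) {
			m->done[i] /= 2;
			m->total += m->done[i];
		}
	}
}

/* Share of the system-wide dirty threshold that 'bdi' may use:
 * dirty_thresh scaled by the bdi's fraction of recent writeout. */
static uint64_t model_bdi_thresh(const struct writeout_model *m,
				 unsigned int bdi, uint64_t dirty_thresh)
{
	assert(bdi < MODEL_NR_BDI);

	if (m->total == 0)
		return 0;	/* no recent writeout at all */
	return dirty_thresh * m->done[bdi] / m->total;
}
```

 With 30 pages retired by one device and 10 by the other, a system-wide threshold of 4000 pages splits 3000/1000 in this model. Note that a device with no recent writeout gets a threshold of 0 here, which is the starting transient that matters when asking whether the same fraction should apply to background_thresh.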

 I note that the same fraction is *not* applied to background_thresh.
 Should it be?  I guess not - there would be interesting starting
 transients, as a bdi which had done no writeout would not be allowed
 any dirty pages, so background writeout would start immediately,
 which isn't what you want... or is it?

 For each bdi we also track the number of (dirty, writeback, unstable)
 pages and do not allow this to exceed the limit set for this bdi.

 The calculations involving 'reserve' in get_dirty_limits are a little
 confusing.  It looks like you are calculating how much total head-room
 there is for the bdi (pages that the system can still dirty - pages
 this bdi has dirty) and making sure the number returned in pbdi_dirty
 doesn't allow more than that to be used.  This is probably a
 reasonable thing to do but it doesn't feel like the right place.  I
 think get_dirty_limits should return the raw threshold, and
 balance_dirty_pages should do both tests - the bdi-local test and the
 system-wide test.

 Currently you have a rather odd situation where
+   if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+       break;
 might include numbers obtained with bdi_stat_sum being compared with
 numbers obtained with bdi_stat.


 With these patches, the VM still (I think) assumes that each BDI has
 a reasonable queue limit, so that writeback_inodes will block on a
 full queue.  If a BDI has a very large queue, balance_dirty_pages
 will simply turn lots of DIRTY pages into WRITEBACK pages and then
 think "We've done our duty" without actually blocking at all.

 With the extra accounting that we now have, I would like to see
 balance_dirty_pages wait until RECLAIMABLE+WRITEBACK is
 actually less than 'threshold'.  This would probably mean that we
 would need to support per-bdi background_writeout to smooth things
 out.  Maybe that is fodder for another patch-set.

 You set:
+   vm_cycle_shift = 1 + ilog2(vm_total_pages);

 Can you explain that?  My experience is that scaling dirty limits
 with main memory isn't what we really want.  When you get machines
 with very large memory, the amount that you want to be dirty is more
 a function of the speed of your IO devices, rather than the amount
 of memory, otherwise you can sometimes see large filesystem lags
 ('sync' taking minutes?)

 I wonder if it makes sense to try to limit the dirty data for a bdi
 to the amount that it can write out in some period of time - maybe 3
 seconds.  Probably configurable.  You seem to have almost all the
 infrastructure in place to do that, and I think it could be a
 valuable feature.
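
 The time-based cap suggested above could be sketched like this. This is a hypothetical userspace illustration, not code from the patch set: the function names and the millisecond units are assumptions, and the 3-second window is just the figure floated above.

```c
#include <assert.h>
#include <stdint.h>

/* Pages the device could write out in 'window_ms', estimated from the
 * pages it actually retired over the last 'elapsed_ms'. */
static uint64_t model_time_limit(uint64_t pages_retired, uint64_t elapsed_ms,
				 uint64_t window_ms)
{
	assert(elapsed_ms > 0);
	return pages_retired * window_ms / elapsed_ms;
}

/* Never allow more dirty pages than the device can flush in the window,
 * whatever the memory-proportional threshold says. */
static uint64_t model_clamp_thresh(uint64_t bdi_thresh, uint64_t time_limit)
{
	return bdi_thresh < time_limit ? bdi_thresh : time_limit;
}
```

 For example, a device that retired 5000 pages in the last second would be capped at 15000 dirty pages for a 3000 ms window, regardless of how much memory the machine has.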

 At least, I think vm_cycle_shift should be tied (loosely) to 
   dirty_ratio * vm_total_pages
 ??

On the whole, looks good!

Thanks,
NeilBrown


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Andrew Morton
On Mon, 23 Apr 2007 22:53:49 -0400 Rik van Riel [EMAIL PROTECTED] wrote:

 I don't see why we need the attached, but in case you find
 a good reason, here's my signed-off-by line for Andrew :)

Andrew is in a defensive crouch trying to work his way through all the bugs
he's been sent.  After I've managed to release 2.6.21-rc7-mm1 (say, December)
I expect I'll drop the MADV_FREE stuff, give you a run at creating a new
patch series.


Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-23 Thread Eric W. Biederman
Benjamin Herrenschmidt [EMAIL PROTECTED] writes:

 Not sure... I can see places where I might want to spawn an arbitrary
 number of these without having to preallocate structures... and if I
 allocate on the fly, then I need a way to free that structure when the
 kthread is reaped which I don't think we have currently, do we ? (In
 fact, I could use that for other things too now that I'm thinking of
 it ... I might have a go at providing optional kthread destructors).

Well, the basic problem is that for any piece of code that can be modular
we need a way to ensure all threads it has running are shut down when we
remove the module.

Which means a fire-and-forget model, however simple, is unfortunately
the wrong thing.

Now we might be able to wrap this in some kind of manager construct,
so you don't have to manage each thread individually, but we still
have the problem of ensuring all of the threads exit when we terminate
the module.

Further, in general it doesn't make sense to grab a module reference
and call that sufficient because we would like to request that the
module exits.

Eric


Re: [PATCH 23/25] xen: Lockdep fixes for xen-netfront

2007-04-23 Thread Herbert Xu
Jeremy Fitzhardinge [EMAIL PROTECTED] wrote:

 @@ -1212,10 +1212,10 @@ static int netif_poll(struct net_device 
        int pages_flipped = 0;
        int err;
 
 -      spin_lock(&np->rx_lock);
 +      spin_lock_bh(&np->rx_lock);
 
        if (unlikely(!netfront_carrier_ok(np))) {
 -              spin_unlock(&np->rx_lock);
 +              spin_unlock_bh(&np->rx_lock);

You don't need to disable BH in netif_poll since it's always called
with BH disabled.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

2007-04-23 Thread Herbert Xu
Jiri Kosina [EMAIL PROTECTED] wrote:
 
 Hmm, *sigh*. I guess the patch below fixes the problem, but it is a 
 masterpiece in the field of ugliness. And I am not sure whether it is 
 completely correct either. Are there any immediate ideas for better 
 solution with respect to how struct sock locking works?

Please cc such patches to netdev.  Thanks.

 diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
 index 71f5cfb..c5c93cd 100644
 --- a/net/bluetooth/hci_sock.c
 +++ b/net/bluetooth/hci_sock.c
 @@ -656,7 +656,10 @@ static int hci_sock_dev_event(struct notifier_block *this, unsigned long event,
        /* Detach sockets from device */
        read_lock(&hci_sk_list.lock);
        sk_for_each(sk, node, &hci_sk_list.head) {
 -              lock_sock(sk);
 +              if (in_atomic())
 +                      bh_lock_sock(sk);
 +              else
 +                      lock_sock(sk);

This doesn't do what you think it does.  bh_lock_sock can still succeed
even with lock_sock held by someone else.
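
A small single-threaded model in C may help show why. This is a sketch with invented names (model_sock and the model_* helpers), not the kernel's code: it assumes the usual arrangement where lock_sock() marks the socket as owned under a short-term spinlock and then drops that spinlock, while bh_lock_sock() only takes the spinlock and never looks at the ownership flag.

```c
#include <assert.h>
#include <stdbool.h>

struct model_sock {
	bool slock_held;	/* models the short-term spinlock */
	bool owned;		/* models "a process currently owns this socket" */
};

/* Process context: take the spinlock, mark the socket owned, drop the
 * spinlock again. Returns false where the real call would sleep. */
static bool model_lock_sock(struct model_sock *sk)
{
	assert(!sk->slock_held);	/* single-threaded model: never contended */
	sk->slock_held = true;
	bool got_it = !sk->owned;
	if (got_it)
		sk->owned = true;
	sk->slock_held = false;
	return got_it;
}

static void model_release_sock(struct model_sock *sk)
{
	sk->owned = false;
}

/* Softirq-style lock: only the spinlock, 'owned' is not consulted.
 * Returns false where the real call would spin. */
static bool model_bh_lock_sock(struct model_sock *sk)
{
	if (sk->slock_held)
		return false;
	sk->slock_held = true;
	return true;
}

static void model_bh_unlock_sock(struct model_sock *sk)
{
	sk->slock_held = false;
}

/* What a softirq path has to check itself after model_bh_lock_sock(). */
static bool model_sock_owned_by_user(const struct model_sock *sk)
{
	return sk->owned;
}
```

In this model, model_bh_lock_sock() succeeds right after model_lock_sock() has succeeded, so the two calls do not exclude each other. Code taking the bh variant has to test ownership explicitly and defer its work if a process holds the socket.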

Does this need to occur immediately when an event occurs? If not I'd
suggest moving this into a workqueue.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Peter Williams

Linus Torvalds wrote:


On Mon, 23 Apr 2007, Ingo Molnar wrote:
The "give scheduler money" transaction can be both an implicit 
transaction (for example when writing to UNIX domain sockets or 
blocking on a pipe, etc.), or it could be an explicit transaction: 
sched_yield_to(). This latter i've already implemented for CFS, but it's 
much less useful than the really significant implicit ones, the ones 
which will help X.


Yes. It would be wonderful to get it working automatically, so please say 
something about the implementation..


The perfect situation would be that when somebody goes to sleep, any 
extra points it had could be given to whoever it woke up last. Note that 
for something like X, it means that the points are 100% ephemeral: it gets 
points when a client sends it a request, but it would *lose* the points 
again when it sends the reply!


So it would only accumulate scheduling points while multiple clients 
are actively waiting for it, which actually sounds like exactly the right 
thing. However, I don't really see how to do it well, especially since the 
kernel cannot actually match up the client that gave some scheduling 
points to the reply that X sends back.


There are subtle semantics with these kinds of things: especially if the 
scheduling points are only awarded when a process goes to sleep, if X is 
busy and continues to use the CPU (for another client), it wouldn't give 
any scheduling points back to clients and they really do accumulate with 
the server. Which again sounds like it would be exactly the right thing 
(both in the sense that the server that runs more gets more points, but 
also in the sense that we *only* give points at actual scheduling events).
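
The donation rule described above can be modelled in a few lines of C. This is a toy sketch with invented names (toy_task, toy_wake, toy_sleep), not scheduler code: a task remembers whom it woke last, and when it goes to sleep it hands all of its points to that task.

```c
#include <assert.h>

#define NO_TASK (-1)

struct toy_task {
	int points;	/* extra scheduling points currently held */
	int last_woken;	/* index of the task this one woke last, or NO_TASK */
};

/* 'waker' wakes 'wakee': only remember who was woken, no points move yet. */
static void toy_wake(struct toy_task *tasks, int waker, int wakee)
{
	assert(waker != wakee);
	tasks[waker].last_woken = wakee;
}

/* 'self' goes to sleep: points move only at this scheduling event. */
static void toy_sleep(struct toy_task *tasks, int self)
{
	int heir = tasks[self].last_woken;

	if (heir == NO_TASK)
		return;
	tasks[heir].points += tasks[self].points;
	tasks[self].points = 0;
	tasks[self].last_woken = NO_TASK;
}
```

With task 0 as a client holding 10 points and task 1 as X: the client wakes X and sleeps, so X holds 10; X wakes the client and sleeps, so the client holds 10 again. If X never sleeps because it stays busy with another client, the points stay with X, which is the accumulation effect described above.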


But how do you actually *give/track* points? A simple "last woken up by 
this process" thing that triggers when it goes to sleep? It might work, 
but on the other hand, especially with more complex things (and networking 
tends to be pretty complex) the actual wakeup may be done by a software 
irq. Do we just say "it ran within the context of X, so we assume X was 
the one that caused it"? It probably would work, but we've generally tried 
very hard to avoid accessing "current" from interrupt context, including 
bh's.


Within reason, it's not the number of clients that X has that causes its 
CPU bandwidth use to skyrocket and cause problems.  It has more to do 
with what type of clients they are.  Most GUIs (even ones that are 
constantly updating visual data, e.g. gkrellm -- I can open quite a 
large number of these without increasing X's CPU usage very much) cause 
very little load on the X server.  The exceptions to this are the 
various terminal emulators (e.g. xterm, gnome-terminal, etc.) when being 
used to run output-intensive command line programs, e.g. try "ls -lR /" 
in an xterm.  The other way (that I've noticed) to make X's CPU bandwidth 
use skyrocket is to grab a large window and wiggle it about a lot, and 
hopefully this doesn't happen a lot, so the problem that needs to be 
addressed is the one caused by text output on xterm and its ilk.


So I think that an elaborate scheme for distributing points between X 
and its clients would be overkill.  A good scheduler will make sure 
other tasks such as audio streamers get CPU when they need it with good 
responsiveness even when X takes off by giving them higher priority 
because their CPU bandwidth use is low.


The one problem that might still be apparent in these cases is the mouse 
becoming jerky while X is working like crazy to spew out text too fast 
for anyone to read.  The only way to fix that is to give X more 
bandwidth, but if it's already running at about 95% of a CPU that's 
unlikely to help.  To fix this you would probably need to modify X so 
that it knows re-rendering the cursor is more important than rendering 
text in an xterm.


In normal circumstances, the re-rendering of the mouse happens quickly 
enough for the user to experience good responsiveness because X's normal 
CPU use is low enough for it to be given high priority.


Just because the O(1) tried this model and failed doesn't mean that the 
model is bad.  O(1) was a flawed implementation of a good model.


Peter
PS Doing a kernel build in an xterm isn't an example of high enough 
output to cause a problem as (on my system) it only raises X's 
consumption from 0 to 2% to 2 to 5%.  The type of output that causes the 
problem is usually flying past too fast to read.

--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce


Re: Permanent Kgdb integration into the kernel - lets get with it. (Dave: How do FreeBSD folks maintain the KGDB stub?)

2007-04-23 Thread Piet Delaney
On Sat, 2007-04-21 at 11:48 +0200, Andi Kleen wrote:
  Lots of people want kgdb.  One person is famously less keen on it, but
  we'll be able to talk him around, as long as the patches aren't daft.
 
 The big question is if the kgdb developers seriously want mainline.
 At least in the past this definitely wasn't the case.

I haven't seen any email from kgdb developers saying they didn't want
kgdb to be part of mainline. 

Happen to have any e-mail demonstrating that? 

It appears to me that:

1. Jason Wessel is putting a lot of effort into that right now.

2. Tom Rini worked hard at this just a few months ago.

3. George Anzinger was working hard at this a year or two ago
   with the mm series and was likely disappointed when it wasn't
   put into the mainline. As I recall, the reason Linus gave
   was that there were two competing patches and he wanted that to
   be resolved before integrating it into the mainline. So
   George worked with Amit at SourceForge over the past year
   or two and it's now integrated.

 
 If they're not open to change requests from mainline reviewers we don't
 even need to bother to start the whole exercise.

What issues are there, or have there been, that you're referring to?

Once KGDB is part of KORG, can't its maintenance and support be
a kernel-wide responsibility? If someone breaks kgdb, shouldn't
that be backed out until the KORG developers fix the problem?
Centralizing the responsibility for KGDB seems like a mistake. I
doubt the FreeBSD folks rip out the KGDB support if a kernel hacker
breaks KGDB and then leave a group of KGDB developers to sort out
the problem. Seems it should be caught as an mm patch, with Andrew tossing
out the patch if it breaks KGDB. Kgdb developers could try to give
Andrew a heads up if this occurs and he didn't notice it.

Once KGDB is integrated, the maintenance should be minimal, and changes
that break KGDB are likely best addressed by the developer that just
broke it. At least that's what I'd think is an optimal approach. Perhaps
Dave O'Brien could tell us how the FreeBSD folks take care of KGDB.

 
 Just putting their stuff onto korg isn't enough.

Yep, and once it's integrated into korg it should finally become
a permanent part of the kernel and I suspect maintained by all
kernel developers. New KGDB features could be developed at SourceForge
but maintaining kernel coherence seems like a global responsibility.
Like running fault injection on your code before checking it in.

Maybe I'm totally out to lunch on this; perhaps Dave O'Brien
can straighten me out if I'm wrong or if the Linux kernel core
responsibility paradigm is incompatible with this.

I'd prefer Linux being just as good as NetBSD with Debugging support; 
current presentations like:

http://foss.in/2005/slides/netbsd-linux.pdf

show our current support as being much worse. Let's fix it.

You developed a kgdb proxy for Keith Owens' kdb and I suspect you
would like to have KGDB be part of the kernel mainline as
long as it's done well. I doubt anyone would argue with that.
 
Perhaps it's possible to eventually set up KGDB so it can be 
debugged with kdb. Once KGDB is mainline there are plenty of
issues that can be addressed; for example taking a kernel
core dump after dropping into kgdb and having the registers
show up correctly in Dave Anderson's crash utility.

-piet

 
 -Andi
-- 
Piet Delaney                            Phone: (408) 200-5256
Blue Lane Technologies                  Fax:   (408) 200-5299
10450 Bubb Rd.
Cupertino, Ca. 95014                    Email: [EMAIL PROTECTED]




Re: BUG: Null pointer dereference in fs/open.c

2007-04-23 Thread William Heimbigner
This bug occurs in linux-2.6.20 and 2.6.21-rc7-git5, and does not occur in 
linux-2.6.19-git22.


After running pktsetup 0 /dev/hdd, I get (timestamps removed):

pktcdvd: pkt_get_last_written failed
BUG: unable to handle kernel NULL pointer dereference at virtual address 
000e
printing eip:
c0173f69
*pde = 
Oops:  [#1]
PREEMPT
Modules linked in: snd_ca0106 snd_ac97_codec ac97_bus 8139cp 8139too iTCO_wdt
CPU:0
EIP:0060:[c0173f69]Not tainted VLI
EFLAGS: 00010203   (2.6.21-rc7-git5 #22)
EIP is at do_sys_open+0x59/0xd0
eax: 0002   ebx: 4020   ecx: 0001   edx: 0002
esi: df1e3000   edi: 0003   ebp: de17bfa4   esp: de17bf84
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process vol_id (pid: 4273, ti=de17b000 task=df4143f0 task.ti=de17b000)
Stack:  c013d2a5 ff9c 0002 c059cea3 bfb6bf64 8000 b7f60ff4
   de17bfb0 c017401c  de17b000 c01041c6 bfb6bf64 8000 
   8000 b7f60ff4 bfb6a798 0005 007b 007b  0005
Call Trace:
 [c010521a] show_trace_log_lvl+0x1a/0x30
 [c01052d9] show_stack_log_lvl+0xa9/0xd0
 [c010551c] show_registers+0x21c/0x3a0
 [c01057a4] die+0x104/0x260
 [c04c5947] do_page_fault+0x277/0x610
 [c04c408c] error_code+0x74/0x7c
 [c017401c] sys_open+0x1c/0x20
 [c01041c6] sysenter_past_esp+0x5f/0x99
 ===
Code: ff 85 c0 89 c7 78 77 8b 45 08 89 d9 89 f2 89 04 24 8b 45 e8 e8 69 ff 
ff ff 3d 00 f0 ff ff 89 45 ec 77 71 8b 55 ec bb 20 00 00 40 8b 42 0c 8b 
48 30 89 4d f0 0f b7 51 66 81 e2 00 f0 00 00 81 fa

EIP: [c0173f69] do_sys_open+0x59/0xd0 SS:ESP 0068:de17bf84


from fs/open.c, comments added:
// do_sys_open is consistently called with (dfd=0xff9c,
// filename=/dev/.tmp-254-0, flags=0x8000, mode=0)
long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
{
	char *tmp = getname(filename);
	int fd = PTR_ERR(tmp);

	if (!IS_ERR(tmp)) {
		fd = get_unused_fd();
		if (fd >= 0) {
			// do_filp_open consistently returns 2, in this case
			struct file *f = do_filp_open(dfd, tmp, flags, mode);
			// IS_ERR always returns 0 for this command
			if (IS_ERR(f)) {
				put_unused_fd(fd);
				fd = PTR_ERR(f);
			} else {
				// null pointer dereference occurs here
				fsnotify_open(f->f_path.dentry);
				fd_install(fd, f);
			}
		}
		putname(tmp);
	}
	return fd;
}

I was able to work around this by testing whether do_filp_open was returning
2 or not, but obviously this is a very temporary solution to a very
specific circumstance.
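
For anyone wondering why IS_ERR() lets a return value of 2 through, here is a userspace sketch of the error-pointer convention. The model_* names are mine, and the 4095 limit is an assumption about how the kernel's err.h defines the range: only the top few thousand addresses are treated as encoded negative errno values, so a small positive value like 2 looks like an ordinary (if bogus) pointer and gets dereferenced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095	/* assumed size of the error range */

/* True only for pointers in the top MODEL_MAX_ERRNO addresses. */
static inline bool model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Recover the (negative) errno encoded in an error pointer. */
static inline long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Encode a negative errno as a pointer. */
static inline void *model_err_ptr(long err)
{
	assert(err < 0 && err >= -MODEL_MAX_ERRNO);
	return (void *)(intptr_t)err;
}
```

In this model an encoded -2 is recognised as an error, while a raw 2 is not, so the else branch runs and the first field access faults. A struct file pointer of 2 plus a small field offset would also fit the reported fault address ending in e, though that is a guess from the register dump rather than something I have verified.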


If there is any more information I can provide, let me know.
William Heimbigner
[EMAIL PROTECTED]

