Re: Question about Reiser4
On Mon, Apr 23, 2007 at 06:52:16AM -0700, Eric Hopper wrote: Oh, two things really interest me about Reiser4. First, I despise having to care about how many tiny files I leave lying around when writing a program. Berkeley DB and its ilk are evil, evil programs that obscure data and make things harder. Secondly, the moves Reiser4 has made towards having actual transactions at the filesystem level also intrigue me. I want to use the filesystem as a DB. IMHO, there is no reason that filesystems shouldn't be a DB sans query language. If there were a more DB-like way to deal with filesystems, I think that it would be that much easier to make something that was a decent replacement for NFS and actually worked. One of the big problems of using a filesystem as a DB is the system call overhead. If you use huge numbers of tiny files, then each attempt to read an atom of information from the DB takes three system calls --- an open(), read(), and close(), with all of the overheads in terms of the dentry and inode caches. Hans of course had a solution to this problem --- namely the sys_reiser4 system call, where you download a program to the kernel to execute an open/read/close sequence via a single system call, and which returns the combined results to userspace. But now you have more complexity, since there is now a reiser4-specific interpreter embedded in the kernel, the userspace application needs to write the equivalent of a channel program such as what was found in an IBM/360 mainframe (need I mention this can be a rich source of security bugs), and then the userspace application *still* needs to parse the result returned by the sys_reiser4() system call. So it adds a huge amount of complexity, and at the end of the day, given that you don't have the search capability, it is (a) less functional, (b) more complicated, and (c) probably less performant than simply calling out to a database.
Sadly, unless someone pays me to maintain it, I can't do the fork myself, and I likely wouldn't anyway, as being a kernel hacker for something as important as a filesystem is a full-time job and I have other things that interest me a lot more. Unfortunately, the way OSS works is that you either (a) have to do the work yourself, (b) convince someone else to do the work, or (c) convince someone that it's worth paying you to do it. Personally, if I controlled a large budget for Linux filesystem development, I'd put a lot more money into something like Val's chunkfs idea than reiser4. Being able to have filesystems designed for fast recovery, given disks getting larger and larger (but not more reliable), is a whole lot more important than trying to create an alternate solution to an already solved problem --- namely that of a database. When you consider that a similar idea, WinFS, was partially responsible for delaying Vista by years due to the complexity of shoving a database where it has no place being, it's another reason why I personally think that chunkfs is a much more promising avenue for future filesystem investment than reiserfs. But hey, the advantage of Open Source is that if *you* want to work on Reiser4, you're perfectly free to do so. My personal opinion is that it'd be a waste of your time, but you're free to spend your time whichever way you want. What you don't get to do is whine about how other people get to spend *their* time, or *their* money. - Ted - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.20.7 locking up hard on boot
I'm honestly not sure how to try what you suggested, since I'm nothing even remotely close to a kernel geek and it was over my head. However, I'd gladly test anything that you think would be worth testing, if you would please put it in a way that I could understand, such as change line 'foo' in probe.c into 'foolio'. Thanks again for all of your help, Marcos On 4/23/07, Jan Beulich [EMAIL PROTECTED] wrote: Given that all of the reports are in cases when the adjustment is *not* being done (and only a message is being printed), I can only assume that the breakage results from the adding of PCI_BASE_ADDRESS_SPACE_IO into the resource flags. I considered this unconditional setting of the flags odd already in the original code, and added this extra flag only for consistency reasons (because the settings reported by X indicated that this was missing). Perhaps the adjustment (original and the added extra flag) shouldn't be done if IORESOURCE_IO wasn't already set. Perhaps one of those seeing the issue could try out returning from the function right after that printk(), without any adjustment to the flags. Jan
Re: [PATCH 04/25] xen: Add XEN config options
On Monday 23 April 2007 23:56:42 Jeremy Fitzhardinge wrote: The XEN config option enables the Xen paravirt_ops interface, which is installed when the kernel finds itself running under Xen. Xen is no longer a sub-architecture, so the X86_XEN subarch config option has gone. Xen is currently incompatible with PREEMPT, but this is fixed up later in the series. Shouldn't this be after the change that adds arch/i386/xen/Kconfig? Otherwise you break bisects -Andi
Re: [RFC][PATCH -mm 2/3] freezer: Introduce freezer_flags
On Tuesday, 24 April 2007 00:55, Oleg Nesterov wrote: On 04/24, Rafael J. Wysocki wrote: Should I clear it in dup_task_struct() or is there a better place? I personally think we should do this in dup_task_struct(). In fact, I believe it is better to replace the *tsk = *orig; with some helper (like setup_thread_stack() below), and that helper clears ->freezer_flags. Say, copy_task_struct(). Hmm, wouldn't that be overkill? copy_task_struct() would have to do *tsk = *orig anyway, and we only need to clear one field apart from this. Some other fields are cleared towards the end of dup_task_struct(), so perhaps we could clear freezer_flags in there too? Rafael
Re: Prevent softlockup triggering in nvidiafb
On Mon, Apr 23, 2007 at 04:55:05PM +0100, Alan Cox wrote: On Mon, 23 Apr 2007 11:36:30 -0400 Dave Jones [EMAIL PROTECTED] wrote: If the chip locks up, we get into a long polling loop, where the softlockup detector kicks in. See https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=151878 for an example. Surely in this situation the softlockup report and trap out is precisely what should be occurring. We can't do anything useful with the trace. It already prints out info that the hardware locked up. And when nvidiafb detects a lockup, it will go to safe mode. Better than rebooting, I think. Tony
[PATCH 04/25] xen: Add XEN config options
The XEN config option enables the Xen paravirt_ops interface, which is installed when the kernel finds itself running under Xen. Xen is no longer a sub-architecture, so the X86_XEN subarch config option has gone. Xen is currently incompatible with PREEMPT, but this is fixed up later in the series. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Signed-off-by: Ian Pratt [EMAIL PROTECTED] Signed-off-by: Christian Limpach [EMAIL PROTECTED] Signed-off-by: Chris Wright [EMAIL PROTECTED] --- arch/i386/Kconfig |2 ++ arch/i386/xen/Kconfig | 10 ++ 2 files changed, 12 insertions(+) === --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -216,6 +216,8 @@ config PARAVIRT under a hypervisor, improving performance significantly. However, when run without a hypervisor the kernel is theoretically slower. If in doubt, say N. + +source "arch/i386/xen/Kconfig" config VMI bool "VMI Paravirt-ops support" === --- /dev/null +++ b/arch/i386/xen/Kconfig @@ -0,0 +1,10 @@ +# +# This Kconfig describes xen options +# + +config XEN + bool "Enable support for Xen hypervisor" + depends on PARAVIRT && HZ_100 && !PREEMPT && !NO_HZ + default y + help + This is the Linux Xen port. --
[PATCH 08/25] xen: xen: fix multicall batching
Disable interrupts between allocating a multicall entry and actually issuing it, to prevent an interrupt from coming in, allocating and initializing further multicall entries, and then issuing them all, including the partially completed one. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] --- arch/i386/xen/enlighten.c | 44 +++- arch/i386/xen/mmu.c| 18 -- arch/i386/xen/multicalls.c |9 - arch/i386/xen/multicalls.h | 27 +++ arch/i386/xen/xen-ops.h|5 + 5 files changed, 71 insertions(+), 32 deletions(-) === --- a/arch/i386/xen/enlighten.c +++ b/arch/i386/xen/enlighten.c @@ -160,13 +160,25 @@ static void xen_halt(void) static void xen_set_lazy_mode(enum paravirt_lazy_mode mode) { - enum paravirt_lazy_mode *lazy = get_cpu_var(xen_lazy_mode); + switch(mode) { + case PARAVIRT_LAZY_NONE: + BUG_ON(x86_read_percpu(xen_lazy_mode) == PARAVIRT_LAZY_NONE); + break; + + case PARAVIRT_LAZY_MMU: + case PARAVIRT_LAZY_CPU: + BUG_ON(x86_read_percpu(xen_lazy_mode) != PARAVIRT_LAZY_NONE); + break; + + case PARAVIRT_LAZY_FLUSH: + /* flush if necessary, but don't change state */ + if (x86_read_percpu(xen_lazy_mode) != PARAVIRT_LAZY_NONE) + xen_mc_flush(); + return; + } xen_mc_flush(); - - *lazy = mode; - - put_cpu_var(xen_lazy_mode); + x86_write_percpu(xen_lazy_mode, mode); } static unsigned long xen_store_tr(void) @@ -193,7 +208,7 @@ static void xen_set_ldt(const void *addr MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); - xen_mc_issue(); + xen_mc_issue(PARAVIRT_LAZY_CPU); } static void xen_load_gdt(const struct Xgt_desc_struct *dtr) @@ -217,7 +232,7 @@ static void xen_load_gdt(const struct Xg MULTI_set_gdt(mcs.mc, frames, size/8); - xen_mc_issue(); + xen_mc_issue(PARAVIRT_LAZY_CPU); } static void load_TLS_descriptor(struct thread_struct *t, @@ -225,18 +240,20 @@ static void load_TLS_descriptor(struct t { struct desc_struct *gdt = get_cpu_gdt_table(cpu); xmaddr_t maddr = virt_to_machine(gdt[GDT_ENTRY_TLS_MIN+i]); - struct multicall_space mc = xen_mc_entry(0); + struct 
multicall_space mc = __xen_mc_entry(0); MULTI_update_descriptor(mc.mc, maddr.maddr, t-tls_array[i]); } static void xen_load_tls(struct thread_struct *t, unsigned int cpu) { + xen_mc_batch(); + load_TLS_descriptor(t, cpu, 0); load_TLS_descriptor(t, cpu, 1); load_TLS_descriptor(t, cpu, 2); - xen_mc_issue(); + xen_mc_issue(PARAVIRT_LAZY_CPU); } static void xen_write_ldt_entry(struct desc_struct *dt, int entrynum, u32 low, u32 high) @@ -356,13 +373,9 @@ static void xen_load_esp0(struct tss_str static void xen_load_esp0(struct tss_struct *tss, struct thread_struct *thread) { - if (xen_get_lazy_mode() != PARAVIRT_LAZY_CPU) { - if (HYPERVISOR_stack_switch(__KERNEL_DS, thread-esp0)) - BUG(); - } else { - struct multicall_space mcs = xen_mc_entry(0); - MULTI_stack_switch(mcs.mc, __KERNEL_DS, thread-esp0); - } + struct multicall_space mcs = xen_mc_entry(0); + MULTI_stack_switch(mcs.mc, __KERNEL_DS, thread-esp0); + xen_mc_issue(PARAVIRT_LAZY_CPU); } static void xen_set_iopl_mask(unsigned mask) @@ -452,7 +465,7 @@ static void xen_write_cr3(unsigned long MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); - xen_mc_issue(); + xen_mc_issue(PARAVIRT_LAZY_CPU); } } === --- a/arch/i386/xen/mmu.c +++ b/arch/i386/xen/mmu.c @@ -344,7 +344,7 @@ static int pin_page(struct page *page, u else { void *pt = lowmem_page_address(page); unsigned long pfn = page_to_pfn(page); - struct multicall_space mcs = xen_mc_entry(0); + struct multicall_space mcs = __xen_mc_entry(0); flush = 0; @@ -364,10 +364,12 @@ void xen_pgd_pin(pgd_t *pgd) struct multicall_space mcs; struct mmuext_op *op; + xen_mc_batch(); + if (pgd_walk(pgd, pin_page, TASK_SIZE)) kmap_flush_unused(); - mcs = xen_mc_entry(sizeof(*op)); + mcs = __xen_mc_entry(sizeof(*op)); op = mcs.args; #ifdef CONFIG_X86_PAE @@ -378,7 +380,7 @@ void xen_pgd_pin(pgd_t *pgd) op-arg1.mfn = pfn_to_mfn(PFN_DOWN(__pa(pgd))); MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); - xen_mc_flush(); + xen_mc_issue(0); } /* The init_mm pagetable is really pinned as 
soon as its
[PATCH 10/25] xen: Implement xen_sched_clock
Implement xen_sched_clock, which returns the number of ns the current vcpu has been actually in the running state (vs blocked, runnable-but-not-running, or offline) since boot. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Cc: john stultz [EMAIL PROTECTED] --- arch/i386/xen/enlighten.c |2 +- arch/i386/xen/time.c | 22 +- arch/i386/xen/xen-ops.h |3 +-- 3 files changed, 23 insertions(+), 4 deletions(-) === --- a/arch/i386/xen/enlighten.c +++ b/arch/i386/xen/enlighten.c @@ -676,7 +676,7 @@ static const struct paravirt_ops xen_par .set_wallclock = xen_set_wallclock, .get_wallclock = xen_get_wallclock, .get_cpu_khz = xen_cpu_khz, - .sched_clock = xen_clocksource_read, + .sched_clock = xen_sched_clock, #ifdef CONFIG_X86_LOCAL_APIC .apic_write = paravirt_nop, === --- a/arch/i386/xen/time.c +++ b/arch/i386/xen/time.c @@ -16,6 +16,8 @@ #define XEN_SHIFT 22 #define TIMER_SLOP 10 /* Xen may fire a timer up to this many ns early */ #define NS_PER_TICK(10ll / HZ) + +static cycle_t xen_clocksource_read(void); /* These are perodically updated in shared_info, and then copied here. */ struct shadow_time_info { @@ -118,6 +120,24 @@ static void do_stolen_accounting(void) account_steal_time(idle_task(smp_processor_id()), ticks); } +/* + * Xen sched_clock implementation. Returns the number of unstolen + * nanoseconds, which is nanoseconds the VCPU spent in RUNNING+BLOCKED + * states. 
+ */ +unsigned long long xen_sched_clock(void) +{ + struct vcpu_runstate_info state; + cycle_t now = xen_clocksource_read(); + + get_runstate_snapshot(state); + + WARN_ON(state.state != RUNSTATE_running); + + return state.time[RUNSTATE_blocked] + + state.time[RUNSTATE_running] + + (now - state.state_entry_time); +} /* Get the CPU speed from Xen */ @@ -209,7 +229,7 @@ static u64 get_nsec_offset(struct shadow return scale_delta(delta, shadow-tsc_to_nsec_mul, shadow-tsc_shift); } -cycle_t xen_clocksource_read(void) +static cycle_t xen_clocksource_read(void) { struct shadow_time_info *shadow = get_cpu_var(shadow_time); cycle_t ret; === --- a/arch/i386/xen/xen-ops.h +++ b/arch/i386/xen/xen-ops.h @@ -2,7 +2,6 @@ #define XEN_OPS_H #include linux/init.h -#include linux/clocksource.h DECLARE_PER_CPU(struct vcpu_info *, xen_vcpu); DECLARE_PER_CPU(unsigned long, xen_cr3); @@ -18,7 +17,7 @@ void __init xen_time_init(void); void __init xen_time_init(void); unsigned long xen_get_wallclock(void); int xen_set_wallclock(unsigned long time); -cycle_t xen_clocksource_read(void); +unsigned long long xen_sched_clock(void); void xen_mark_init_mm_pinned(void); -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 01/25] xen: Add apply_to_page_range() which applies a function to a pte range.
Add a new mm function apply_to_page_range() which applies a given function to every pte in a given virtual address range in a given mm structure. This is a generic alternative to cut-and-pasting the Linux idiomatic pagetable walking code in every place that a sequence of PTEs must be accessed. Although this interface is intended to be useful in a wide range of situations, it is currently used specifically by several Xen subsystems, for example: to ensure that pagetables have been allocated for a virtual address range, and to construct batched special pagetable update requests to map I/O memory (in ioremap()). Signed-off-by: Ian Pratt [EMAIL PROTECTED] Signed-off-by: Christian Limpach [EMAIL PROTECTED] Signed-off-by: Chris Wright [EMAIL PROTECTED] Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Cc: Christoph Lameter [EMAIL PROTECTED] Cc: Matt Mackall [EMAIL PROTECTED] Acked-by: Ingo Molnar [EMAIL PROTECTED] --- include/linux/mm.h |5 ++ mm/memory.c| 94 2 files changed, 99 insertions(+) === --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1135,6 +1135,11 @@ struct page *follow_page(struct vm_area_ #define FOLL_GET 0x04/* do get_page on page */ #define FOLL_ANON 0x08/* give ZERO_PAGE if no pgtable */ +typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr, + void *data); +extern int apply_to_page_range(struct mm_struct *mm, unsigned long address, + unsigned long size, pte_fn_t fn, void *data); + #ifdef CONFIG_PROC_FS void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long); #else === --- a/mm/memory.c +++ b/mm/memory.c @@ -1448,6 +1448,100 @@ int remap_pfn_range(struct vm_area_struc } EXPORT_SYMBOL(remap_pfn_range); +static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, +unsigned long addr, unsigned long end, +pte_fn_t fn, void *data) +{ + pte_t *pte; + int err; + struct page *pmd_page; + spinlock_t *ptl; + + pte = (mm == init_mm) ? 
+ pte_alloc_kernel(pmd, addr) : + pte_alloc_map_lock(mm, pmd, addr, ptl); + if (!pte) + return -ENOMEM; + + BUG_ON(pmd_huge(*pmd)); + + pmd_page = pmd_page(*pmd); + + do { + err = fn(pte, pmd_page, addr, data); + if (err) + break; + } while (pte++, addr += PAGE_SIZE, addr != end); + + if (mm != init_mm) + pte_unmap_unlock(pte-1, ptl); + return err; +} + +static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud, +unsigned long addr, unsigned long end, +pte_fn_t fn, void *data) +{ + pmd_t *pmd; + unsigned long next; + int err; + + pmd = pmd_alloc(mm, pud, addr); + if (!pmd) + return -ENOMEM; + do { + next = pmd_addr_end(addr, end); + err = apply_to_pte_range(mm, pmd, addr, next, fn, data); + if (err) + break; + } while (pmd++, addr = next, addr != end); + return err; +} + +static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd, +unsigned long addr, unsigned long end, +pte_fn_t fn, void *data) +{ + pud_t *pud; + unsigned long next; + int err; + + pud = pud_alloc(mm, pgd, addr); + if (!pud) + return -ENOMEM; + do { + next = pud_addr_end(addr, end); + err = apply_to_pmd_range(mm, pud, addr, next, fn, data); + if (err) + break; + } while (pud++, addr = next, addr != end); + return err; +} + +/* + * Scan a region of virtual memory, filling in page tables as necessary + * and calling a provided function on each leaf page table. + */ +int apply_to_page_range(struct mm_struct *mm, unsigned long addr, + unsigned long size, pte_fn_t fn, void *data) +{ + pgd_t *pgd; + unsigned long next; + unsigned long end = addr + size; + int err; + + BUG_ON(addr = end); + pgd = pgd_offset(mm, addr); + do { + next = pgd_addr_end(addr, end); + err = apply_to_pud_range(mm, pgd, addr, next, fn, data); + if (err) + break; + } while (pgd++, addr = next, addr != end); + return err; +} +EXPORT_SYMBOL_GPL(apply_to_page_range); + /* * handle_pte_fault chooses page fault handler according to an entry * which was read non-atomically. Before making any commitment, on -- - To unsubscribe
[PATCH 24/25] xen: xen: diddle netfront
Move things around a bit to match xen-unstable netfront. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] --- drivers/net/xen-netfront.c | 36 +--- 1 file changed, 17 insertions(+), 19 deletions(-) === --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -750,19 +750,6 @@ no_skb: notify_remote_via_irq(np-irq); } -static void xennet_move_rx_slot(struct netfront_info *np, struct sk_buff *skb, - grant_ref_t ref) -{ - int new = xennet_rxidx(np-rx.req_prod_pvt); - - BUG_ON(np-rx_skbs[new]); - np-rx_skbs[new] = skb; - np-grant_rx_ref[new] = ref; - RING_GET_REQUEST(np-rx, np-rx.req_prod_pvt)-id = new; - RING_GET_REQUEST(np-rx, np-rx.req_prod_pvt)-gref = ref; - np-rx.req_prod_pvt++; -} - static void xennet_make_frags(struct sk_buff *skb, struct net_device *dev, struct netif_tx_request *tx) { @@ -944,6 +931,19 @@ static irqreturn_t netif_int(int irq, vo spin_unlock_irqrestore(np-tx_lock, flags); return IRQ_HANDLED; +} + +static void xennet_move_rx_slot(struct netfront_info *np, struct sk_buff *skb, + grant_ref_t ref) +{ + int new = xennet_rxidx(np-rx.req_prod_pvt); + + BUG_ON(np-rx_skbs[new]); + np-rx_skbs[new] = skb; + np-grant_rx_ref[new] = ref; + RING_GET_REQUEST(np-rx, np-rx.req_prod_pvt)-id = new; + RING_GET_REQUEST(np-rx, np-rx.req_prod_pvt)-gref = ref; + np-rx.req_prod_pvt++; } static void handle_incoming_queue(struct net_device *dev, struct sk_buff_head *rxq) @@ -1169,7 +1169,8 @@ static RING_IDX xennet_fill_frags(struct return cons; } -static int xennet_set_skb_gso(struct sk_buff *skb, struct netif_extra_info *gso) +static int xennet_set_skb_gso(struct sk_buff *skb, + struct netif_extra_info *gso) { if (!gso-u.gso.size) { if (net_ratelimit()) @@ -1456,11 +1457,8 @@ static void netif_release_rx_bufs(struct if (!xen_feature(XENFEAT_auto_translated_physmap)) { /* Do all the remapping work and M2P updates. 
*/ - mcl->op = __HYPERVISOR_mmu_update; - mcl->args[0] = (unsigned long)np->rx_mmu; - mcl->args[1] = mmu - np->rx_mmu; - mcl->args[2] = 0; - mcl->args[3] = DOMID_SELF; + MULTI_mmu_update(mcl, np->rx_mmu, mmu - np->rx_mmu, +0, DOMID_SELF); mcl++; HYPERVISOR_multicall(np->rx_mcl, mcl - np->rx_mcl); } --
[PATCH 25/25] xen: Xen machine operations
Make the appropriate hypercalls to halt and reboot the virtual machine. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] --- arch/i386/xen/enlighten.c | 43 +++ arch/i386/xen/smp.c |4 +--- 2 files changed, 44 insertions(+), 3 deletions(-) === --- a/arch/i386/xen/enlighten.c +++ b/arch/i386/xen/enlighten.c @@ -14,6 +14,7 @@ #include xen/interface/xen.h #include xen/interface/vcpu.h +#include xen/interface/sched.h #include xen/features.h #include xen/page.h @@ -28,6 +29,7 @@ #include asm/pgtable.h #include asm/smp.h #include asm/tlbflush.h +#include asm/reboot.h #include xen-ops.h #include mmu.h @@ -787,6 +789,45 @@ static const struct smp_ops xen_smp_ops }; #endif /* CONFIG_SMP */ +static void xen_reboot(int reason) +{ +#ifdef CONFIG_SMP + smp_send_stop(); +#endif + + if (HYPERVISOR_sched_op(SCHEDOP_shutdown, reason)) + BUG(); +} + +static void xen_restart(char *msg) +{ + xen_reboot(SHUTDOWN_reboot); +} + +static void xen_emergency_restart(void) +{ + xen_reboot(SHUTDOWN_reboot); +} + +static void xen_machine_halt(void) +{ + xen_reboot(SHUTDOWN_poweroff); +} + +static void xen_crash_shutdown(struct pt_regs *regs) +{ + xen_reboot(SHUTDOWN_crash); +} + +static const struct machine_ops __initdata xen_machine_ops = { + .restart = xen_restart, + .halt = xen_machine_halt, + .power_off = xen_machine_halt, + .shutdown = xen_machine_halt, + .crash_shutdown = xen_crash_shutdown, + .emergency_restart = xen_emergency_restart, +}; + /* First C function to be called on Xen boot */ static asmlinkage void __init xen_start_kernel(void) { @@ -800,6 +841,8 @@ static asmlinkage void __init xen_start_ /* Install Xen paravirt ops */ paravirt_ops = xen_paravirt_ops; + machine_ops = xen_machine_ops; + #ifdef CONFIG_SMP smp_ops = xen_smp_ops; #endif === --- a/arch/i386/xen/smp.c +++ b/arch/i386/xen/smp.c @@ -303,9 +303,7 @@ static void stop_self(void *v) void xen_smp_send_stop(void) { - cpumask_t mask = cpu_online_map; - cpu_clear(smp_processor_id(), mask); - 
xen_smp_call_function_mask(mask, stop_self, NULL, 0); + smp_call_function(stop_self, NULL, 0, 0); } void xen_smp_send_reschedule(int cpu) --
[PATCH 21/25] xen: Add the Xen virtual network device driver.
The network device frontend driver allows the kernel to access network devices exported by a virtual machine containing a physical network device driver. Signed-off-by: Ian Pratt [EMAIL PROTECTED] Signed-off-by: Christian Limpach [EMAIL PROTECTED] Signed-off-by: Chris Wright [EMAIL PROTECTED] Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Cc: Jeff Garzik [EMAIL PROTECTED] Cc: Stephen Hemminger [EMAIL PROTECTED] --- drivers/net/Kconfig| 12 drivers/net/Makefile |2 drivers/net/xen-netfront.c | 1957 3 files changed, 1971 insertions(+) === --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -2508,6 +2508,18 @@ source "drivers/atm/Kconfig" source "drivers/s390/net/Kconfig" +config XEN_NETDEV_FRONTEND + tristate "Xen network device frontend driver" + depends on XEN + default y + help + The network device frontend driver allows the kernel to + access network devices exported by a virtual + machine containing a physical network device driver. The + frontend driver is intended for unprivileged guest domains; + if you are compiling a kernel for a Xen guest, you almost + certainly want to enable this. + config ISERIES_VETH tristate "iSeries Virtual Ethernet driver support" depends on PPC_ISERIES === --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -218,3 +218,5 @@ obj-$(CONFIG_FS_ENET) += fs_enet/ obj-$(CONFIG_FS_ENET) += fs_enet/ obj-$(CONFIG_NETXEN_NIC) += netxen/ + +obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o === --- /dev/null +++ b/drivers/net/xen-netfront.c @@ -0,0 +1,1957 @@ +/** + * Virtual network driver for conversing with remote driver backends.
+ * + * Copyright (c) 2002-2005, K A Fraser + * Copyright (c) 2005, XenSource Ltd + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation; or, when distributed + * separately from the Linux kernel or incorporated into other + * software packages, subject to the following license: + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this source file (the Software), to deal in the Software without + * restriction, including without limitation the rights to use, copy, modify, + * merge, publish, distribute, sublicense, and/or sell copies of the Software, + * and to permit persons to whom the Software is furnished to do so, subject to + * the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS + * IN THE SOFTWARE. 
+ */ + +#include linux/module.h +#include linux/version.h +#include linux/kernel.h +#include linux/netdevice.h +#include linux/etherdevice.h +#include linux/skbuff.h +#include linux/ethtool.h +#include linux/in.h +#include linux/if_ether.h +#include linux/moduleparam.h +#include linux/mm.h +#include xen/xenbus.h +#include xen/interface/io/netif.h +#include xen/interface/memory.h +#ifdef CONFIG_XEN_BALLOON +#include xen/balloon.h +#endif +#include xen/interface/grant_table.h + +#include xen/events.h +#include xen/page.h +#include xen/grant_table.h + +/* + * Mutually-exclusive module options to select receive data path: + * rx_copy : Packets are copied by network backend into local memory + * rx_flip : Page containing packet data is transferred to our ownership + * For fully-virtualised guests there is no option - copying must be used. + * For paravirtualised guests, flipping is the default. + */ +static int rx_copy; +module_param(rx_copy, bool, 0); +MODULE_PARM_DESC(rx_copy, Copy packets from network card (rather than flip)); +static int rx_flip; +module_param(rx_flip, bool, 0); +MODULE_PARM_DESC(rx_flip, Flip packets from network card (rather than copy)); + +#define RX_COPY_THRESHOLD 256 + +#define GRANT_INVALID_REF 0 + +#define NET_TX_RING_SIZE __RING_SIZE((struct netif_tx_sring *)0, PAGE_SIZE) +#define NET_RX_RING_SIZE __RING_SIZE((struct netif_rx_sring *)0, PAGE_SIZE) + +struct netfront_info { + struct list_head list;
[PATCH 07/25] xen: Complete pagetable pinning for Xen
Xen has a notion of pinned pagetables, which are pagetables that remain read-only to the guest and are validated by the hypervisor. This makes context switches much cheaper, because the hypervisor doesn't need to revalidate the pagetable each time. This patch adds a PG_pinned flag for pagetable pages so we can tell whether a page has been pinned or not. This allows various pagetable update optimisations. This also adds a mm parameter to the alloc_pt pv_op, so that Xen can see if we're adding a page to a pinned pagetable. This is not necessary for alloc_pd or release_p[dt], which is fortunate because it isn't available at all callsites. This also adds a new paravirt hook which is called during setup once the zones and memory allocator have been initialized. When the init_mm pagetable is first built, the struct page array does not yet exist, and so there's nowhere to put the init_mm pagetable's PG_pinned flags. Once the zones are initialized and the struct page array exists, we can set the PG_pinned flags for those pages. This patch also adds the Xen support for pte pages allocated out of highmem (highpte), principally by implementing xen_kmap_atomic_pte. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Cc: Zach Amsden [EMAIL PROTECTED] --- arch/i386/kernel/setup.c|3 arch/i386/kernel/vmi.c |2 arch/i386/mm/init.c |2 arch/i386/mm/pageattr.c |2 arch/i386/xen/enlighten.c | 105 +++- arch/i386/xen/mmu.c | 280 +++ arch/i386/xen/mmu.h |2 arch/i386/xen/xen-ops.h |2 include/asm-i386/paravirt.h | 16 +- include/asm-i386/pgalloc.h |6 include/asm-i386/setup.h|4 include/linux/page-flags.h |5 12 files changed, 289 insertions(+), 140 deletions(-) === --- a/arch/i386/kernel/setup.c +++ b/arch/i386/kernel/setup.c @@ -607,9 +607,12 @@ void __init setup_arch(char **cmdline_p) sparse_init(); zone_sizes_init(); + /* * NOTE: at this point the bootmem allocator is fully available.
*/ + + paravirt_post_allocator_init(); dmi_scan_machine(); === --- a/arch/i386/kernel/vmi.c +++ b/arch/i386/kernel/vmi.c @@ -361,7 +361,7 @@ static void *vmi_kmap_atomic_pte(struct } #endif -static void vmi_allocate_pt(u32 pfn) +static void vmi_allocate_pt(struct mm_struct *mm, u32 pfn) { vmi_set_page_type(pfn, VMI_PAGE_L1); vmi_ops.allocate_page(pfn, VMI_PAGE_L1, 0, 0, 0); === --- a/arch/i386/mm/init.c +++ b/arch/i386/mm/init.c @@ -87,7 +87,7 @@ static pte_t * __init one_page_table_ini if (pmd_none(*pmd)) { pte_t *page_table = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE); - paravirt_alloc_pt(__pa(page_table) PAGE_SHIFT); + paravirt_alloc_pt(init_mm, __pa(page_table) PAGE_SHIFT); set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE)); BUG_ON(page_table != pte_offset_kernel(pmd, 0)); } === --- a/arch/i386/mm/pageattr.c +++ b/arch/i386/mm/pageattr.c @@ -60,7 +60,7 @@ static struct page *split_large_page(uns address = __pa(address); addr = address LARGE_PAGE_MASK; pbase = (pte_t *)page_address(base); - paravirt_alloc_pt(page_to_pfn(base)); + paravirt_alloc_pt(init_mm, page_to_pfn(base)); for (i = 0; i PTRS_PER_PTE; i++, addr += PAGE_SIZE) { set_pte(pbase[i], pfn_pte(addr PAGE_SHIFT, addr == address ? prot : ref_prot)); === --- a/arch/i386/xen/enlighten.c +++ b/arch/i386/xen/enlighten.c @@ -8,6 +8,9 @@ #include linux/sched.h #include linux/bootmem.h #include linux/module.h +#include linux/mm.h +#include linux/page-flags.h +#include linux/highmem.h #include xen/interface/xen.h #include xen/features.h @@ -453,32 +456,59 @@ static void xen_write_cr3(unsigned long } } -static void xen_alloc_pt(u32 pfn) -{ - /* XXX pfn isn't necessarily a lowmem page */ +/* Early in boot, while setting up the initial pagetable, assume + everything is pinned. 
*/ +static void xen_alloc_pt_init(struct mm_struct *mm, u32 pfn) +{ + BUG_ON(mem_map); /* should only be used early */ make_lowmem_page_readonly(__va(PFN_PHYS(pfn))); } -static void xen_alloc_pd(u32 pfn) -{ - make_lowmem_page_readonly(__va(PFN_PHYS(pfn))); -} - -static void xen_release_pd(u32 pfn) -{ - make_lowmem_page_readwrite(__va(PFN_PHYS(pfn))); -} - +/* This needs to make sure the new pte page is pinned iff it's being + attached to a pinned pagetable. */ +static void xen_alloc_pt(struct mm_struct *mm,
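The PG_pinned flag described above is just a per-page bit, and the new mm parameter exists so alloc_pt can ask "is the pagetable I'm joining already pinned?". A minimal userspace sketch of that decision (all names here are illustrative stand-ins, not the kernel API):

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's per-page flag word. */
#define PG_PINNED (1UL << 0)

struct fake_page {
    unsigned long flags;
};

static void set_page_pinned(struct fake_page *pg)   { pg->flags |= PG_PINNED; }
static void clear_page_pinned(struct fake_page *pg) { pg->flags &= ~PG_PINNED; }

static int page_is_pinned(const struct fake_page *pg)
{
    return (pg->flags & PG_PINNED) != 0;
}

/* Mirrors the patch's rule: a freshly allocated pte page must be made
 * read-only (pinned) iff it is being attached to a pagetable whose
 * root page is already pinned. */
static int new_pte_needs_pinning(const struct fake_page *pgd_page)
{
    return page_is_pinned(pgd_page);
}
```

The real code hangs this bit off struct page precisely because, as the description notes, the struct page array does not exist yet when init_mm's pagetable is built, hence the post-allocator hook to set the flags retroactively.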
Re: [PATCH 00/25] xen: Xen implementation for paravirt_ops
Andi Kleen wrote: On Monday 23 April 2007 23:56:38 Jeremy Fitzhardinge wrote: Hi Andi, It applies to 2.6.21-rc7 + your patches + the last batch of pv_ops patches I got most of those except for the broken sched_clock change. Er, we had a bit of back-and-forward with that. How did that end up? I posted. How much testing outside Jeremylabs has it gotten? Some beta testing before merging would be good, otherwise we'll just have a flood of fixes shortly when it is exposed to users. Yes. I'm just prepping a tree for xen-devel, and I primed people at the Xen Summit last week. This patch generally restricts itself to Xen-specific parts of the tree, though it does make a few small changes elsewhere. The general problem is that it is much more than just an architecture update. These patches include: - some helper routines for allocating address space and walking pagetables Needs review from mm people. These have been pretty well looked at already. They have been posted repeatedly, and I think all the comments have been sorted out. alloc_vm_area() will be a bit affected by Andrew's -mm patch to make vmalloc_sync_all a globally-visible arch export, but they merge nicely. - Xen interface header files - Core Xen implementation - Efficient late-pinning/early-unpinning pagetable handling The number of new paravirt hooks makes me think of renaming it to everything_ops. There's only one new op in this series, and I couldn't work out a way to avoid it, other than putting an #ifdef CONFIG_XEN in kernel/setup.c. The last patch posting didn't add any new hooks. Which ones are you referring to? - Virtualized time, including stolen time Can you let it be reviewed by the time people? (Thomas, Ingo, John, Roman etc.) Thomas has looked at and generally approves of the Xen clocksource/event code. The stolen time code is really only used to generate a few numbers in /proc, and so has very little direct impact on the rest of the kernel, and hasn't really attracted much interest as a result. 
I've posted the patch to implement sched_clock in terms of unstolen time to the various time people repeatedly, and nobody has responded, so I guess it doesn't irritate anyone too much; it would be nice to have some definite feedback though. - Xen console, based on hvc console - Xenbus That one would need to be reviewed first. It's so much code that I can't do it all myself. I put a specific plea for GregKH to look at this. - Netfront, the paravirtualized network device That one should go through the network device maintainer/netdev. Stephen Hemminger has looked at this in the past and we've addressed all his comments so far. But it would be nice to get some more net developers to review this; it was cc:d to netdev. - Blockfront, the paravirtualized block device And that needs a block device review and whoever maintains that (Jens?) He was cc:d. I'll ask him specifically. J - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 18/25] xen: Add Xen grant table support
Add Xen 'grant table' driver which allows granting of access to selected local memory pages by other virtual machines and, symmetrically, the mapping of remote memory pages which other virtual machines have granted access to. This driver is a prerequisite for many of the Xen virtual device drivers, which grant the 'device driver domain' restricted and temporary access to only those memory pages that are currently involved in I/O operations. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Signed-off-by: Ian Pratt [EMAIL PROTECTED] Signed-off-by: Christian Limpach [EMAIL PROTECTED] Signed-off-by: Chris Wright [EMAIL PROTECTED] --- drivers/xen/Makefile|1 drivers/xen/grant-table.c | 576 +++ include/xen/grant_table.h | 107 ++ include/xen/interface/grant_table.h | 112 +- 4 files changed, 777 insertions(+), 19 deletions(-) === --- a/drivers/xen/Makefile +++ b/drivers/xen/Makefile @@ -1,1 +1,2 @@ obj-y += hvc-console.o +obj-y += grant-table.o obj-y += hvc-console.o === --- /dev/null +++ b/drivers/xen/grant-table.c @@ -0,0 +1,576 @@ +/** + * grant_table.c + * + * Granting foreign access to our memory reservation. 
+ * + * Copyright (c) 2005-2006, Christopher Clark + * Copyright (c) 2004-2005, K A Fraser + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation; or, when distributed + * separately from the Linux kernel or incorporated into other + * software packages, subject to the following license: + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this source file (the Software), to deal in the Software without + * restriction, including without limitation the rights to use, copy, modify, + * merge, publish, distribute, sublicense, and/or sell copies of the Software, + * and to permit persons to whom the Software is furnished to do so, subject to + * the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS + * IN THE SOFTWARE. + */ + +#include linux/module.h +#include linux/sched.h +#include linux/mm.h +#include linux/vmalloc.h + +#include xen/interface/xen.h +#include xen/page.h +#include xen/grant_table.h + +#include asm/pgtable.h +#include asm/uaccess.h +#include asm/sync_bitops.h + + +/* External tools reserve first few grant table entries. 
*/ +#define NR_RESERVED_ENTRIES 8 +#define GNTTAB_LIST_END 0x +#define GREFS_PER_GRANT_FRAME (PAGE_SIZE / sizeof(struct grant_entry)) + +static grant_ref_t **gnttab_list; +static unsigned int nr_grant_frames; +static unsigned int boot_max_nr_grant_frames; +static int gnttab_free_count; +static grant_ref_t gnttab_free_head; +static DEFINE_SPINLOCK(gnttab_list_lock); + +static struct grant_entry *shared; + +static struct gnttab_free_callback *gnttab_free_callback_list; + +static int gnttab_expand(unsigned int req_entries); + +#define RPP (PAGE_SIZE / sizeof(grant_ref_t)) +#define gnttab_entry(entry) (gnttab_list[(entry) / RPP][(entry) % RPP]) + +static int get_free_entries(int count) +{ + unsigned long flags; + int ref, rc; + grant_ref_t head; + + spin_lock_irqsave(gnttab_list_lock, flags); + + if ((gnttab_free_count count) + ((rc = gnttab_expand(count - gnttab_free_count)) 0)) { + spin_unlock_irqrestore(gnttab_list_lock, flags); + return rc; + } + + ref = head = gnttab_free_head; + gnttab_free_count -= count; + while (count-- 1) + head = gnttab_entry(head); + gnttab_free_head = gnttab_entry(head); + gnttab_entry(head) = GNTTAB_LIST_END; + + spin_unlock_irqrestore(gnttab_list_lock, flags); + + return ref; +} + +#define get_free_entry() get_free_entries(1) + +static void do_free_callbacks(void) +{ + struct gnttab_free_callback *callback, *next; + + callback = gnttab_free_callback_list; + gnttab_free_callback_list = NULL; + + while (callback != NULL) { + next = callback-next; + if (gnttab_free_count = callback-count) { +
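The free-list logic visible above threads the list through the grant table itself: each free entry stores the index of the next free entry, and get_free_entries() pops a chain of `count` of them under the lock. A userspace sketch of just that list discipline, with illustrative sizes and names (no locking, no table growth):

```c
#include <assert.h>

/* Sketch of the grant-table free list: entries form a singly linked
 * list threaded through the table itself, as in grant-table.c.
 * Sizes and names are illustrative, not the kernel's. */
#define LIST_END 0xffffffffu
#define NENTRIES 16

static unsigned int table[NENTRIES];
static unsigned int free_head;
static int free_count;

static void gnttab_init_sketch(void)
{
    for (unsigned int i = 0; i < NENTRIES - 1; i++)
        table[i] = i + 1;          /* entry i links to entry i+1 */
    table[NENTRIES - 1] = LIST_END;
    free_head = 0;
    free_count = NENTRIES;
}

/* Pop `count` chained entries; return the first ref, or -1 if there
 * aren't enough free entries (the real code tries to expand first). */
static int get_free_entries_sketch(int count)
{
    if (free_count < count)
        return -1;

    unsigned int ref = free_head;
    unsigned int head = free_head;
    for (int i = 1; i < count; i++)
        head = table[head];        /* walk to the last entry taken */

    free_head = table[head];       /* list now starts past the chain */
    table[head] = LIST_END;        /* detach the taken chain */
    free_count -= count;
    return (int)ref;
}
```

Storing the links inside the table itself is why the real code can grow the table a frame at a time without reallocating any side structure.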
[PATCH 20/25] xen: Add Xen virtual block device driver.
The block device frontend driver allows the kernel to access block devices exported by a virtual machine containing a physical block device driver. Signed-off-by: Ian Pratt [EMAIL PROTECTED] Signed-off-by: Christian Limpach [EMAIL PROTECTED] Signed-off-by: Chris Wright [EMAIL PROTECTED] Cc: Arjan van de Ven [EMAIL PROTECTED] Cc: Greg KH [EMAIL PROTECTED] Cc: Jens Axboe [EMAIL PROTECTED] --- drivers/block/Kconfig | 1 drivers/block/Makefile | 1 drivers/block/xen/Kconfig | 14 drivers/block/xen/Makefile | 5 drivers/block/xen/blkfront.c | 844 ++ drivers/block/xen/block.h | 135 ++ drivers/block/xen/vbd.c | 229 +++ include/linux/major.h | 2 8 files changed, 1231 insertions(+) === --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -445,6 +445,7 @@ config CDROM_PKTCDVD_WCACHE don't do deferred write error handling yet. source "drivers/s390/block/Kconfig" +source "drivers/block/xen/Kconfig" config ATA_OVER_ETH tristate "ATA over Ethernet support" === --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -29,3 +29,4 @@ obj-$(CONFIG_BLK_DEV_SX8) += sx8.o obj-$(CONFIG_BLK_DEV_SX8) += sx8.o obj-$(CONFIG_BLK_DEV_UB) += ub.o +obj-$(CONFIG_XEN) += xen/ === --- /dev/null +++ b/drivers/block/xen/Kconfig @@ -0,0 +1,14 @@ +menu "Xen block device drivers" +depends on XEN + +config XEN_BLKDEV_FRONTEND + tristate "Block device frontend driver" + depends on XEN + default y + help + The block device frontend driver allows the kernel to access block + devices exported from a device driver virtual machine. Unless you + are building a dedicated device driver virtual machine, then you + almost certainly want to say Y here. + +endmenu === --- /dev/null +++ b/drivers/block/xen/Makefile @@ -0,0 +1,5 @@ + +obj-$(CONFIG_XEN_BLKDEV_FRONTEND) := xenblk.o + +xenblk-objs := blkfront.o vbd.o + === --- /dev/null +++ b/drivers/block/xen/blkfront.c @@ -0,0 +1,844 @@ +/** + * blkfront.c + * + * XenLinux virtual block device driver. 
+ * + * Copyright (c) 2003-2004, Keir Fraser Steve Hand + * Modifications by Mark A. Williamson are (c) Intel Research Cambridge + * Copyright (c) 2004, Christian Limpach + * Copyright (c) 2004, Andrew Warfield + * Copyright (c) 2005, Christopher Clark + * Copyright (c) 2005, XenSource Ltd + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation; or, when distributed + * separately from the Linux kernel or incorporated into other + * software packages, subject to the following license: + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this source file (the Software), to deal in the Software without + * restriction, including without limitation the rights to use, copy, modify, + * merge, publish, distribute, sublicense, and/or sell copies of the Software, + * and to permit persons to whom the Software is furnished to do so, subject to + * the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS + * IN THE SOFTWARE. 
+ */ + +#include linux/version.h +#include block.h +#include linux/cdrom.h +#include linux/sched.h +#include linux/interrupt.h +#include scsi/scsi.h +#include xen/xenbus.h +#include xen/interface/grant_table.h +#include xen/grant_table.h +#include xen/events.h +#include xen/page.h +#include asm/xen/hypervisor.h + +#define BLKIF_STATE_DISCONNECTED 0 +#define BLKIF_STATE_CONNECTED1 +#define BLKIF_STATE_SUSPENDED2 + +#define MAXIMUM_OUTSTANDING_BLOCK_REQS \ +(BLKIF_MAX_SEGMENTS_PER_REQUEST * BLK_RING_SIZE) +#define GRANT_INVALID_REF 0 + +static void connect(struct blkfront_info *); +static void blkfront_closing(struct xenbus_device *); +static int blkfront_remove(struct xenbus_device *); +static int
[PATCH 11/25] xen: Xen SMP guest support
This is a fairly straightforward Xen implementation of smp_ops. One thing this must do is carefully set up all the various sibling and core maps so that the smp scheduler setup works properly (the setup is very simple, since vcpus don't have any siblings or multiple cores). Xen has its own IPI mechanisms, and has no dependency on any APIC-based IPI. The smp_ops hooks and the flush_tlb_others pv_op allow a Xen guest to avoid all APIC code in arch/i386 (the only apic operation is a single apic_read for the apic version number). One subtle point which needs to be addressed is unpinning pagetables when another cpu may have a lazy tlb reference to the pagetable. Xen will not allow an in-use pagetable to be unpinned, so we must find any other cpus with a reference to the pagetable and get them to shoot down their references. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Cc: Benjamin LaHaise [EMAIL PROTECTED] Cc: Ingo Molnar [EMAIL PROTECTED] Cc: Andi Kleen [EMAIL PROTECTED] --- arch/i386/kernel/smp.c | 16 arch/i386/kernel/smpboot.c | 4 arch/i386/xen/Makefile | 6 arch/i386/xen/enlighten.c | 118 - arch/i386/xen/events.c | 78 +++ arch/i386/xen/mmu.c | 66 ++- arch/i386/xen/mmu.h | 9 arch/i386/xen/setup.c | 9 arch/i386/xen/smp.c | 419 arch/i386/xen/time.c | 9 arch/i386/xen/xen-ops.h | 25 + include/asm-i386/mach-default/irq_vectors_limits.h | 2 include/asm-i386/mmu_context.h | 17 include/asm-i386/processor.h | 1 include/asm-i386/smp.h | 2 include/xen/events.h | 27 + 16 files changed, 730 insertions(+), 78 deletions(-) === --- a/arch/i386/kernel/smp.c +++ b/arch/i386/kernel/smp.c @@ -23,6 +23,7 @@ #include <asm/mtrr.h> #include <asm/tlbflush.h> +#include <asm/mmu_context.h> #include "mach_apic.h" /* @@ -256,21 +257,6 @@ static struct mm_struct * flush_mm; static struct mm_struct * flush_mm; static unsigned long flush_va; static DEFINE_SPINLOCK(tlbstate_lock); - -/* - * We cannot call mmdrop() because we are in interrupt context, - * instead update mm->cpu_vm_mask. 
- * - * We need to reload %cr3 since the page tables may be going - * away from under us.. - */ -static inline void leave_mm (unsigned long cpu) -{ - if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK) - BUG(); - cpu_clear(cpu, per_cpu(cpu_tlbstate, cpu).active_mm-cpu_vm_mask); - load_cr3(swapper_pg_dir); -} /* * === --- a/arch/i386/kernel/smpboot.c +++ b/arch/i386/kernel/smpboot.c @@ -151,7 +151,7 @@ void __init smp_alloc_memory(void) * a given CPU */ -static void __cpuinit smp_store_cpu_info(int id) +void __cpuinit smp_store_cpu_info(int id) { struct cpuinfo_x86 *c = cpu_data + id; @@ -785,7 +785,7 @@ static inline struct task_struct * alloc /* Initialize the CPU's GDT. This is either the boot CPU doing itself (still using the master per-cpu area), or a CPU doing it for a secondary which will soon come up. */ -static __cpuinit void init_gdt(int cpu) +__cpuinit void init_gdt(int cpu) { struct desc_struct *gdt = get_cpu_gdt_table(cpu); === --- a/arch/i386/xen/Makefile +++ b/arch/i386/xen/Makefile @@ -1,2 +1,4 @@ obj-y := enlighten.o setup.o events.o t -obj-y := enlighten.o setup.o events.o time.o \ - features.o mmu.o multicalls.o +obj-y := enlighten.o setup.o events.o time.o \ + features.o mmu.o multicalls.o + +obj-$(CONFIG_SMP) += smp.o === --- a/arch/i386/xen/enlighten.c +++ b/arch/i386/xen/enlighten.c @@ -13,6 +13,7 @@ #include linux/highmem.h #include xen/interface/xen.h +#include xen/interface/vcpu.h #include xen/features.h #include xen/page.h @@ -25,6 +26,8 @@ #include asm/setup.h #include asm/desc.h #include asm/pgtable.h +#include asm/smp.h +#include asm/tlbflush.h #include xen-ops.h #include mmu.h @@ -44,7 +47,7 @@ struct start_info *xen_start_info; struct start_info *xen_start_info; EXPORT_SYMBOL_GPL(xen_start_info); -static void xen_vcpu_setup(int cpu) +void xen_vcpu_setup(int cpu) { per_cpu(xen_vcpu, cpu) = HYPERVISOR_shared_info-vcpu_info[cpu]; } @@ -152,10 +155,10 @@ static void xen_safe_halt(void) static void xen_halt(void) { -#if 0 if 
(irqs_disabled()) HYPERVISOR_vcpu_op(VCPUOP_down,
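The unpinning subtlety described in the intro above reduces to cpumask bookkeeping: before unpinning, find every other cpu whose lazy-tlb state still references the mm and make it drop that reference (via IPI in the real code). A userspace sketch of that invariant, with cpu numbers, helper names, and the direct-call "IPI" all illustrative:

```c
#include <assert.h>

/* A cpumask here is just a bitmask of cpus referencing the mm. */
typedef unsigned long cpumask_sketch;

static cpumask_sketch mm_cpu_vm_mask;

static void cpu_enter_lazy(int cpu) { mm_cpu_vm_mask |= 1UL << cpu; }

static void drop_mm_ref(int cpu)    { mm_cpu_vm_mask &= ~(1UL << cpu); }

/* Ask each referencing cpu (other than `self`) to drop its reference;
 * in the real code this is a cross-cpu IPI, here it is a direct call. */
static void drop_other_mm_refs(int self, int ncpus)
{
    for (int cpu = 0; cpu < ncpus; cpu++)
        if (cpu != self && (mm_cpu_vm_mask & (1UL << cpu)))
            drop_mm_ref(cpu);
}

/* Xen refuses to unpin an in-use pagetable, so unpinning is only safe
 * once no cpu other than the caller still references the mm. */
static int safe_to_unpin(int self)
{
    return (mm_cpu_vm_mask & ~(1UL << self)) == 0;
}
```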
[PATCH 22/25] xen: xen-netfront: use skb.cb for storing private data
Netfront's use of nh.raw and h.raw for storing page+offset is a bit hinky, and it breaks with upcoming network stack updates which reduce these fields to sub-pointer sizes. Fortunately, skb offers the cb field specifically for stashing this kind of info, so use it. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Cc: Herbert Xu [EMAIL PROTECTED] Cc: Chris Wright [EMAIL PROTECTED] Cc: Christian Limpach [EMAIL PROTECTED] --- drivers/net/xen-netfront.c | 18 +- 1 file changed, 13 insertions(+), 5 deletions(-) === --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -52,6 +52,13 @@ #include xen/page.h #include xen/grant_table.h +struct netfront_cb { + struct page *page; + unsigned offset; +}; + +#define NETFRONT_SKB_CB(skb) ((struct netfront_cb *)((skb)-cb)) + /* * Mutually-exclusive module options to select receive data path: * rx_copy : Packets are copied by network backend into local memory @@ -944,10 +951,11 @@ static void handle_incoming_queue(struct struct sk_buff *skb; while ((skb = __skb_dequeue(rxq)) != NULL) { - struct page *page = (struct page *)skb-nh.raw; + struct page *page = NETFRONT_SKB_CB(skb)-page; void *vaddr = page_address(page); - - memcpy(skb-data, vaddr + (skb-h.raw - skb-nh.raw), + unsigned offset = NETFRONT_SKB_CB(skb)-offset; + + memcpy(skb-data, vaddr + offset, skb_headlen(skb)); if (page != skb_shinfo(skb)-frags[0].page) @@ -1251,8 +1259,8 @@ err: } } - skb-nh.raw = (void *)skb_shinfo(skb)-frags[0].page; - skb-h.raw = skb-nh.raw + rx-offset; + NETFRONT_SKB_CB(skb)-page = skb_shinfo(skb)-frags[0].page; + NETFRONT_SKB_CB(skb)-offset = rx-offset; len = rx-status; if (len RX_COPY_THRESHOLD) -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
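The skb->cb trick in this patch is simply overlaying a private struct on the control-block scratch array, exactly as the NETFRONT_SKB_CB macro does. A userspace sketch (the 48-byte size matches sk_buff::cb, but the surrounding types are stand-ins):

```c
#include <assert.h>
#include <stddef.h>

/* Simulated skb: only the per-layer control-block scratch space. */
struct fake_skb {
    char cb[48];
};

/* The private data netfront wants to stash per skb. */
struct netfront_cb_sketch {
    void *page;           /* backing page of the fragment */
    unsigned int offset;  /* offset of the data within it */
};

#define NETFRONT_SKB_CB(skb) ((struct netfront_cb_sketch *)((skb)->cb))

/* Compile-time guard the real code relies on implicitly: the private
 * struct must fit inside cb[] (array size is negative if it doesn't). */
typedef char cb_fits[sizeof(struct netfront_cb_sketch) <= 48 ? 1 : -1];
```

Unlike the old nh.raw/h.raw abuse, this survives the network-stack change that shrank those fields, because cb[] is explicitly reserved for whichever layer currently owns the skb.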
Re: [PATCH 04/25] xen: Add XEN config options
Andi Kleen wrote: On Monday 23 April 2007 23:56:42 Jeremy Fitzhardinge wrote: The XEN config option enables the Xen paravirt_ops interface, which is installed when the kernel finds itself running under Xen. Xen is no longer a sub-architecture, so the X86_XEN subarch config option has gone. Xen is currently incompatible with PREEMPT, but this is fixed up later in the series. Shouldn't this be after the change that adds arch/i386/xen/Kconfig? Otherwise you break bisects. It should be OK. The series should build and run at each patch (though I have to admit I haven't tested this). In general I've been adding config options for each feature as the feature itself is added. J
[PATCH 16/25] xen: Use the hvc console infrastructure for Xen console
Implement a Xen back-end for hvc console. From: Gerd Hoffmann [EMAIL PROTECTED] Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] --- arch/i386/xen/Kconfig |1 arch/i386/xen/events.c|3 - drivers/Makefile |3 + drivers/xen/Makefile |1 drivers/xen/hvc-console.c | 134 + include/xen/events.h |1 6 files changed, 142 insertions(+), 1 deletion(-) === --- a/arch/i386/xen/Kconfig +++ b/arch/i386/xen/Kconfig @@ -5,6 +5,7 @@ config XEN config XEN bool Enable support for Xen hypervisor depends on PARAVIRT + select HVC_DRIVER default y help This is the Linux Xen port. === --- a/arch/i386/xen/events.c +++ b/arch/i386/xen/events.c @@ -219,7 +219,7 @@ static int find_unbound_irq(void) return irq; } -static int bind_evtchn_to_irq(unsigned int evtchn) +int bind_evtchn_to_irq(unsigned int evtchn) { int irq; @@ -244,6 +244,7 @@ static int bind_evtchn_to_irq(unsigned i return irq; } +EXPORT_SYMBOL_GPL(bind_evtchn_to_irq); static int bind_ipi_to_irq(unsigned int ipi, unsigned int cpu) { === --- a/drivers/Makefile +++ b/drivers/Makefile @@ -14,6 +14,9 @@ obj-$(CONFIG_ACPI)+= acpi/ # was used and do nothing if so obj-$(CONFIG_PNP) += pnp/ obj-$(CONFIG_ARM_AMBA) += amba/ + +# Xen is the default console when running as a guest +obj-$(CONFIG_XEN) += xen/ # char/ comes before serial/ etc so that the VT console is the boot-time # default. === --- /dev/null +++ b/drivers/xen/Makefile @@ -0,0 +1,1 @@ +obj-y += hvc-console.o === --- /dev/null +++ b/drivers/xen/hvc-console.c @@ -0,0 +1,134 @@ +/* + * xen console driver interface to hvc_console.c + * + * (c) 2007 Gerd Hoffmann [EMAIL PROTECTED] + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/console.h +#include linux/delay.h +#include linux/err.h +#include linux/init.h +#include linux/types.h + +#include asm/xen/hypervisor.h +#include xen/page.h +#include xen/events.h +#include xen/interface/io/console.h + +#include ../char/hvc_console.h + +#define HVC_COOKIE 0x58656e /* Xen in hex */ + +static struct hvc_struct *hvc; +static int xencons_irq; + +/* -- */ + +static inline struct xencons_interface *xencons_interface(void) +{ + return mfn_to_virt(xen_start_info-console.domU.mfn); +} + +static inline void notify_daemon(void) +{ + /* Use evtchn: this is called early, before irq is set up. 
*/ + notify_remote_via_evtchn(xen_start_info-console.domU.evtchn); +} + +static int write_console(uint32_t vtermno, const char *data, int len) +{ + struct xencons_interface *intf = xencons_interface(); + XENCONS_RING_IDX cons, prod; + int sent = 0; + + cons = intf-out_cons; + prod = intf-out_prod; + mb(); + BUG_ON((prod - cons) sizeof(intf-out)); + + while ((sent len) ((prod - cons) sizeof(intf-out))) + intf-out[MASK_XENCONS_IDX(prod++, intf-out)] = data[sent++]; + + wmb(); + intf-out_prod = prod; + + notify_daemon(); + return sent; +} + +static int read_console(uint32_t vtermno, char *buf, int len) +{ + struct xencons_interface *intf = xencons_interface(); + XENCONS_RING_IDX cons, prod; + int recv = 0; + + cons = intf-in_cons; + prod = intf-in_prod; + mb(); + BUG_ON((prod - cons) sizeof(intf-in)); + + while (cons != prod recv len) + buf[recv++] = intf-in[MASK_XENCONS_IDX(cons++,intf-in)]; + + mb(); + intf-in_cons = cons; + + notify_daemon(); + return recv; +} + +static struct hv_ops hvc_ops = { + .get_chars = read_console, + .put_chars = write_console, +}; + +static int __init xen_init(void) +{ + struct hvc_struct *hp; + + if (!is_running_on_xen()) + return 0; + +
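write_console above is a classic free-running-index ring: producer and consumer indices increment without bound and are masked into a power-of-two buffer, so `prod - cons` gives the occupancy even across wraparound. A userspace sketch of that write path (buffer size and names illustrative; no memory barriers, since there is no peer here):

```c
#include <assert.h>
#include <string.h>

#define RING_SIZE 16   /* must be a power of two for the mask trick */
#define MASK_IDX(idx) ((idx) & (RING_SIZE - 1))

struct ring_sketch {
    char buf[RING_SIZE];
    unsigned int cons, prod;   /* free-running indices */
};

static int ring_write(struct ring_sketch *r, const char *data, int len)
{
    int sent = 0;
    unsigned int cons = r->cons;
    unsigned int prod = r->prod;

    /* (prod - cons) counts occupied slots even after the indices wrap,
     * because unsigned subtraction wraps the same way. */
    while (sent < len && (prod - cons) < RING_SIZE)
        r->buf[MASK_IDX(prod++)] = data[sent++];

    r->prod = prod;   /* real code does wmb() before publishing */
    return sent;
}
```

The garbled comparisons in the extracted diff (`(prod - cons) sizeof(intf-out)`) are this same pattern: the BUG_ON checks `(prod - cons) > sizeof(intf->out)` and the loop runs while the ring has room.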
[PATCH 23/25] xen: Lockdep fixes for xen-netfront
netfront contains two locking problems found by lockdep: 1. rx_lock is a normal spinlock, and tx_lock is an irq spinlock. This means that in normal use, tx_lock may be taken by an interrupt routine while rx_lock is held. However, netif_disconnect_backend takes them in the order tx_lock-rx_lock, which could lead to a deadlock. Reverse them 2. rx_lock can also be taken in softirq context, so it should be taken/released with spin_(un)lock_bh. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Cc: Chris Wright [EMAIL PROTECTED] Cc: Christian Limpach [EMAIL PROTECTED] --- drivers/net/xen-netfront.c | 30 +++--- 1 file changed, 15 insertions(+), 15 deletions(-) === --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -515,14 +515,14 @@ static int network_open(struct net_devic memset(np-stats, 0, sizeof(np-stats)); - spin_lock(np-rx_lock); + spin_lock_bh(np-rx_lock); if (netfront_carrier_ok(np)) { network_alloc_rx_buffers(dev); np-rx.sring-rsp_event = np-rx.rsp_cons + 1; if (RING_HAS_UNCONSUMED_RESPONSES(np-rx)) netif_rx_schedule(dev); } - spin_unlock(np-rx_lock); + spin_unlock_bh(np-rx_lock); network_maybe_wake_tx(dev); @@ -1212,10 +1212,10 @@ static int netif_poll(struct net_device int pages_flipped = 0; int err; - spin_lock(np-rx_lock); + spin_lock_bh(np-rx_lock); if (unlikely(!netfront_carrier_ok(np))) { - spin_unlock(np-rx_lock); + spin_unlock_bh(np-rx_lock); return 0; } @@ -1356,7 +1356,7 @@ err: local_irq_restore(flags); } - spin_unlock(np-rx_lock); + spin_unlock_bh(np-rx_lock); return more_to_do; } @@ -1399,7 +1399,7 @@ static void netif_release_rx_bufs(struct skb_queue_head_init(free_list); - spin_lock(np-rx_lock); + spin_lock_bh(np-rx_lock); for (id = 0; id NET_RX_RING_SIZE; id++) { if ((ref = np-grant_rx_ref[id]) == GRANT_INVALID_REF) { @@ -1469,7 +1469,7 @@ static void netif_release_rx_bufs(struct while ((skb = __skb_dequeue(free_list)) != NULL) dev_kfree_skb(skb); - spin_unlock(np-rx_lock); + spin_unlock_bh(np-rx_lock); } static int 
network_close(struct net_device *dev) @@ -1579,8 +1579,8 @@ static int network_connect(struct net_de dev_info(dev-dev, has %sing receive path.\n, np-copying_receiver ? copy : flipp); + spin_lock_bh(np-rx_lock); spin_lock_irq(np-tx_lock); - spin_lock(np-rx_lock); /* * Recovery procedure: @@ -1632,8 +1632,8 @@ static int network_connect(struct net_de network_tx_buf_gc(dev); network_alloc_rx_buffers(dev); - spin_unlock(np-rx_lock); spin_unlock_irq(np-tx_lock); + spin_unlock_bh(np-rx_lock); return 0; } @@ -1689,7 +1689,7 @@ static ssize_t store_rxbuf_min(struct de if (target RX_MAX_TARGET) target = RX_MAX_TARGET; - spin_lock(np-rx_lock); + spin_lock_bh(np-rx_lock); if (target np-rx_max_target) np-rx_max_target = target; np-rx_min_target = target; @@ -1698,7 +1698,7 @@ static ssize_t store_rxbuf_min(struct de network_alloc_rx_buffers(netdev); - spin_unlock(np-rx_lock); + spin_unlock_bh(np-rx_lock); return len; } @@ -1732,7 +1732,7 @@ static ssize_t store_rxbuf_max(struct de if (target RX_MAX_TARGET) target = RX_MAX_TARGET; - spin_lock(np-rx_lock); + spin_lock_bh(np-rx_lock); if (target np-rx_min_target) np-rx_min_target = target; np-rx_max_target = target; @@ -1741,7 +1741,7 @@ static ssize_t store_rxbuf_max(struct de network_alloc_rx_buffers(netdev); - spin_unlock(np-rx_lock); + spin_unlock_bh(np-rx_lock); return len; } @@ -1885,11 +1885,11 @@ static void netif_disconnect_backend(str static void netif_disconnect_backend(struct netfront_info *info) { /* Stop old i/f to prevent errors whilst we rebuild the state. 
*/ + spin_lock_bh(info-rx_lock); spin_lock_irq(info-tx_lock); - spin_lock(info-rx_lock); netfront_carrier_off(info); - spin_unlock(info-rx_lock); spin_unlock_irq(info-tx_lock); + spin_unlock_bh(info-rx_lock); if (info-irq) unbind_from_irqhandler(info-irq, info-netdev); -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
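The deadlock described in point 1 is the classic lock-ordering rule: every path must take the two locks in the same order, so the tx_lock-then-rx_lock path in netif_disconnect_backend had to be reversed. A tiny userspace sketch of the discipline, with plain flags standing in for the spinlocks and a counter standing in for lockdep:

```c
#include <assert.h>

static int rx_held, tx_held, order_violations;

static void take_rx(void) { rx_held = 1; }

static void take_tx(void)
{
    /* The agreed order is rx first; taking tx while rx is not held
     * means another path could deadlock against us. */
    if (!rx_held)
        order_violations++;
    tx_held = 1;
}

static void release_both(void) { tx_held = 0; rx_held = 0; }

/* netif_disconnect_backend before the fix: tx then rx */
static void disconnect_old(void) { take_tx(); take_rx(); release_both(); }

/* after the fix: rx (with bh disabled, in the real code) then tx */
static void disconnect_new(void) { take_rx(); take_tx(); release_both(); }
```

Point 2 is orthogonal: because rx_lock is also taken in softirq context, the plain spin_lock calls become spin_lock_bh so a softirq can't interrupt a holder on the same cpu.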
[PATCH 13/25] xen: lazy-mmu operations
This patch uses the lazy-mmu hooks to batch mmu operations where possible. This is primarily useful for batching operations applied to active pagetables, which happens during mprotect, munmap, mremap and the like (mmap does not do bulk pagetable operations, so it isn't helped). Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] --- arch/i386/xen/enlighten.c | 56 +++- arch/i386/xen/mmu.c| 56 arch/i386/xen/multicalls.c |4 +-- 3 files changed, 78 insertions(+), 38 deletions(-) === --- a/arch/i386/xen/enlighten.c +++ b/arch/i386/xen/enlighten.c @@ -451,28 +451,38 @@ static void xen_apic_write(unsigned long static void xen_flush_tlb(void) { - struct mmuext_op op; - - op.cmd = MMUEXT_TLB_FLUSH_LOCAL; - if (HYPERVISOR_mmuext_op(op, 1, NULL, DOMID_SELF)) - BUG(); + struct mmuext_op *op; + struct multicall_space mcs = xen_mc_entry(sizeof(*op)); + + op = mcs.args; + op-cmd = MMUEXT_TLB_FLUSH_LOCAL; + MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); + + xen_mc_issue(PARAVIRT_LAZY_MMU); } static void xen_flush_tlb_single(unsigned long addr) { - struct mmuext_op op; - - op.cmd = MMUEXT_INVLPG_LOCAL; - op.arg1.linear_addr = addr PAGE_MASK; - if (HYPERVISOR_mmuext_op(op, 1, NULL, DOMID_SELF)) - BUG(); + struct mmuext_op *op; + struct multicall_space mcs = xen_mc_entry(sizeof(*op)); + + op = mcs.args; + op-cmd = MMUEXT_INVLPG_LOCAL; + op-arg1.linear_addr = addr PAGE_MASK; + MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); + + xen_mc_issue(PARAVIRT_LAZY_MMU); } static void xen_flush_tlb_others(const cpumask_t *cpus, struct mm_struct *mm, unsigned long va) { - struct mmuext_op op; + struct { + struct mmuext_op op; + cpumask_t mask; + } *args; cpumask_t cpumask = *cpus; + struct multicall_space mcs; /* * A couple of (to be removed) sanity checks: @@ -489,17 +499,21 @@ static void xen_flush_tlb_others(const c if (cpus_empty(cpumask)) return; + mcs = xen_mc_entry(sizeof(*args)); + args = mcs.args; + args-mask = cpumask; + args-op.arg2.vcpumask = args-mask; + if (va == 
TLB_FLUSH_ALL) { - op.cmd = MMUEXT_TLB_FLUSH_MULTI; - op.arg2.vcpumask = (void *)cpus; + args-op.cmd = MMUEXT_TLB_FLUSH_MULTI; } else { - op.cmd = MMUEXT_INVLPG_MULTI; - op.arg1.linear_addr = va; - op.arg2.vcpumask = (void *)cpus; - } - - if (HYPERVISOR_mmuext_op(op, 1, NULL, DOMID_SELF)) - BUG(); + args-op.cmd = MMUEXT_INVLPG_MULTI; + args-op.arg1.linear_addr = va; + } + + MULTI_mmuext_op(mcs.mc, args-op, 1, NULL, DOMID_SELF); + + xen_mc_issue(PARAVIRT_LAZY_MMU); } static unsigned long xen_read_cr2(void) === --- a/arch/i386/xen/mmu.c +++ b/arch/i386/xen/mmu.c @@ -56,12 +56,20 @@ void make_lowmem_page_readwrite(void *va void xen_set_pmd(pmd_t *ptr, pmd_t val) { - struct mmu_update u; - - u.ptr = virt_to_machine(ptr).maddr; - u.val = pmd_val_ma(val); - if (HYPERVISOR_mmu_update(u, 1, NULL, DOMID_SELF) 0) - BUG(); + struct multicall_space mcs; + struct mmu_update *u; + + preempt_disable(); + + mcs = xen_mc_entry(sizeof(*u)); + u = mcs.args; + u-ptr = virt_to_machine(ptr).maddr; + u-val = pmd_val_ma(val); + MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF); + + xen_mc_issue(PARAVIRT_LAZY_MMU); + + preempt_enable(); } /* @@ -104,20 +112,38 @@ void xen_set_pte_at(struct mm_struct *mm void xen_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval) { - if ((mm != current-mm mm != init_mm) || - HYPERVISOR_update_va_mapping(addr, pteval, 0) != 0) - xen_set_pte(ptep, pteval); + if (mm == current-mm || mm == init_mm) { + if (xen_get_lazy_mode() == PARAVIRT_LAZY_MMU) { + struct multicall_space mcs; + mcs = xen_mc_entry(0); + + MULTI_update_va_mapping(mcs.mc, addr, pteval, 0); + xen_mc_issue(PARAVIRT_LAZY_MMU); + return; + } else + if (HYPERVISOR_update_va_mapping(addr, pteval, 0) == 0) + return; + } + xen_set_pte(ptep, pteval); } #ifdef CONFIG_X86_PAE void xen_set_pud(pud_t *ptr, pud_t val) { - struct mmu_update u; - - u.ptr =
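The pattern this patch applies everywhere is the same: instead of one hypercall per mmu update, each operation is queued into a multicall buffer and the whole batch is issued at once, either when the buffer fills or when the lazy-mmu section ends. A userspace sketch of that batching (buffer size and names illustrative; the counter stands in for actual HYPERVISOR_* calls):

```c
#include <assert.h>

#define BATCH_MAX 8

static int batch[BATCH_MAX];
static int batch_len;
static int hypercalls_issued;   /* each flush = one real hypercall */

static void mc_flush(void)
{
    if (batch_len) {
        hypercalls_issued++;    /* one call covers the whole batch */
        batch_len = 0;
    }
}

static void mc_queue(int op)
{
    if (batch_len == BATCH_MAX)
        mc_flush();             /* buffer full: issue what we have */
    batch[batch_len++] = op;
}
```

This is why the win shows up in mprotect/munmap/mremap: those walk many ptes inside one lazy-mmu section, so dozens of updates collapse into a handful of hypercalls, while mmap (no bulk pagetable writes) gains nothing.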
[PATCH 17/25] xen: Add early printk support via hvc console
Add early printk support via hvc console, enable using earlyprintk=xen on the kernel command line. From: Gerd Hoffmann [EMAIL PROTECTED] Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Acked-by: Ingo Molnar [EMAIL PROTECTED] --- arch/x86_64/kernel/early_printk.c | 5 + drivers/xen/hvc-console.c | 25 + include/xen/hvc-console.h | 6 ++ 3 files changed, 36 insertions(+) === --- a/arch/x86_64/kernel/early_printk.c +++ b/arch/x86_64/kernel/early_printk.c @@ -6,6 +6,7 @@ #include <asm/io.h> #include <asm/processor.h> #include <asm/fcntl.h> +#include <xen/hvc-console.h> /* Simple VGA output */ @@ -243,6 +244,10 @@ static int __init setup_early_printk(cha simnow_init(buf + 6); early_console = &simnow_console; keep_early = 1; +#ifdef CONFIG_XEN + } else if (!strncmp(buf, "xen", 3)) { + early_console = &xenboot_console; +#endif } register_console(early_console); return 0; === --- a/drivers/xen/hvc-console.c +++ b/drivers/xen/hvc-console.c @@ -28,6 +28,7 @@ #include <xen/page.h> #include <xen/events.h> #include <xen/interface/io/console.h> +#include <xen/hvc-console.h> #include "../char/hvc_console.h" @@ -132,3 +133,27 @@ module_init(xen_init); module_init(xen_init); module_exit(xen_fini); console_initcall(xen_cons_init); + +static void xenboot_write_console(struct console *console, const char *string, + unsigned len) +{ + unsigned int linelen, off = 0; + const char *pos; + + while (off < len && NULL != (pos = strchr(string+off, '\n'))) { + linelen = pos-string+off; + if (off + linelen > len) + break; + write_console(0, string+off, linelen); + write_console(0, "\r\n", 2); + off += linelen + 1; + } + if (off < len) + write_console(0, string+off, len-off); +} + +struct console xenboot_console = { + .name = "xenboot", + .write = xenboot_write_console, + .flags = CON_PRINTBUFFER | CON_BOOT, +};  === --- /dev/null +++ b/include/xen/hvc-console.h @@ -0,0 +1,6 @@ +#ifndef XEN_HVC_CONSOLE_H +#define XEN_HVC_CONSOLE_H + +extern struct console xenboot_console; + +#endif /* XEN_HVC_CONSOLE_H */
[PATCH 03/25] xen: Add nosegneg capability to the vsyscall page notes
Add the nosegneg fake capabilty to the vsyscall page notes. This is used by the runtime linker to select a glibc version which then disables negative-offset accesses to the thread-local segment via %gs. These accesses require emulation in Xen (because segments are truncated to protect the hypervisor address space) and avoiding them provides a measurable performance boost. Signed-off-by: Ian Pratt [EMAIL PROTECTED] Signed-off-by: Christian Limpach [EMAIL PROTECTED] Signed-off-by: Chris Wright [EMAIL PROTECTED] Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Acked-by: Zachary Amsden [EMAIL PROTECTED] Cc: Roland McGrath [EMAIL PROTECTED] Cc: Ulrich Drepper [EMAIL PROTECTED] --- arch/i386/kernel/vsyscall-note.S | 28 1 file changed, 28 insertions(+) === --- a/arch/i386/kernel/vsyscall-note.S +++ b/arch/i386/kernel/vsyscall-note.S @@ -23,3 +24,31 @@ 3: .balign 4; /* pad out section */ ASM_ELF_NOTE_BEGIN(.note.kernel-version, a, UTS_SYSNAME, 0) .long LINUX_VERSION_CODE ASM_ELF_NOTE_END + +#ifdef CONFIG_XEN +/* + * Add a special note telling glibc's dynamic linker a fake hardware + * flavor that it will use to choose the search path for libraries in the + * same way it uses real hardware capabilities like mmx. + * We supply nosegneg as the fake capability, to indicate that we + * do not like negative offsets in instructions using segment overrides, + * since we implement those inefficiently. This makes it possible to + * install libraries optimized to avoid those access patterns in someplace + * like /lib/i686/tls/nosegneg. Note that an /etc/ld.so.conf.d/file + * corresponding to the bits here is needed to make ldconfig work right. + * It should contain: + * hwcap 0 nosegneg + * to match the mapping of bit to name that we give here. 
+ */ +#define NOTE_KERNELCAP_BEGIN(ncaps, mask) \ + ASM_ELF_NOTE_BEGIN(.note.kernelcap, a, GNU, 2) \ + .long ncaps, mask +#define NOTE_KERNELCAP(bit, name) \ + .byte bit; .asciz name +#define NOTE_KERNELCAP_END ASM_ELF_NOTE_END + +NOTE_KERNELCAP_BEGIN(1, 2) +NOTE_KERNELCAP(1, nosegneg) +NOTE_KERNELCAP_END +#endif + -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 15/25] xen: xen time fixups
1. make sure timer state is set up before bringing up CPU 2. make sure snapshot of 64-bit time values is atomic Be sure, however, that the clockevent source is registered on its home CPU. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] --- arch/i386/xen/smp.c |4 +- arch/i386/xen/time.c| 93 +++ arch/i386/xen/xen-ops.h |3 + 3 files changed, 67 insertions(+), 33 deletions(-) === --- a/arch/i386/xen/smp.c +++ b/arch/i386/xen/smp.c @@ -78,10 +78,11 @@ static __cpuinit void cpu_bringup_and_id int cpu = smp_processor_id(); cpu_init(); - xen_setup_timer(); preempt_disable(); per_cpu(cpu_state, cpu) = CPU_ONLINE; + + xen_setup_cpu_clockevents(); /* We can take interrupts now: we're officially up. */ local_irq_enable(); @@ -275,6 +276,7 @@ int __cpuinit xen_cpu_up(unsigned int cp per_cpu(current_task, cpu) = idle; xen_vcpu_setup(cpu); irq_ctx_init(cpu); + xen_setup_timer(cpu); /* make sure interrupts start blocked */ per_cpu(xen_vcpu, cpu)-evtchn_upcall_mask = 1; === --- a/arch/i386/xen/time.c +++ b/arch/i386/xen/time.c @@ -40,6 +40,35 @@ static DEFINE_PER_CPU(u64, residual_stol static DEFINE_PER_CPU(u64, residual_stolen); static DEFINE_PER_CPU(u64, residual_blocked); +/* return an consistent snapshot of 64-bit time/counter value */ +static u64 get64(const u64 *p) +{ + u64 ret; + + if (BITS_PER_LONG 64) { + u32 *p32 = (u32 *)p; + u32 h, l; + + /* +* Read high then low, and then make sure high is +* still the same; this will only loop if low wraps +* and carries into high. +* XXX some clean way to make this endian-proof? 
+*/ + do { + h = p32[1]; + barrier(); + l = p32[0]; + barrier(); + } while (p32[1] != h); + + ret = (((u64)h) 32) | l; + } else + ret = *p; + + return ret; +} + /* * Runstate accounting */ @@ -53,24 +82,22 @@ static void get_runstate_snapshot(struct state = __get_cpu_var(runstate); do { - state_time = state-state_entry_time; + state_time = get64(state-state_entry_time); barrier(); *res = *state; barrier(); - } while(state-state_entry_time != state_time); -} - -static void setup_runstate_info(void) + } while(get64(state-state_entry_time) != state_time); +} + +static void setup_runstate_info(int cpu) { struct vcpu_register_runstate_memory_area area; - area.addr.v = __get_cpu_var(runstate); + area.addr.v = per_cpu(runstate, cpu); if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, - smp_processor_id(), area)) + cpu, area)) BUG(); - - get_runstate_snapshot(__get_cpu_var(runstate_snapshot)); } static void do_stolen_accounting(void) @@ -185,12 +212,10 @@ unsigned long xen_cpu_khz(void) * Reads a consistent set of time-base values from Xen, into a shadow data * area. 
*/ -static void get_time_values_from_xen(void) +static unsigned get_time_values_from_xen(void) { struct vcpu_time_info *src; struct shadow_time_info *dst; - - preempt_disable(); src = __get_cpu_var(xen_vcpu)-time; dst = __get_cpu_var(shadow_time); @@ -205,7 +230,7 @@ static void get_time_values_from_xen(voi rmb(); } while ((src-version 1) | (dst-version ^ src-version)); - preempt_enable(); + return dst-version; } /* @@ -249,7 +274,7 @@ static u64 get_nsec_offset(struct shadow static u64 get_nsec_offset(struct shadow_time_info *shadow) { u64 now, delta; - rdtscll(now); + now = native_read_tsc(); delta = now - shadow-tsc_timestamp; return scale_delta(delta, shadow-tsc_to_nsec_mul, shadow-tsc_shift); } @@ -258,10 +283,14 @@ static cycle_t xen_clocksource_read(void { struct shadow_time_info *shadow = get_cpu_var(shadow_time); cycle_t ret; - - get_time_values_from_xen(); - - ret = shadow-system_timestamp + get_nsec_offset(shadow); + unsigned version; + + do { + version = get_time_values_from_xen(); + barrier(); + ret = shadow-system_timestamp + get_nsec_offset(shadow); + barrier(); + } while(version != __get_cpu_var(xen_vcpu)-time.version); put_cpu_var(shadow_time); @@ -483,9 +512,8 @@ static irqreturn_t xen_timer_interrupt(i return ret; } -void
[PATCH 12/25] xen: Add support for preemption
Add Xen support for preemption. This is mostly a cleanup of existing preempt_enable/disable calls, or just comments to explain the current usage. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] --- arch/i386/xen/Kconfig |2 arch/i386/xen/enlighten.c | 93 arch/i386/xen/mmu.c|4 + arch/i386/xen/multicalls.c | 11 ++--- arch/i386/xen/time.c | 22 -- 5 files changed, 88 insertions(+), 44 deletions(-) === --- a/arch/i386/xen/Kconfig +++ b/arch/i386/xen/Kconfig @@ -4,7 +4,7 @@ config XEN bool Enable support for Xen hypervisor - depends on PARAVIRT !PREEMPT + depends on PARAVIRT default y help This is the Linux Xen port. === --- a/arch/i386/xen/enlighten.c +++ b/arch/i386/xen/enlighten.c @@ -2,6 +2,7 @@ #include linux/init.h #include linux/smp.h #include linux/preempt.h +#include linux/hardirq.h #include linux/percpu.h #include linux/delay.h #include linux/start_kernel.h @@ -92,11 +93,10 @@ static unsigned long xen_save_fl(void) struct vcpu_info *vcpu; unsigned long flags; - preempt_disable(); vcpu = x86_read_percpu(xen_vcpu); + /* flag has opposite sense of mask */ flags = !vcpu-evtchn_upcall_mask; - preempt_enable(); /* convert to IF type flag -0 - 0x @@ -109,41 +109,56 @@ static void xen_restore_fl(unsigned long { struct vcpu_info *vcpu; - preempt_disable(); - /* convert from IF type flag */ flags = !(flags X86_EFLAGS_IF); + + /* There's a one instruction preempt window here. We need to + make sure we're don't switch CPUs between getting the vcpu + pointer and updating the mask. */ + preempt_disable(); vcpu = x86_read_percpu(xen_vcpu); vcpu-evtchn_upcall_mask = flags; + preempt_enable_no_resched(); + + /* Doesn't matter if we get preempted here, because any + pending event will get dealt with anyway. 
*/ + if (flags == 0) { + preempt_check_resched(); barrier(); /* unmask then check (avoid races) */ if (unlikely(vcpu-evtchn_upcall_pending)) force_evtchn_callback(); - preempt_enable(); - } else - preempt_enable_no_resched(); + } } static void xen_irq_disable(void) { + /* There's a one instruction preempt window here. We need to + make sure we're don't switch CPUs between getting the vcpu + pointer and updating the mask. */ + preempt_disable(); + x86_read_percpu(xen_vcpu)-evtchn_upcall_mask = 1; + preempt_enable_no_resched(); +} + +static void xen_irq_enable(void) +{ struct vcpu_info *vcpu; - preempt_disable(); - vcpu = x86_read_percpu(xen_vcpu); - vcpu-evtchn_upcall_mask = 1; - preempt_enable_no_resched(); -} - -static void xen_irq_enable(void) -{ - struct vcpu_info *vcpu; - + + /* There's a one instruction preempt window here. We need to + make sure we're don't switch CPUs between getting the vcpu + pointer and updating the mask. */ preempt_disable(); vcpu = x86_read_percpu(xen_vcpu); vcpu-evtchn_upcall_mask = 0; + preempt_enable_no_resched(); + + /* Doesn't matter if we get preempted here, because any + pending event will get dealt with anyway. 
*/ + barrier(); /* unmask then check (avoid races) */ if (unlikely(vcpu-evtchn_upcall_pending)) force_evtchn_callback(); - preempt_enable(); } static void xen_safe_halt(void) @@ -163,6 +178,8 @@ static void xen_halt(void) static void xen_set_lazy_mode(enum paravirt_lazy_mode mode) { + BUG_ON(preemptible()); + switch(mode) { case PARAVIRT_LAZY_NONE: BUG_ON(x86_read_percpu(xen_lazy_mode) == PARAVIRT_LAZY_NONE); @@ -262,12 +279,17 @@ static void xen_write_ldt_entry(struct d xmaddr_t mach_lp = virt_to_machine(lp); u64 entry = (u64)high 32 | low; + preempt_disable(); + xen_mc_flush(); if (HYPERVISOR_update_descriptor(mach_lp.maddr, entry)) BUG(); -} - -static int cvt_gate_to_trap(int vector, u32 low, u32 high, struct trap_info *info) + + preempt_enable(); +} + +static int cvt_gate_to_trap(int vector, u32 low, u32 high, + struct trap_info *info) { u8 type, dpl; @@ -295,11 +317,13 @@ static DEFINE_PER_CPU(struct Xgt_desc_st also update Xen. */ static void xen_write_idt_entry(struct desc_struct *dt, int entrynum, u32 low, u32 high) { - - int cpu = smp_processor_id(); unsigned long p = (unsigned long)dt[entrynum]; - unsigned long start =
[PATCH 14/25] xen: deal with negative stolen time
Stolen time should never be negative; if it ever is, it probably indicates some other bug. However, if it does happen, then it's better to just clamp it at zero, rather than trying to account for it as a huge positive number. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] --- arch/i386/xen/time.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) === --- a/arch/i386/xen/time.c +++ b/arch/i386/xen/time.c @@ -77,7 +77,7 @@ static void do_stolen_accounting(void) { struct vcpu_runstate_info state; struct vcpu_runstate_info *snap; - u64 blocked, runnable, offline, stolen; + s64 blocked, runnable, offline, stolen; cputime_t ticks; get_runstate_snapshot(&state); @@ -97,6 +97,10 @@ static void do_stolen_accounting(void) including any left-overs from last time. Passing NULL to account_steal_time accounts the time as stolen. */ stolen = runnable + offline + __get_cpu_var(residual_stolen); + + if (stolen < 0) + stolen = 0; + ticks = 0; while (stolen >= NS_PER_TICK) { ticks++; @@ -109,6 +113,10 @@ static void do_stolen_accounting(void) including any left-overs from last time. Passing idle to account_steal_time accounts the time as idle/wait. 
*/ + blocked += __get_cpu_var(residual_blocked); + + if (blocked < 0) + blocked = 0; + ticks = 0; while (blocked >= NS_PER_TICK) { ticks++; @@ -127,7 +135,8 @@ unsigned long long xen_sched_clock(void) { struct vcpu_runstate_info state; cycle_t now; - unsigned long long ret; + u64 ret; + s64 offset; /* * Ideally sched_clock should be called on a per-cpu basis @@ -142,9 +151,13 @@ unsigned long long xen_sched_clock(void) WARN_ON(state.state != RUNSTATE_running); + offset = now - state.state_entry_time; + if (offset < 0) + offset = 0; + ret = state.time[RUNSTATE_blocked] + state.time[RUNSTATE_running] + - (now - state.state_entry_time); + offset; preempt_enable(); --
[PATCH 09/25] xen: Account for time stolen by Xen
This accounts for the time Xen steals from our VCPUs. This accounting gets run on each timer interrupt, just as a way to get it run relatively often, and when interesting things are going on. Stolen time is not really used by much in the kernel; it is reported in /proc/stats, and that's about it. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Cc: john stultz [EMAIL PROTECTED] --- arch/i386/xen/time.c | 101 +- 1 file changed, 100 insertions(+), 1 deletion(-) === --- a/arch/i386/xen/time.c +++ b/arch/i386/xen/time.c @@ -2,6 +2,7 @@ #include linux/interrupt.h #include linux/clocksource.h #include linux/clockchips.h +#include linux/kernel_stat.h #include asm/xen/hypervisor.h #include asm/xen/hypercall.h @@ -14,6 +15,7 @@ #define XEN_SHIFT 22 #define TIMER_SLOP 10 /* Xen may fire a timer up to this many ns early */ +#define NS_PER_TICK(10ll / HZ) /* These are perodically updated in shared_info, and then copied here. */ struct shadow_time_info { @@ -26,6 +28,99 @@ struct shadow_time_info { static DEFINE_PER_CPU(struct shadow_time_info, shadow_time); +/* runstate info updated by Xen */ +static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate); + +/* snapshots of runstate info */ +static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate_snapshot); + +/* unused ns of stolen and blocked time */ +static DEFINE_PER_CPU(u64, residual_stolen); +static DEFINE_PER_CPU(u64, residual_blocked); + +/* + * Runstate accounting + */ +static void get_runstate_snapshot(struct vcpu_runstate_info *res) +{ + u64 state_time; + struct vcpu_runstate_info *state; + + preempt_disable(); + + state = __get_cpu_var(runstate); + + do { + state_time = state-state_entry_time; + barrier(); + *res = *state; + barrier(); + } while(state-state_entry_time != state_time); + + preempt_enable(); +} + +static void setup_runstate_info(void) +{ + struct vcpu_register_runstate_memory_area area; + + area.addr.v = __get_cpu_var(runstate); + + if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, + 
smp_processor_id(), area)) + BUG(); + + get_runstate_snapshot(__get_cpu_var(runstate_snapshot)); +} + +static void do_stolen_accounting(void) +{ + struct vcpu_runstate_info state; + struct vcpu_runstate_info *snap; + u64 blocked, runnable, offline, stolen; + cputime_t ticks; + + get_runstate_snapshot(state); + + WARN_ON(state.state != RUNSTATE_running); + + snap = __get_cpu_var(runstate_snapshot); + + /* work out how much time the VCPU has not been runn*ing* */ + blocked = state.time[RUNSTATE_blocked] - snap-time[RUNSTATE_blocked]; + runnable = state.time[RUNSTATE_runnable] - snap-time[RUNSTATE_runnable]; + offline = state.time[RUNSTATE_offline] - snap-time[RUNSTATE_offline]; + + *snap = state; + + /* Add the appropriate number of ticks of stolen time, + including any left-overs from last time. Passing NULL to + account_steal_time accounts the time as stolen. */ + stolen = runnable + offline + __get_cpu_var(residual_stolen); + ticks = 0; + while(stolen = NS_PER_TICK) { + ticks++; + stolen -= NS_PER_TICK; + } + __get_cpu_var(residual_stolen) = stolen; + account_steal_time(NULL, ticks); + + /* Add the appropriate number of ticks of blocked time, + including any left-overs from last time. Passing idle to + account_steal_time accounts the time as idle/wait. 
*/ + blocked += __get_cpu_var(residual_blocked); + ticks = 0; + while(blocked = NS_PER_TICK) { + ticks++; + blocked -= NS_PER_TICK; + } + __get_cpu_var(residual_blocked) = blocked; + account_steal_time(idle_task(smp_processor_id()), ticks); +} + + + +/* Get the CPU speed from Xen */ unsigned long xen_cpu_khz(void) { u64 cpu_khz = 100ULL 32; @@ -338,6 +433,8 @@ static irqreturn_t xen_timer_interrupt(i ret = IRQ_HANDLED; } + do_stolen_accounting(); + return ret; } @@ -363,6 +460,8 @@ static void xen_setup_timer(int cpu) evt-irq = irq; clockevents_register_device(evt); + setup_runstate_info(); + put_cpu_var(xen_clock_events); } @@ -375,7 +474,7 @@ __init void xen_time_init(void) clocksource_register(xen_clocksource); if (HYPERVISOR_vcpu_op(VCPUOP_stop_periodic_timer, cpu, NULL) == 0) { - /* Successfully turned off 100hz tick, so we have the + /* Successfully turned off 100Hz tick, so we have the vcpuop-based timer interface */ printk(KERN_DEBUG Xen: using vcpuop
[PATCH 00/25] xen: Xen implementation for paravirt_ops
Hi Andi, This series of patches implements the Xen paravirt-ops interface. It applies to 2.6.21-rc7 + your patches + the last batch of pv_ops patches I posted. This patch generally restricts itself to Xen-specific parts of the tree, though it does make a few small changes elsewhere. These patches include: - some helper routines for allocating address space and walking pagetables - Xen interface header files - Core Xen implementation - Efficient late-pinning/early-unpinning pagetable handling - Virtualized time, including stolen time - SMP support - Preemption support - Batched pagetable updates - Xen console, based on hvc console - Xenbus - Netfront, the paravirtualized network device - Blockfront, the paravirtualized block device Thanks, J -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 02/25] xen: Allocate and free vmalloc areas
Allocate/destroy a 'vmalloc' VM area: alloc_vm_area and free_vm_area The alloc function ensures that page tables are constructed for the region of kernel virtual address space and mapped into init_mm. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Signed-off-by: Ian Pratt [EMAIL PROTECTED] Signed-off-by: Christian Limpach [EMAIL PROTECTED] Signed-off-by: Chris Wright [EMAIL PROTECTED] Cc: Jan Beulich [EMAIL PROTECTED] Cc: Andi Kleen [EMAIL PROTECTED] --- include/linux/vmalloc.h |4 +++ mm/vmalloc.c| 51 +++ 2 files changed, 55 insertions(+) === --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -68,6 +68,10 @@ extern int map_vm_area(struct vm_struct struct page ***pages); extern void unmap_vm_area(struct vm_struct *area); +/* Allocate/destroy a 'vmalloc' VM area. */ +extern struct vm_struct *alloc_vm_area(unsigned long size); +extern void free_vm_area(struct vm_struct *area); + /* * Internals. Dont't use.. */ === --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -757,3 +757,54 @@ out_einval_locked: } EXPORT_SYMBOL(remap_vmalloc_range); +static int f(pte_t *pte, struct page *pmd_page, unsigned long addr, void *data) +{ + /* apply_to_page_range() does all the hard work. */ + return 0; +} + +/** + * alloc_vm_area - allocate a range of kernel address space + * @size: size of the area + * @returns: NULL on failure, vm_struct on success + * + * This function reserves a range of kernel address space, and + * allocates pagetables to map that range. No actual mappings + * are created. If the kernel address space is not shared + * between processes, it syncs the pagetable across all + * processes. + */ +struct vm_struct *alloc_vm_area(unsigned long size) +{ + struct vm_struct *area; + + area = get_vm_area(size, VM_IOREMAP); + if (area == NULL) + return NULL; + + /* +* This ensures that page tables are constructed for this region +* of kernel virtual address space and mapped into init_mm. 
+*/ + if (apply_to_page_range(init_mm, (unsigned long)area-addr, + area-size, f, NULL)) { + free_vm_area(area); + return NULL; + } + + /* Make sure the pagetables are constructed in process kernel + mappings */ + vmalloc_sync_all(); + + return area; +} +EXPORT_SYMBOL_GPL(alloc_vm_area); + +void free_vm_area(struct vm_struct *area) +{ + struct vm_struct *ret; + ret = remove_vm_area(area-addr); + BUG_ON(ret != area); + kfree(area); +} +EXPORT_SYMBOL_GPL(free_vm_area); -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH -mm 2/3] freezer: Introduce freezer_flags
On 04/24, Rafael J. Wysocki wrote: On Tuesday, 24 April 2007 00:55, Oleg Nesterov wrote: On 04/24, Rafael J. Wysocki wrote: Should I clear it in dup_task_struct() or is there a better place? I personally think we should do this in dup_task_struct(). In fact, I believe it is better to replace the *tsk = *orig; with some helper (like setup_thread_stack() below), and that helper clears ->freezer_flags. Say, copy_task_struct(). Hmm, wouldn't that be overkill? copy_task_struct() would have to do *tsk = *orig anyway, and we only need to clear one field apart from this. Some other fields are cleared towards the end of dup_task_struct(), so perhaps we could clear freezer_flags in there too? Yes. And I strongly believe it is bad we don't have the helper which does some random stuff like p->did_exec = 0. The same for thread_info. Could you answer quickly where do we clear TIF_FREEZE currently? We don't. Oleg.
Re: [PATCH 03/25] xen: Add nosegneg capability to the vsyscall page notes
+ * It should contain: + * hwcap 0 nosegneg + * to match the mapping of bit to name that we give here. This needs to be "hwcap 1 nosegneg" to match: +NOTE_KERNELCAP_BEGIN(1, 2) +NOTE_KERNELCAP(1, "nosegneg") +NOTE_KERNELCAP_END The actual bits you are using should be fine. (You're intentionally skipping bit 0 to work around old glibc bugs, which you might want to add to the comments. Also a comment or perhaps using 1<<1 syntax would make it more clear that 2 is the bit mask containing bit 1 and that's why it has to be 2, and not because of some other magical property of 2.) But if kernel packagers don't write the matching bit number in their ld.so.conf.d files, then ld.so.cache lookups won't work right. Thanks, Roland
Re: Loud pop coming from hard drive on reboot
Peter Zijlstra wrote: but I have an increasing seek error rate as well. I got the ST disk because thinkwiki suggested it. Apparently Seagate has their own definition of seek error rate. Large numbers are normal, or at least very common. Now I wonder if they have their own way of doing retract count... - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ANNOUNCE][PATCH] Kcli - Kernel command line interface.
On Mon, 23 Apr 2007 14:31:39 -0700 (PDT) Matt Ranon [EMAIL PROTECTED] wrote: (text reformatted to less than 80 cols. Please, we'll get along a lot better if you don't send 1000-column emails) The Jem team is pleased to announce the release of Kcli, an in-kernel command line interface. Kcli is intended for a special class of embedded Linux applications. The Linux kernel has become the defacto standard OS for embedded applications. This means that Linux is getting bent in some ways that may appear strange to some. One of these ways, is embedded applications that do not use user space. User space consists of a statically linked, one line program, that simply sleeps forever, transforming Linux into a classical embedded RTOS. VxWorks developers will understand what we are talking about, and they may recall how much they depend on the VxWorks shell. Kcli attempts to meet the need for a shell for this class of embedded Linux applications. Alas, we are not vxworks developers, and probably most of us know zero about the use-cases for this feature, why embedded systems find it valuable, etc. So it's up to you to tell us all this. Kcli provides a command line environment that runs in the kernel, and that can be extended with custom commands registered by other kernel modules. We have found Kcli invaluable for our development, and we are releasing the patch, in case others find it useful. Kcli is directly derived from libcli written by David Parrish and Brendan O'Dea, and the regular expression support is directly derived from diet libc written by Felix von Leitner. The Jem team fully understands that this kind of patch may not be appropriate for inclusion in the mainline kernel code. We have no expectation that it will be, and we leave that decision fully in the hands of those responsible. We don't have enough information to make that call. Nonetheless, we feel that others may find it useful, and we will also appreciate any appropriate feedback from the community. 
Kcli is standalone, and modifies no kernel files, except for the Kconfig and Makefile modifications required to wire it into the configuration and build. The obvious question is: what's _wrong_ with doing all this in some cut-down userspace environment like busybox? Why is this stuff better? Obviously some embedded developers have considered that option and have rejected it. But we do need to be told, at length, why that decision was made. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SATA SB600 works in 2.6.20.4 but not in 2.6.21-rc5 with irqpoll parameter
Karsten Vieth wrote: I can't report this problem from a new kernel, but i have the same problem with the kernel 2.6.20.1-33x from f7-test3. I managed to boot with these options: linux noapic acpi=off pci=nomsi irqpoll Can you narrow down the options? Hopefully pci=nomsi or similar should do it. irqpoll in particular is heavyweight and to be avoided if possible. Jeff - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Return EPERM not ECHILD on security_task_wait failure
On Thu, 15 Mar 2007, Roland McGrath wrote: This patch makes do_wait return -EPERM instead of -ECHILD if some children were ruled out solely because security_task_wait failed. What about using the return value from the security_task_wait hook (which should be -EACCES) ? - James -- James Morris [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [REPORT] First glitch1 results, 2.6.21-rc7-git6-CFSv5 + SD 0.46
On Monday 23 April 2007 17:57, Bill Davidsen wrote: I am not sure a binary attachment will go thru, I will move to the web site if not. I did a quick try of this script here. With SD 0.46 with X at nice 0 I was getting 1-2 frames per second. I decided to try cfs v5. The option to disable auto renicing did not work so many threads other than X are now at -19... SD 0.46: 1-2 FPS cfs v5 nice -19: 219-233 FPS cfs v5 nice 0: 1000-1996 FPS Looks like, in this case, nice -19 for X is NOT a good idea. Kernel is 2.6.20.7 (gentoo) UP amd64 with HZ 300 voluntary preempt (a fully preemptible kernel eventually locks up switching between 32- and 64-bit apps) Thanks, Ed Tomlinson
Re: [REPORT] First glitch1 results, 2.6.21-rc7-git6-CFSv5 + SD 0.46
On Monday 23 April 2007 19:45, Ed Tomlinson wrote: On Monday 23 April 2007 17:57, Bill Davidsen wrote: I am not sure a binary attachment will go thru, I will move to the web site if not. I did a quick try of this script here. With SD 0.46 with X at nice 0 I was getting 1-2 frames per second. I decided to try cfs v5. The option to disable auto renicing did not work so many threads other than X are now at -19... SD 0.46: 1-2 FPS cfs v5 nice -19: 219-233 FPS cfs v5 nice 0: 1000-1996 FPS cfs v5 nice -10: 60-65 FPS Looks like, in this case, nice -19 for X is NOT a good idea. Kernel is 2.6.20.7 (gentoo) UP amd64 with HZ 300 voluntary preempt (a fully preemptible kernel eventually locks up switching between 32- and 64-bit apps) Thanks Ed
Re: MODULE_MAINTAINER
On Mon, 2007-04-23 at 07:52 -0400, Robert P. J. Day wrote: On Mon, 23 Apr 2007, Rusty Russell wrote: On Mon, 2007-04-23 at 11:33 +0200, Rene Herman wrote: On 04/04/2007 06:38 PM, Rene Herman wrote: Rusty? Valid points have been made on both sides. I suggest: #define MODULE_MAINTAINER(_maintainer) \ MODULE_AUTHOR((Maintained by) _maintainer) why bring MODULE_AUTHOR into it? just define it in terms of MODULE_INFO: Because author is an established field. People might well search for it. This is fairly clear, and assuming that the maintainer has actually done any maintenance, they're an author too. Rusty. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Question about Reiser4
Theodore Tso wrote: One of the big problems of using a filesystem as a DB is the system call overheads. If you use huge numbers of tiny files, then each attempt to read an atom of information from the DB takes three system calls --- an open(), read(), and close(), with all of the overheads in terms of dentry and inode cache. Now, to be fair, there are probably a number of cases where open/lseek/readv/close and open/lseek/writev/close would be worth doing as a single system call. The big problem as far as I can see involves EINTR handling; such a system call has serious restartability implications. Of course, there are Ingo's syslets... -hpa
Re: [PATCH] Remove obsolete label from ISDN4Linux (v3)
Am 22.04.2007 17:17 schrieb Alan Cox: Well once it ends up BROKEN perhaps patches will appear, or before that. If not well the pain factor will resolve the problem. No risk of deadlock. It'll progress to BROKEN which will either cause sufficient pain for someone to get off their arse and fix it, for enough of a vendors users to get the vendor to do the work or for someone who cares to pay a third party to do the work. Do I sense some hidden agenda there? No I'm speaking from experience - if a subsystem maintainer is too busy/working on other projects and the subsystem stops working it produces a rapid and sudden supply of new maintainers, unless nobody cares in which case it can go in the bitbucket. The isdn4linux subsystem will not progress to BROKEN unless somebody pushes it there. It has drivers using functions that will soon be deleted. That isn't so much as pushing more like getting fed up of pulling someone elses cart along. Do I understand you correctly? You deliberately want to move it to BROKEN to cause pain in the hope of forcing somebody other than the person who did the kernel change in the first place (quote stable_api_nonsense.txt) to do the fixing up? Am 22.04.2007 18:20 schrieb Alan Cox: Why, or rather how, were the writers of newer APIs _allowed_ to push *their* stuff into the kernel _without_ even bothering to convert the *existing* users of the older APIs in the kernel? This goes against Because to convert the existing ISDN4Linux heap into the new APIs would require someone with all the cards involved and a lot of time (as the card drivers need a *lot* of work by now to bring them up to todays work) Not true. None of the past kernel API changes were done by someone who had all the hardware for the affected drivers. I have personally acked changes to the driver I maintain from people who don't have the hardware, and the changes were fine. 
The one inventing a new kernel API to replace an old one is in the best position for actually replacing it in the existing users of the old API, and that's also what stable_api_nonsense.txt stipulates. Precedent, that implies it is a new behaviour - which it isn't. We regularly break old driver code when it is necessary in order to make general progress. Grep for BROKEN in the kernel tree. I did grep for BROKEN in the 2.6.21-rc7 sources and couldn't find an instance of a driver that was still in active use being broken in order to make general progress. OTOH I remember several cases of drivers being kept alive even though they were in the way of progress, because there were still users relying on them. You, and anyone else who wants to, are free to work on I4L and fix it, improve it and make it better. You are turning the situation on its head. I4L works. Somebody wants to push through a kernel API change that would break it. In every other case I know, it was the responsibility of those doing the kernel API change to fix the in-tree users of that API. As long as they didn't finish that job, the old API would stay. Nobody advocates moving reiserfs to BROKEN for still using lock_kernel(), to cite a recent issue. So why isdn4linux? -- Tilman Schmidt E-Mail: [EMAIL PROTECTED] Bonn, Germany - Undetected errors are handled as if no error occurred. (IBM)
[patch 1/7] libata: check for AN support
Check to see if an ATAPI device supports Asynchronous Notification. If so, enable it.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/drivers/ata/libata-core.c
===================================================================
--- 2.6-git.orig/drivers/ata/libata-core.c
+++ 2.6-git/drivers/ata/libata-core.c
@@ -70,6 +70,7 @@ const unsigned long sata_deb_timing_long
 static unsigned int ata_dev_init_params(struct ata_device *dev,
 					u16 heads, u16 sectors);
 static unsigned int ata_dev_set_xfermode(struct ata_device *dev);
+static unsigned int ata_dev_set_AN(struct ata_device *dev);
 static void ata_dev_xfermask(struct ata_device *dev);

 static unsigned int ata_print_id = 1;
@@ -1744,6 +1745,23 @@ int ata_dev_configure(struct ata_device
 		}
 		dev->cdb_len = (unsigned int) rc;

+		/*
+		 * check to see if this ATAPI device supports
+		 * Asynchronous Notification
+		 */
+		if ((ap->flags & ATA_FLAG_AN) && ata_id_has_AN(id))
+		{
+			/* issue SET feature command to turn this on */
+			rc = ata_dev_set_AN(dev);
+			if (rc) {
+				ata_dev_printk(dev, KERN_ERR,
+					"unable to set AN\n");
+				rc = -EINVAL;
+				goto err_out_nosup;
+			}
+			dev->flags |= ATA_DFLAG_AN;
+		}
+
 		if (ata_id_cdb_intr(dev->id)) {
 			dev->flags |= ATA_DFLAG_CDB_INTR;
 			cdb_intr_string = ", CDB intr";
@@ -3525,6 +3543,42 @@ static unsigned int ata_dev_set_xfermode
 }

 /**
+ *	ata_dev_set_AN - Issue SET FEATURES - SATA FEATURES
+ *	with sector count set to indicate
+ *	Asynchronous Notification feature
+ *	@dev: Device to which command will be sent
+ *
+ *	Issue SET FEATURES - SATA FEATURES command to device @dev
+ *	on port @ap.
+ *
+ *	LOCKING:
+ *	PCI/etc. bus probe sem.
+ *
+ *	RETURNS:
+ *	0 on success, AC_ERR_* mask otherwise.
+ */
+static unsigned int ata_dev_set_AN(struct ata_device *dev)
+{
+	struct ata_taskfile tf;
+	unsigned int err_mask;
+
+	/* set up set-features taskfile */
+	DPRINTK("set features - SATA features\n");
+
+	ata_tf_init(dev, &tf);
+	tf.command = ATA_CMD_SET_FEATURES;
+	tf.feature = SETFEATURES_SATA_ENABLE;
+	tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE;
+	tf.protocol = ATA_PROT_NODATA;
+	tf.nsect = SATA_AN;
+
+	err_mask = ata_exec_internal(dev, &tf, NULL, DMA_NONE, NULL, 0);
+
+	DPRINTK("EXIT, err_mask=%x\n", err_mask);
+	return err_mask;
+}
+
+/**
  *	ata_dev_init_params - Issue INIT DEV PARAMS command
  *	@dev: Device to which command will be sent
  *	@heads: Number of heads (taskfile parameter)
Index: 2.6-git/include/linux/ata.h
===================================================================
--- 2.6-git.orig/include/linux/ata.h
+++ 2.6-git/include/linux/ata.h
@@ -194,6 +194,12 @@ enum {
 	SETFEATURES_WC_ON	= 0x02, /* Enable write cache */
 	SETFEATURES_WC_OFF	= 0x82, /* Disable write cache */

+	SETFEATURES_SATA_ENABLE	= 0x10, /* Enable use of SATA feature */
+	SETFEATURES_SATA_DISABLE = 0x90, /* Disable use of SATA feature */
+
+	/* SETFEATURE Sector counts for SATA features */
+	SATA_AN			= 0x05, /* Asynchronous Notification */
+
 	/* ATAPI stuff */
 	ATAPI_PKT_DMA		= (1 << 0),
 	ATAPI_DMADIR		= (1 << 2),	/* ATAPI data dir:
@@ -299,6 +305,8 @@ struct ata_taskfile {
 #define ata_id_queue_depth(id)	(((id)[75] & 0x1f) + 1)
 #define ata_id_removeable(id)	((id)[0] & (1 << 7))
 #define ata_id_has_dword_io(id)	((id)[50] & (1 << 0))
+#define ata_id_has_AN(id)	\
+	((id[76] && (~id[76])) && ((id)[78] & (1 << 5)))
 #define ata_id_iordy_disable(id) ((id)[49] & (1 << 10))
 #define ata_id_has_iordy(id)	((id)[49] & (1 << 9))
 #define ata_id_u32(id,n)	\
Index: 2.6-git/include/linux/libata.h
===================================================================
--- 2.6-git.orig/include/linux/libata.h
+++ 2.6-git/include/linux/libata.h
@@ -136,6 +136,7 @@ enum {
 	ATA_DFLAG_CDB_INTR	= (1 << 2), /* device asserts INTRQ when ready for CDB */
 	ATA_DFLAG_NCQ		= (1 << 3), /* device supports NCQ */
 	ATA_DFLAG_FLUSH_EXT	= (1 << 4), /* do FLUSH_EXT instead of FLUSH */
+	ATA_DFLAG_AN		= (1 << 5), /* device supports Async notification */
 	ATA_DFLAG_CFG_MASK	= (1 << 8) - 1,

 	ATA_DFLAG_PIO		= (1 << 8), /* device limited to PIO mode */
@@ -174,6 +175,7 @@ enum {
 	ATA_FLAG_SETXFER_POLLING = (1 << 14), /* use polling for SETXFER */
[patch 5/7] genhd: send async notification on media change
Send an uevent to user space to indicate that a media change event has occurred.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/block/genhd.c
===================================================================
--- 2.6-git.orig/block/genhd.c
+++ 2.6-git/block/genhd.c
@@ -643,6 +643,25 @@ struct seq_operations diskstats_op = {
 	.show	= diskstats_show
 };

+static void media_change_notify_thread(struct work_struct *work)
+{
+	struct gendisk *gd = container_of(work, struct gendisk, async_notify);
+	char event[] = "MEDIA_CHANGE=1";
+	char *envp[] = { event, NULL };
+
+	/*
+	 * set environment vars to indicate which event this is for
+	 * so that user space will know to go check the media status.
+	 */
+	kobject_uevent_env(&gd->kobj, KOBJ_CHANGE, envp);
+}
+
+void genhd_media_change_notify(struct gendisk *disk)
+{
+	schedule_work(&disk->async_notify);
+}
+EXPORT_SYMBOL_GPL(genhd_media_change_notify);
+
 struct gendisk *alloc_disk(int minors)
 {
 	return alloc_disk_node(minors, -1);
@@ -672,6 +691,8 @@ struct gendisk *alloc_disk_node(int mino
 		kobj_set_kset_s(disk, block_subsys);
 		kobject_init(&disk->kobj);
 		rand_initialize_disk(disk);
+		INIT_WORK(&disk->async_notify,
+				media_change_notify_thread);
 	}
 	return disk;
 }
Index: 2.6-git/include/linux/genhd.h
===================================================================
--- 2.6-git.orig/include/linux/genhd.h
+++ 2.6-git/include/linux/genhd.h
@@ -66,6 +66,7 @@ struct partition {
 #include <linux/smp.h>
 #include <linux/string.h>
 #include <linux/fs.h>
+#include <linux/workqueue.h>

 struct partition {
 	unsigned char boot_ind;		/* 0x80 - active */
@@ -139,6 +140,7 @@ struct gendisk {
 #else
 	struct disk_stats dkstats;
 #endif
+	struct work_struct async_notify;
 };

 /* Structure for sysfs attributes on block devices */
@@ -419,7 +421,7 @@ extern struct gendisk *alloc_disk_node(i
 extern struct gendisk *alloc_disk(int minors);
 extern struct kobject *get_disk(struct gendisk *disk);
 extern void put_disk(struct gendisk *disk);
-
+extern void genhd_media_change_notify(struct gendisk *disk);
 extern void blk_register_region(dev_t dev, unsigned long range,
 			struct module *module,
 			struct kobject *(*probe)(dev_t, int *, void *),
[patch 0/7] Asynchronous Notification for ATAPI devices (v2)
This patch series implements Asynchronous Notification (AN) for SATA ATAPI devices as defined in SATA 2.5 and AHCI 1.1 and higher. Drives which support this feature will send a notification when new media is inserted and removed, removing the need for user space to poll for new media. This support is exposed to user space via a flag that will be set in /sys/block/sr*/capability_flags. If the flag is set, user space can disable polling for the new media, and the genhd driver will send a KOBJ_CHANGE event with the envp set to MEDIA_CHANGE_EVENT=1. Note that this patch only implements support for directly attached drives - AN with drives attached to a port multiplier requires additional changes. Thanks! Kristen
[patch 3/7] scsi: expose AN to user space
Get media change notification capability from the disk and pass this information to genhd by setting the appropriate flag.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/drivers/scsi/sr.c
===================================================================
--- 2.6-git.orig/drivers/scsi/sr.c
+++ 2.6-git/drivers/scsi/sr.c
@@ -601,6 +601,8 @@ static int sr_probe(struct device *dev)
 	dev_set_drvdata(dev, cd);
 	disk->flags |= GENHD_FL_REMOVABLE;
+	if (sdev->media_change_notify)
+		disk->flags |= GENHD_FL_MEDIA_CHANGE_NOTIFY;
 	add_disk(disk);

 	sdev_printk(KERN_DEBUG, sdev,
Index: 2.6-git/include/scsi/scsi_device.h
===================================================================
--- 2.6-git.orig/include/scsi/scsi_device.h
+++ 2.6-git/include/scsi/scsi_device.h
@@ -124,7 +124,7 @@ struct scsi_device {
 	unsigned fix_capacity:1;	/* READ_CAPACITY is too high by 1 */
 	unsigned guess_capacity:1;	/* READ_CAPACITY might be too high by 1 */
 	unsigned retry_hwerror:1;	/* Retry HARDWARE_ERROR */
-
+	unsigned media_change_notify:1;	/* dev supports async media notify */
 	unsigned int device_blocked;	/* Device returned QUEUE_FULL. */

 	unsigned int max_device_blocked; /* what device_blocked counts down from */
Index: 2.6-git/drivers/scsi/sd.c
===================================================================
--- 2.6-git.orig/drivers/scsi/sd.c
+++ 2.6-git/drivers/scsi/sd.c
@@ -1706,6 +1706,9 @@ static int sd_probe(struct device *dev)
 	if (sdp->removable)
 		gd->flags |= GENHD_FL_REMOVABLE;

+	if (sdp->media_change_notify)
+		gd->flags |= GENHD_FL_MEDIA_CHANGE_NOTIFY;
+
 	dev_set_drvdata(dev, sdkp);
 	add_disk(gd);
[patch 4/7] libata: expose AN to user space
If Asynchronous Notification of media change events is supported, pass that information up to the SCSI layer.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/drivers/ata/libata-scsi.c
===================================================================
--- 2.6-git.orig/drivers/ata/libata-scsi.c
+++ 2.6-git/drivers/ata/libata-scsi.c
@@ -899,6 +899,9 @@ static void ata_scsi_dev_config(struct s
 		blk_queue_max_hw_segments(q, q->max_hw_segments - 1);
 	}

+	if (dev->flags & ATA_DFLAG_AN)
+		sdev->media_change_notify = 1;
+
 	if (dev->flags & ATA_DFLAG_NCQ) {
 		int depth;
[patch 6/7] SCSI: save disk in scsi_device
Give anyone who has access to scsi_device access to the genhd struct as well.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/drivers/scsi/sd.c
===================================================================
--- 2.6-git.orig/drivers/scsi/sd.c
+++ 2.6-git/drivers/scsi/sd.c
@@ -1711,6 +1711,7 @@ static int sd_probe(struct device *dev)
 	dev_set_drvdata(dev, sdkp);
 	add_disk(gd);
+	sdp->disk = gd;

 	sdev_printk(KERN_NOTICE, sdp, "Attached scsi %sdisk %s\n",
 		    sdp->removable ? "removable " : "", gd->disk_name);
Index: 2.6-git/drivers/scsi/sr.c
===================================================================
--- 2.6-git.orig/drivers/scsi/sr.c
+++ 2.6-git/drivers/scsi/sr.c
@@ -604,6 +604,7 @@ static int sr_probe(struct device *dev)
 	if (sdev->media_change_notify)
 		disk->flags |= GENHD_FL_MEDIA_CHANGE_NOTIFY;
 	add_disk(disk);
+	sdev->disk = disk;

 	sdev_printk(KERN_DEBUG, sdev,
 		    "Attached scsi CD-ROM %s\n", cd->cdi.name);
Index: 2.6-git/include/scsi/scsi_device.h
===================================================================
--- 2.6-git.orig/include/scsi/scsi_device.h
+++ 2.6-git/include/scsi/scsi_device.h
@@ -138,7 +138,7 @@ struct scsi_device {
 	struct device		sdev_gendev;
 	struct class_device	sdev_classdev;
-
+	struct gendisk *disk;
 	struct execute_work	ew; /* used to get process context on put */

 	enum scsi_device_state sdev_state;
[patch 7/7] libata: send event when AN received
When we get an SDB FIS with the 'N' bit set, we should send an event to user space to indicate that there has been a media change. This will be done via the block device.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/drivers/ata/ahci.c
===================================================================
--- 2.6-git.orig/drivers/ata/ahci.c
+++ 2.6-git/drivers/ata/ahci.c
@@ -1147,6 +1147,25 @@ static void ahci_host_intr(struct ata_po
 		return;
 	}

+	if (status & PORT_IRQ_SDB_FIS) {
+		/*
+		 * if this is an ATAPI device with AN turned on,
+		 * then we should interrogate the device to
+		 * determine the cause of the interrupt
+		 *
+		 * for AN - this we should check the SDB FIS
+		 * and find the I and N bits set
+		 */
+		const u32 *f = pp->rx_fis + RX_FIS_SDB;
+
+		/* check the 'N' bit in word 0 of the FIS */
+		if (f[0] & (1 << 15)) {
+			int port_addr = ((f[0] & 0x0f00) >> 8);
+			struct ata_device *adev = &ap->device[port_addr];
+			if (adev->flags & ATA_DFLAG_AN)
+				ata_scsi_media_change_notify(adev);
+		}
+	}
 	if (ap->sactive)
 		qc_active = readl(port_mmio + PORT_SCR_ACT);
 	else
Index: 2.6-git/include/linux/libata.h
===================================================================
--- 2.6-git.orig/include/linux/libata.h
+++ 2.6-git/include/linux/libata.h
@@ -737,6 +737,7 @@ extern void ata_host_init(struct ata_hos
 extern int ata_scsi_detect(struct scsi_host_template *sht);
 extern int ata_scsi_ioctl(struct scsi_device *dev, int cmd, void __user *arg);
 extern int ata_scsi_queuecmd(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *));
+extern void ata_scsi_media_change_notify(struct ata_device *atadev);
 extern void ata_sas_port_destroy(struct ata_port *);
 extern struct ata_port *ata_sas_port_alloc(struct ata_host *,
 					   struct ata_port_info *, struct Scsi_Host *);
Index: 2.6-git/drivers/ata/libata-scsi.c
===================================================================
--- 2.6-git.orig/drivers/ata/libata-scsi.c
+++ 2.6-git/drivers/ata/libata-scsi.c
@@ -3057,6 +3057,22 @@ static void ata_scsi_remove_dev(struct a
 }

 /**
+ *	ata_scsi_media_change_notify - send media change event
+ *	@atadev: Pointer to the disk device with media change event
+ *
+ *	Tell the block layer to send a media change notification
+ *	event.
+ *
+ *	LOCKING:
+ *	interrupt context, may not sleep.
+ */
+void ata_scsi_media_change_notify(struct ata_device *atadev)
+{
+	genhd_media_change_notify(atadev->sdev->disk);
+}
+EXPORT_SYMBOL_GPL(ata_scsi_media_change_notify);
+
+/**
  *	ata_scsi_hotplug - SCSI part of hotplug
  *	@work: Pointer to ATA port to perform SCSI hotplug on
[patch 2/7] genhd: expose AN to user space
Allow user space to determine if a disk supports Asynchronous Notification of media changes. This is done by adding a new sysfs file capability_flags, which is documented in (insert file name). This sysfs file will export all disk capabilities flags to user space. We also define a new flag to define the media change notification capability.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/block/genhd.c
===================================================================
--- 2.6-git.orig/block/genhd.c
+++ 2.6-git/block/genhd.c
@@ -370,7 +370,10 @@ static ssize_t disk_size_read(struct gen
 {
 	return sprintf(page, "%llu\n", (unsigned long long)get_capacity(disk));
 }
-
+static ssize_t disk_capability_read(struct gendisk *disk, char *page)
+{
+	return sprintf(page, "%x\n", disk->flags);
+}
 static ssize_t disk_stats_read(struct gendisk * disk, char *page)
 {
 	preempt_disable();
@@ -413,6 +416,10 @@ static struct disk_attribute disk_attr_s
 	.attr = {.name = "size", .mode = S_IRUGO },
 	.show	= disk_size_read
 };
+static struct disk_attribute disk_attr_capability = {
+	.attr = {.name = "capability_flags", .mode = S_IRUGO },
+	.show	= disk_capability_read
+};
 static struct disk_attribute disk_attr_stat = {
 	.attr = {.name = "stat", .mode = S_IRUGO },
 	.show	= disk_stats_read
@@ -453,6 +460,7 @@ static struct attribute * default_attrs[
 	&disk_attr_removable.attr,
 	&disk_attr_size.attr,
 	&disk_attr_stat.attr,
+	&disk_attr_capability.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&disk_attr_fail.attr,
 #endif
Index: 2.6-git/include/linux/genhd.h
===================================================================
--- 2.6-git.orig/include/linux/genhd.h
+++ 2.6-git/include/linux/genhd.h
@@ -94,6 +94,7 @@ struct hd_struct {
 #define GENHD_FL_REMOVABLE			1
 #define GENHD_FL_DRIVERFS			2
+#define GENHD_FL_MEDIA_CHANGE_NOTIFY		4
 #define GENHD_FL_CD				8
 #define GENHD_FL_UP				16
 #define GENHD_FL_SUPPRESS_PARTITION_INFO	32
Index: 2.6-git/Documentation/block/capability_flags.txt
===================================================================
--- /dev/null
+++ 2.6-git/Documentation/block/capability_flags.txt
@@ -0,0 +1,15 @@
+Generic Block Device Capability Flags
+=====================================
+This file documents the sysfs file block/<disk>/capability_flags
+
+capability_flags is a hex word indicating which capabilities a specific
+disk supports. For more information on bits not listed here, see
+include/linux/genhd.h
+
+Capability			Value
+-------------------------------
+GENHD_FL_MEDIA_CHANGE_NOTIFY	4
+	When this bit is set, the disk supports Asynchronous Notification
+	of media change events. These events will be broadcast to user
+	space via kernel uevent.
Re: Question about Reiser4
On Monday April 23, [EMAIL PROTECTED] wrote: Theodore Tso wrote: One of the big problems of using a filesystem as a DB is the system call overheads. If you use huge numbers of tiny files, then each attempt to read an atom of information from the DB takes three system calls --- an open(), read(), and close(), with all of the overheads in terms of dentry and inode cache. Now, to be fair, there are probably a number of cases where open/lseek/readv/close and open/lseek/writev/close would be worth doing as a single system call. The big problem as far as I can see involves EINTR handling; such a system call has serious restartability implications. Of course, there are Ingo's syslets... Or you could think outside the circle: Store all your small files as symlinks, then use symlink to create them and readlink to read them. (You would probably end up using symlinkat and readlinkat). Only one system call instead of three. I guess you don't get meaningful permission bits then... I wonder if that really matters. NeilBrown
Re: Question about Reiser4
On Mon, Apr 23, 2007 at 04:53:03PM -0700, H. Peter Anvin wrote: Theodore Tso wrote: One of the big problems of using a filesystem as a DB is the system call overheads. If you use huge numbers of tiny files, then each attempt to read an atom of information from the DB takes three system calls --- an open(), read(), and close(), with all of the overheads in terms of dentry and inode cache. Now, to be fair, there are probably a number of cases where open/lseek/readv/close and open/lseek/writev/close would be worth doing as a single system call. The big problem as far as I can see involves EINTR handling; such a system call has serious restartability implications. Sure, but Hans wants to change /etc/inetd.conf into /etc/inetd.conf.d, where you have: /etc/inetd.conf.d/telnet/port, /etc/inetd.conf.d/telnet/protocol, /etc/inetd.conf.d/telnet/wait, /etc/inetd.conf.d/telnet/userid, /etc/inetd.conf.d/telnet/daemon, etc. for each individual line in /etc/inetd.conf. (And where each file might only contain 2-4 characters each: e.g., "23", "tcp", "root", etc.) So it's not enough just to collapse open/pread/close into a single system call; in order to gain back the performance squandered by all of these itsy-bitsy tiny little files, you want to collapse the open/pread/close for many of these little files into a single system call, hence Hans's insistence on sys_reiser4(); otherwise his scheme doesn't work all that well at all. - Ted
Re: Question about Reiser4
Neil Brown wrote: Or you could think outside the circle: Store all your small files as symlinks, then use symlink to create them and readlink to read them. (You would probably end up using symlinkat and readlinkat). Only one system call instead of three. I guess you don't get meaningful permission bits then... I wonder if that really matters. For some applications, oh yes it does. -hpa
Re: [PATCH]Fix parsing kernelcore boot option for ia64
On Mon, 23 Apr 2007 19:32:46 +0100 [EMAIL PROTECTED] (Mel Gorman) wrote: I wasn't even aware of this kernelcore thing. It's pretty nasty-looking. yet another reminder that this code hasn't been properly reviewed in the past year or three. Just now, I'm making memory-unplug patches with current MOVABLE_ZONE code. So, I might be the first user of it on ia64. Anyway, I'll try to fix it. Can you review this patch and see does it fix the problem please? There was a second problem that showed up while testing this in relation to the bootmem allocator assumptions about zone boundary alignment. I'll follow up this mail with the patch in case you are seeing that problem. Subject: Fix parsing kernelcore boot option V2 cmdline_parse_kernelcore() should return the next pointer of the boot option, like memparse() does. If not, it causes an eternal loop on ia64 boxes. This patch is for 2.6.21-rc6-mm1. This patch changes the kernelcore command line parsing so that it is compatible with both the early_param() way of doing things and IA64. In my understanding, the reason ia64 doesn't use the early_param() macro for mem= et al. is that it has to use the mem= option at efi handling, which is called before parse_early_param(). Current ia64's boot path is setup_arch() -> efi handling -> parse_early_param() -> numa handling -> pgdat/zone init. The kernelcore= option is just used at pgdat/zone initialization. (no arch dependent part...) So I think just adding == early_param("kernelcore", cmdline_parse_kernelcore) == to ia64 is ok. -Kame
Re: [PATCH] Return EPERM not ECHILD on security_task_wait failure
On Thu, 15 Mar 2007, Roland McGrath wrote: This patch makes do_wait return -EPERM instead of -ECHILD if some children were ruled out solely because security_task_wait failed. What about using the return value from the security_task_wait hook (which should be -EACCES) ? As I said in some earlier discussion following my original patch, that would be fine with me. I haven't coded up that variant, but it's simple enough. Would you like to do it? Thanks, Roland
Re: Question about Reiser4
Theodore Tso wrote: Now, to be fair, there are probably a number of cases where open/lseek/readv/close and open/lseek/writev/close would be worth doing as a single system call. The big problem as far as I can see involves EINTR handling; such a system call has serious restartability implications. Sure, but Hans wants to change /etc/inetd.conf into /etc/inetd.conf.d, where you have: /etc/inetd.conf.d/telnet/port, /etc/inetd.conf.d/telnet/protocol, /etc/inetd.conf.d/telnet/wait, /etc/inetd.conf.d/telnet/userid, /etc/inetd.conf.d/telnet/daemon, etc. for each individual line in /etc/inetd.conf. (And where each file might only contain 2-4 characters each: e.g., "23", "tcp", "root", etc.) So it's not enough just to collapse open/pread/close into a single system call; in order to gain back the performance squandered by all of these itsy-bitsy tiny little files, you want to collapse the open/pread/close for many of these little files into a single system call, hence Hans's insistence on sys_reiser4(); otherwise his scheme doesn't work all that well at all. Heh. sys_read_tree() -- walk a directory tree and return it as a data structure in memory :) -hpa
Re: SLUB: kmem_cache_destroy doesn't - version 2.
On Monday April 23, [EMAIL PROTECTED] wrote: Would this work? Contains a solution somewhat along the lines of your thoughts on the subject. Concept seems sound. Code needs a kfree of the name returned by create_unique_id, and I think ID_STR_LENGTH needs to be at least 34. Maybe that should be allocated on the stack in sysfs_slab_add, rather than using kmalloc/free. Thanks, NeilBrown
Re: [PATCH] Return EPERM not ECHILD on security_task_wait failure
On Mon, 23 Apr 2007, Roland McGrath wrote: As I said in some earlier discussion following my original patch, that would be fine with me. I haven't coded up that variant, but it's simple enough. Would you like to do it? Sure. -- James Morris [EMAIL PROTECTED]
Re: SLUB: kmem_cache_destroy doesn't - version 2.
On Tue, 24 Apr 2007, Neil Brown wrote: On Monday April 23, [EMAIL PROTECTED] wrote: Would this work? Contains a solution somewhat along the lines of your thoughts on the subject. Concept seems sound. Code needs a kfree of the name returned by create_unique_id, and I think ID_STR_LENGTH needs to be at least 34. Sysfs copies the string?
Re: SLUB: kmem_cache_destroy doesn't - version 2.
On Monday April 23, [EMAIL PROTECTED] wrote: On Tue, 24 Apr 2007, Neil Brown wrote: On Monday April 23, [EMAIL PROTECTED] wrote: Would this work? Contains a solution somewhat along the lines of your thoughts on the subject. Concept seems sound. Code needs a kfree of the name returned by create_unique_id, and I think ID_STR_LENGTH needs to be at least 34. Sysfs copies the string? kobject_set_name copies the string, either into a small char array in the kobject, or into kmalloced space. kobject_set_name actually takes a format and arbitrary args and uses vsnprintf, so it has to make its own copy. NeilBrown
Re: Update the list information for kexec and kdump
On Mon, Apr 23, 2007 at 12:04:01PM -0600, Eric W. Biederman wrote: Simon Horman [EMAIL PROTECTED] writes: Update the list information for kexec and kdump Signed-off-by: Simon Horman [EMAIL PROTECTED] --- Is it too early for this change? It looks like the new list is working, and isn't likely to get overwhelmed with spam. I don't know if everyone has switched over yet but we can certainly update MAINTAINERS. Last time I checked there were 28 people in the kexec@ list. This isn't everyone, but it is getting there. May I add an Acked-by from you? Eric

Index: linux-2.6/MAINTAINERS
===================================================================
--- linux-2.6.orig/MAINTAINERS	2007-04-23 17:34:30.0 +0900
+++ linux-2.6/MAINTAINERS	2007-04-23 17:34:47.0 +0900
@@ -1951,7 +1951,7 @@
 P:	Vivek Goyal
 M:	[EMAIL PROTECTED]
 P:	Haren Myneni
 M:	[EMAIL PROTECTED]
-L:	[EMAIL PROTECTED]
+L:	[EMAIL PROTECTED]
 L:	linux-kernel@vger.kernel.org
 W:	http://lse.sourceforge.net/kdump/
 S:	Maintained
@@ -2001,7 +2001,7 @@
 P:	Eric Biederman
 M:	[EMAIL PROTECTED]
 W:	http://www.xmission.com/~ebiederm/files/kexec/
 L:	linux-kernel@vger.kernel.org
-L:	[EMAIL PROTECTED]
+L:	[EMAIL PROTECTED]
 S:	Maintained

 KPROBES
-- Horms H: http://www.vergenet.net/~horms/ W: http://www.valinux.co.jp/en/
Re: AppArmor FAQ
David Wagner wrote: James Morris wrote: [...] you can change the behavior of the application and then bypass policy entirely by utilizing any mechanism other than direct filesystem access: IPC, shared memory, Unix domain sockets, local IP networking, remote networking etc. [...] Just look at their code and their own description of AppArmor. My gosh, you're right. What the heck? With all due respect to the developers of AppArmor, I can't help thinking that that's pretty lame. I think this raises substantial questions about the value of AppArmor. What is the point of having a jail if it leaves gaping holes that malicious code could use to escape? And why isn't this documented clearly, with the implications fully explained? I would like to hear the AppArmor developers defend this design decision. It was a simplicity trade off at the time, when AppArmor was mostly aimed at servers, and there was no HAL or DBUS. Now it is definitely a limitation that we are addressing. We are working on a mediation system for what kind of IPC a confined process can do http://forge.novell.com/pipermail/apparmor-dev/2007-April/000503.html When our IPC mediation system is code instead of vapor, it will also appear here for review. Meanwhile, AppArmor does not make IPC security any worse, confined processes are still subject to the usual Linux IPC restrictions. AppArmor actually makes the IPC situation somewhat more secure than stock Linux, e.g. normal DBUS deployment can be controlled through file access permissions. But we are not claiming AppArmor to be an IPC security enhancement, yet. The proposed set of patches is a self-contained access control system for file system access, and we would like it reviewed as such. Current AppArmor docs are quite explicit that AppArmor only mediates file access and POSIX.1e capabilities. Crispin -- Crispin Cowan, Ph.D. 
http://crispincowan.com/~crispin/ Director of Software Engineering http://novell.com
RE: [REPORT] cfs-v4 vs sd-0.44
I don't know if we've discussed this or not. Since both CFS and SD claim to be fair, I'd like to hear more opinions on the fairness aspect of these designs. In areas such as OS, networking, and real-time, fairness, and its more general form, proportional fairness, are well-defined terms. In fact, perfect fairness is not feasible since it requires all runnable threads to be running simultaneously and scheduled with infinitesimally small quanta (like a fluid system). So to evaluate if a new scheduling algorithm is fair, the common approach is to take the ideal fair algorithm (often referred to as Generalized Processor Scheduling or GPS) as a reference model and analyze if the new algorithm can achieve a constant error bound (different error metrics also exist). I understand that via experiments we can show a design is reasonably fair in the common case, but IMHO, to claim that a design is fair, there needs to be some kind of formal analysis on the fairness bound, and this bound should be proven to be constant. Even if the bound is not constant, at least this analysis can help us better understand and predict the degree of fairness that users would experience (e.g., would the system be less fair if the number of threads increases? What happens if a large number of threads dynamically join and leave the system?). tong
Re: SLUB: kmem_cache_destroy doesn't - version 2.
On Tue, 24 Apr 2007, Neil Brown wrote: kobject_set_name actually takes a format and arbitrary args and uses vsnprintf, so it has to make its own copy. Ok then this should be fine... SLAB: Fix sysfs directory handling This fixes the problem that SLUB does not track the names of aliased slabs by changing the way that SLUB manages the files in /sys/slab. If the slab that is being operated on is not mergeable (usually the case if we are debugging) then do not create any aliases. If an alias exists that we conflict with then remove it before creating the directory for the unmergeable slab. If there is a true slab cache there and not an alias then we fail since there is a true duplication of slab cache names. So debugging allows the detection of slab name duplication as usual. If the slab is mergeable then we create a directory with a unique name created from the slab size, slab options and the pointer to the kmem_cache structure (disambiguation). All names referring to the slabs will then be created as symlinks to that unique name. These symlinks are not going to be removed on kmem_cache_destroy() since we only carry a counter for the number of aliases. If a new symlink is created then it may just replace an existing one. This means that one can create a gazillion slabs with the same name (if they all refer to mergeable caches). It will only increase the alias count. So we have the potential of not detecting duplicate slab names (there is actually no harm done by doing that). We will detect the duplications as soon as debugging is enabled because we will then no longer generate symlinks and special unique names.
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

Index: linux-2.6.21-rc6/mm/slub.c
===================================================================
--- linux-2.6.21-rc6.orig/mm/slub.c	2007-04-23 13:08:41.0 -0700
+++ linux-2.6.21-rc6/mm/slub.c	2007-04-23 18:05:16.0 -0700
@@ -3307,16 +3307,68 @@ static struct kset_uevent_ops slab_ueven

 decl_subsys(slab, &slab_ktype, &slab_uevent_ops);

+#define ID_STR_LENGTH 64
+
+/* Create a unique string id for a slab cache:
+ * format
+ * :[flags-]size:[memory address of kmemcache]
+ */
+static char *create_unique_id(struct kmem_cache *s)
+{
+	char *name = kmalloc(ID_STR_LENGTH, GFP_KERNEL);
+	char *p = name;
+
+	BUG_ON(!name);
+
+	*p++ = ':';
+	/*
+	 * First flags affecting slabcache operations
+	 */
+	if (s->flags & SLAB_CACHE_DMA)
+		*p++ = 'd';
+	if (s->flags & SLAB_RECLAIM_ACCOUNT)
+		*p++ = 'a';
+	if (s->flags & SLAB_DESTROY_BY_RCU)
+		*p++ = 'r';
+	/* Debug flags */
+	if (s->flags & SLAB_RED_ZONE)
+		*p++ = 'Z';
+	if (s->flags & SLAB_POISON)
+		*p++ = 'P';
+	if (s->flags & SLAB_STORE_USER)
+		*p++ = 'U';
+	if (p != name + 1)
+		*p++ = '-';
+	p += sprintf(p, "%07d:0x%p", s->size, s);
+	BUG_ON(p > name + ID_STR_LENGTH - 1);
+	return name;
+}
+
 static int sysfs_slab_add(struct kmem_cache *s)
 {
 	int err;
+	const char *name;

 	if (slab_state < SYSFS)
 		/* Defer until later */
 		return 0;

+	if (s->flags & SLUB_NEVER_MERGE) {
+		/*
+		 * Slabcache can never be merged so we can use the name proper.
+		 * This is typically the case for debug situations. In that
+		 * case we can catch duplicate names easily.
+		 */
+		sysfs_remove_link(&slab_subsys.kset.kobj, s->name);
+		name = s->name;
+	} else
+		/*
+		 * Create a unique name for the slab as a target
+		 * for the symlinks.
+		 */
+		name = create_unique_id(s);
+
 	kobj_set_kset_s(s, slab_subsys);
-	kobject_set_name(&s->kobj, s->name);
+	kobject_set_name(&s->kobj, name);
 	kobject_init(&s->kobj);
 	err = kobject_add(&s->kobj);
 	if (err)
@@ -3326,6 +3378,10 @@ static int sysfs_slab_add(struct kmem_ca
 	if (err)
 		return err;
 	kobject_uevent(&s->kobj, KOBJ_ADD);
+	if (!(s->flags & SLUB_NEVER_MERGE)) {
+		sysfs_slab_alias(s, s->name);
+		kfree(name);
+	}
 	return 0;
 }

@@ -3351,9 +3407,14 @@ static int sysfs_slab_alias(struct kmem_
 {
 	struct saved_alias *al;

-	if (slab_state == SYSFS)
+	if (slab_state == SYSFS) {
+		/*
+		 * If we have a leftover link then remove it.
+		 */
+		sysfs_remove_link(&slab_subsys.kset.kobj, name);
 		return sysfs_create_link(&slab_subsys.kset.kobj,
						&s->kobj, name);
+	}

 	al = kmalloc(sizeof(struct saved_alias), GFP_KERNEL);
 	if (!al)
Re: [ANNOUNCE][PATCH] Kcli - Kernel command line interface.
(text reformatted to less than 80 cols. Please, we'll get along a lot better if you don't send 1000-column emails) Sorry. I am afraid we are from a different background, and so very poorly versed in these things. My email client does not seem to have an option to tell it to format in 80 cols. So, hopefully, using CR, I am achieving the same effect. Let me know if it doesn't work, and I will have to switch to a different email client for conversing with the lkml. The obvious question is: what's _wrong_ with doing all this in some cut-down userspace environment like busybox? Why is this stuff better? Obviously some embedded developers have considered that option and have rejectedted it. But we do need to be told, at length, why that decision was made. There is nothing _wrong_ with doing it all in a cut-down userspace. It is a matter of personal preference, culture, and the application. That is what makes Linux so great, it is all about choice. We are developing devices that don't have a user space, and we don't see the point in including one just for debug purposes. We will not be offended if Kcli is not included into the kernel mainline, nor if Kcli compels people to call us stupid (as it already has) just because we are different and some people don't understand us. We are firm believers that the world, including the Linux kernel world, would be a nasty place if there was only _one_ way to do any given task. Additionally, we are almost certain that there will be others who think like we do, so we are reaching out to them. We also feel compelled to give _something_ back to the community that has given so much to us, and, for now, this is all we have. However, our reasons for Kcli are: 1) Our devices ship with no user space, and we want the development environment to be as close as possible to the final product. 2) Getting debug information with user space calls requires context switches and data copies, which changes the real time profile and can mask bugs.
3) To use user space, we would need cross compiled libc's, special builds of gcc, root file systems, flash storage to store it all, and all sorts of things which make life a lot more complicated than it needs to be for us. We are quite capable of producing all these things, but, we just don't see the point in it. Our way, we just have a gcc capable of cross compiling the kernel and it is so simple. 4) For us, it is the opposite argument. We would need to be convinced that having user space is worth all the overhead. Not just CPU overhead, but all the overheads. 5) We like it in the kernel, we find it to be warm and fuzzy. Whereas, user space is a cold, dark, and rainy place, and we just don't want to go there. :) We do not claim to have come up with a _better_ way. We have just created something that we feel would be useful to others. MRanon.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote: Use TLB batching for MADV_FREE. Adds another 10-15% extra performance to the MySQL sysbench results on my quad core system. Signed-off-by: Rik van Riel [EMAIL PROTECTED] --- Nick Piggin wrote: 3) because of this, we can treat any such accesses as happening simultaneously with the MADV_FREE and as illegal, aka undefined behaviour territory and we do not need to worry about them Yes, but I'm wondering if it is legal in all architectures. It's similar to trying to access memory during an munmap. You may be able to for a short time, but it'll come back to haunt you. The question is whether the architecture specific tlb flushing code will break or not. 4) because we flush the tlb before releasing the page table lock, other CPUs cannot remove this page from the address space - they will block on the page table lock before looking at this pte We don't when the ptl is split. Even then we do. Each invocation of zap_pte_range() only touches one page table page, and it flushes the TLB before releasing the page table lock. What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS. -- SUSE Labs, Novell Inc.
Re: Question about Reiser4
On Mon, Apr 23, 2007 at 05:31:29PM -0700, H. Peter Anvin wrote: Heh. sys_read_tree() -- walk a directory tree and return it as a data structure in memory :) But maybe you don't want every single file in the directory, but some subset of the files in the directory tree. So before you know it: sys_fs_sql("SELECT port,userid,daemon FROM /etc/inetd.conf.d WHERE protocol == 'tcp'", buf, sizeof(buf)); The question is where do you stop on the slippery slope, and is it really all that much harder than simply parsing a /etc/gitconfig or /etc/e2fsck.conf file? There are plenty of parsers or database libraries already written, and many of them are quite efficient. And personally, I'd much rather edit a single /etc/gitconfig or /etc/e2fsck.conf file using emacs than have to cd through 3 or 4 levels of directories to edit each 2-3 byte file one at a time. But to each their own. - Ted
[PATCH try #2] Return access error not ECHILD on security_task_wait failure
From: Roland McGrath [EMAIL PROTECTED]

wait* syscalls return -ECHILD even when an individual PID of a live child was requested explicitly, when security_task_wait denies the operation. This means that something like a broken SELinux policy can produce an unexpected failure that looks just like a bug with wait or ptrace or something.

This patch makes do_wait return -EACCES (or another appropriate error returned from security_task_wait()) instead of -ECHILD if some children were ruled out solely because security_task_wait failed.

Signed-off-by: James Morris [EMAIL PROTECTED]
---

Updated version, returns value from security_task_wait().

 kernel/exit.c |   17 +++++++++++++++--
 1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index b55ed4c..9236924 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1033,6 +1033,8 @@ asmlinkage void sys_exit_group(int error_code)

 static int eligible_child(pid_t pid, int options, struct task_struct *p)
 {
+	int err;
+
 	if (pid > 0) {
 		if (p->pid != pid)
 			return 0;
@@ -1066,8 +1068,9 @@ static int eligible_child(pid_t pid, int options, struct task_struct *p)
 	if (delay_group_leader(p))
 		return 2;

-	if (security_task_wait(p))
-		return 0;
+	err = security_task_wait(p);
+	if (err)
+		return err;

 	return 1;
 }
@@ -1449,6 +1452,7 @@ static long do_wait(pid_t pid, int options, struct siginfo __user *infop,
 	DECLARE_WAITQUEUE(wait, current);
 	struct task_struct *tsk;
 	int flag, retval;
+	int allowed, denied;

 	add_wait_queue(&current->signal->wait_chldexit, &wait);
 repeat:
@@ -1457,6 +1461,7 @@ repeat:
 	 * match our criteria, even if we are not able to reap it yet.
 	 */
 	flag = 0;
+	allowed = denied = 0;
 	current->state = TASK_INTERRUPTIBLE;
 	read_lock(&tasklist_lock);
 	tsk = current;
@@ -1472,6 +1477,12 @@ repeat:
 			if (!ret)
 				continue;

+			if (unlikely(ret < 0)) {
+				denied = ret;
+				continue;
+			}
+			allowed = 1;
+
 			switch (p->state) {
 			case TASK_TRACED:
 				/*
@@ -1570,6 +1581,8 @@ check_continued:
 		goto repeat;
 	}
 	retval = -ECHILD;
+	if (unlikely(denied) && !allowed)
+		retval = denied;
 end:
 	current->state = TASK_RUNNING;
 	remove_wait_queue(&current->signal->wait_chldexit, &wait);
--
1.5.0.6
Re: [PATCH 16/25] xen: Use the hvc console infrastructure for Xen console
On Mon, Apr 23, 2007 at 02:56:54PM -0700, Jeremy Fitzhardinge wrote: Implement a Xen back-end for hvc console. From: Gerd Hoffmann [EMAIL PROTECTED] Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
---
 arch/i386/xen/Kconfig     |    1
 arch/i386/xen/events.c    |    3 -
 drivers/Makefile          |    3 +
 drivers/xen/Makefile      |    1
 drivers/xen/hvc-console.c |  134 +
 include/xen/events.h      |    1
 6 files changed, 142 insertions(+), 1 deletion(-)

If you move the driver to drivers/char/hvc_xen.c instead, you won't have to do...

+#include "../char/hvc_console.h"

...this. Other single-platform backend hvc drivers are under drivers/char already. -Olof
Re: [PATCH 03/25] xen: Add nosegneg capability to the vsyscall page notes
Roland McGrath wrote: + * It should contain: + * hwcap 0 nosegneg + * to match the mapping of bit to name that we give here. This needs to be hwcap 0 nosegneg to match: +NOTE_KERNELCAP_BEGIN(1, 2) +NOTE_KERNELCAP(1, nosegneg) +NOTE_KERNELCAP_END The actual bits you are using should be fine. (You're intentionally skipping bit 0 to work around old glibc bugs, which you might want to add to the comments. Also a comment, or perhaps using 1<<1 syntax, would make it more clear that 2 is the bit mask containing bit 1, and that's why it has to be 2, and not because of some other magical property of 2.) But if kernel packagers don't write the matching bit number in their ld.so.conf.d files, then ld.so.cache lookups won't work right. I have to admit I still don't really understand all this. Is it documented somewhere? What does hwcap 0 nosegneg actually mean? What does the 0 mean here? In the ELF note, what does the nosegneg string mean? How is it used? Is it compared to the nosegneg in ld.so.conf? How does this relate to the bitfields? Thanks, J
Re: Update the list information for kexec and kdump
Simon Horman [EMAIL PROTECTED] writes: On Mon, Apr 23, 2007 at 12:04:01PM -0600, Eric W. Biederman wrote: Simon Horman [EMAIL PROTECTED] writes: Update the list information for kexec and kdump Signed-off-by: Simon Horman [EMAIL PROTECTED] --- Is it too early for this change? It looks like the new list is working, and isn't likely to get overwhelmed with spam. I don't know if everyone has switched over yet but we can certainly update MAINTAINERS. Last time I checked there were 28 people in the kexec@ list. This isn't everyone, but it is getting there. May I add an Acked-by from you? Sure. Eric
Re: ChunkFS - measuring cross-chunk references
On Mon, 23 Apr 2007, Amit Gud wrote: On Mon, 23 Apr 2007, Arjan van de Ven wrote: The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow does it? I'd think it needs a chunk space number and a 32 bit local inode number ;) (same for blocks) For inodes, yes, either a 64-bit inode or some field for the chunk id in which the inode is. But for block numbers, you don't. Because individual chunks manage part of the whole file system in an independent way. They have their block bitmaps starting at an offset. Inode bitmaps, however, remain the same. In that sense, we also can do away without having a chunk identifier encoded into the inode number and chunkfs would still be fine with it. But we will then lose the inode uniqueness property, which could well be OK as it is with other file systems in which the inode number is not sufficient for unique identification of an inode. AG -- May the source be with you. http://www.cis.ksu.edu/~gud
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
The only reason for using threads here is to get the error recovery out of an interrupt context (where errors may be detected), and then, an hour later, decrement a counter (which is how we limit these to 6 per hour). Thread reaping is trivial, the thread just exits after an hour. In addition, it should be a thread and not done from within keventd because : - It can take a long time (well, relatively but still too long for a work queue) - The driver callbacks might need to use keventd or do flush_workqueue to synchronize with their own workqueues when doing an internal recovery. Since these events are rare, I've no particular concern about performance or resource consumption. The current code seems to work just fine. :-) I think moving to kthread's is cleaner (just a wrapper around kernel threads that simplifies dealing with reaping them, mostly) and I agree with Christoph that it would be nice to be able to fire off kthreads from interrupt context.. in many cases, we abuse work queues for things that should really be done from kthreads instead (basically anything that takes more than a couple hundred microsecs or so). Ben.
Re: [PATCH 22/25] xen: xen-netfront: use skb.cb for storing private data
On Mon, Apr 23, 2007 at 02:57:00PM -0700, Jeremy Fitzhardinge wrote: Netfront's use of nh.raw and h.raw for storing page+offset is a bit hinky, and it breaks with upcoming network stack updates which reduce these fields to sub-pointer sizes. Fortunately, skb offers the cb field specifically for stashing this kind of info, so use it. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Cc: Herbert Xu [EMAIL PROTECTED] Cc: Chris Wright [EMAIL PROTECTED] Cc: Christian Limpach [EMAIL PROTECTED] Thanks Jeremy. The patch looks good. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
PROBLEM: Oops: 0002 [1] SMP
[1] Summary: Kernel Reports Oops: 0002 [1] SMP and the system becomes unstable

[2] Full Description: Sometimes, randomly, I get this Oops message and the system becomes unstable. By unstable I mean all applications segfault when I execute them (after the Oops). Sometimes X crashes, sometimes the machine just reboots (the reboot might be another problem though). This happens with kernel 2.6.20 and with 2.6.21-rc7. It was happening with 2.6.20, so I tried 2.6.21-rc7, and this also happens there.

[EMAIL PROTECTED]:/var/log$ uname -a
Linux sayao-desktop 2.6.21-rc7-sayao #2 SMP Mon Apr 16 22:11:36 BRT 2007 x86_64 GNU/Linux

Here is the log:

Apr 22 21:44:33 sayao-desktop kernel: [18641.553890] Unable to handle kernel paging request at 3e82 RIP:
Apr 22 21:44:33 sayao-desktop kernel: [18641.553899] [__alloc_skb+188/321] __alloc_skb+0xbc/0x141
Apr 22 21:44:33 sayao-desktop kernel: [18641.553911] PGD 203027 PUD 0
Apr 22 21:44:33 sayao-desktop kernel: [18641.553915] Oops: 0002 [1] SMP
Apr 22 21:44:33 sayao-desktop kernel: [18641.553919] CPU 0
Apr 22 21:44:33 sayao-desktop kernel: [18641.553922] Modules linked in: binfmt_misc rfcomm l2cap bluetooth i915 drm ppdev capability commoncap acpi_cpufreq cpufreq_userspace cpufreq_stats cpufreq_conservative cpufreq_ondemand cpufreq_powersave freq_table asus_acpi container sbs i2c_ec i2c_core battery video dock ac button ipv6 lp fuse snd_hda_intel snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq psmouse snd_timer snd_seq_device snd parport_pc parport shpchp serio_raw pcspkr soundcore snd_page_alloc iTCO_wdt iTCO_vendor_support pci_hotplug intel_agp af_packet evdev tsdev ext3 jbd mbcache sg ide_cd cdrom sd_mod ata_generic usbhid hid ata_piix libata scsi_mod e100 mii ehci_hcd generic piix uhci_hcd usbcore thermal processor fan
Apr 22 21:44:33 sayao-desktop kernel: [18641.553997] Pid: 13805, comm: evolution Not tainted 2.6.21-rc7-sayao #2
Apr 22 21:44:33 sayao-desktop kernel: [18641.554001] RIP: 0010:[__alloc_skb+188/321] [__alloc_skb+188/321] __alloc_skb+0xbc/0x141
Apr 22 21:44:33 sayao-desktop kernel: [18641.554007] RSP: 0018:810033169bd8 EFLAGS: 00010246
Apr 22 21:44:33 sayao-desktop kernel: [18641.554011] RAX: 3e82 RBX: 0002 RCX:
Apr 22 21:44:33 sayao-desktop kernel: [18641.554014] RDX: RSI: RDI: 81003b1bfa50
Apr 22 21:44:33 sayao-desktop kernel: [18641.554017] RBP: 3e80 R08: 0002 R09:
Apr 22 21:44:33 sayao-desktop kernel: [18641.554021] R10: 81003b1bf980 R11: 00d0 R12: 81003b1bf980
Apr 22 21:44:33 sayao-desktop kernel: [18641.554024] R13: 81003f2109c0 R14: 04d0 R15: 3e80
Apr 22 21:44:33 sayao-desktop kernel: [18641.554028] FS: 2ad334669ea0() GS:8052f000() knlGS:
Apr 22 21:44:33 sayao-desktop kernel: [18641.554032] CS: 0010 DS: ES: CR0: 80050033
Apr 22 21:44:33 sayao-desktop kernel: [18641.554035] CR2: 3e82 CR3: 27bb3000 CR4: 06e0
Apr 22 21:44:33 sayao-desktop kernel: [18641.554039] Process evolution (pid: 13805, threadinfo 810033168000, task 810009665000)
Apr 22 21:44:33 sayao-desktop kernel: [18641.554042] Stack: 09665000 81002f9f5080 3e80
Apr 22 21:44:33 sayao-desktop kernel: [18641.554049] 04d0 810033169ce4 3e80 803a6d82
Apr 22 21:44:33 sayao-desktop kernel: [18641.554055] 0206 80507110 81dadc50
Apr 22 21:44:33 sayao-desktop kernel: [18641.554061] Call Trace:
Apr 22 21:44:33 sayao-desktop kernel: [18641.554086] [sock_alloc_send_skb+130/478] sock_alloc_send_skb+0x82/0x1de
Apr 22 21:44:33 sayao-desktop kernel: [18641.554126] [unix_stream_sendmsg+392/880] unix_stream_sendmsg+0x188/0x370
Apr 22 21:44:33 sayao-desktop kernel: [18641.554181] [sock_aio_write+293/313] sock_aio_write+0x125/0x139
Apr 22 21:44:33 sayao-desktop kernel: [18641.554247] [do_sync_write+207/277] do_sync_write+0xcf/0x115
Apr 22 21:44:33 sayao-desktop kernel: [18641.554287] [autoremove_wake_function+0/48] autoremove_wake_function+0x0/0x30
Apr 22 21:44:33 sayao-desktop kernel: [18641.554352] [vfs_write+228/348] vfs_write+0xe4/0x15c
Apr 22 21:44:33 sayao-desktop kernel: [18641.554369] [sys_write+69/121] sys_write+0x45/0x79
Apr 22 21:44:33 sayao-desktop kernel: [18641.554393] [system_call+126/131] system_call+0x7e/0x83
Apr 22 21:44:33 sayao-desktop kernel: [18641.554434]
Apr 22 21:44:33 sayao-desktop kernel: [18641.554436]
Apr 22 21:44:33 sayao-desktop kernel: [18641.554437] Code: c7 00 01 00 00 00 66 c7 40 04 00 00 66 c7 40 06 00 00 66 c7
Apr 22 21:44:33 sayao-desktop kernel: [18641.554453] RIP [__alloc_skb+188/321] __alloc_skb+0xbc/0x141
Apr 22 21:44:33 sayao-desktop kernel: [18641.554458] RSP 810033169bd8
Apr 22
Re: [report] renicing X, cfs-v5 vs sd-0.46
On Monday 23 April 2007, Niel Lambrechts wrote: Gene Heskett wrote: This message prompted me to do some checking in re context switches myself, and I've come to the conclusion that there could be a bug in vmstat itself. Perhaps, perhaps not. :) Run singly, the context switching is reasonable even for a -19 niceness of X; it's only showing about 200 or so on the first loop of vmstat. But throw in the -n 1 arguments and it goes crazy on the second and subsequent loops. man vmstat: The first report produced gives averages since the last reboot. Additional reports give information on a sampling period of length delay. I missed that, concentrating on finding the method of telling it the delay I guess. So then the next question is, over what period is that obviously lower figure being averaged? Certainly not over a 1 second period, else it would then be much higher, as seen by the figures after the initial delay. The time slice spec'd in /proc/sys/kernel/sched_granularity_ns, which here is currently 500 or 5 milliseconds? If that were the case, the first answer would be in the area of 15, not 200. So educate me, off list if you would like and have the time. Thanks Niel. -- Cheers, Gene There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. -Ed Howdershelt (Author) Sweet sixteen is beautiful Bess, And her voice is changing -- from No to Yes.
Re: [PATCH] mm: PageLRU can be non-atomic bit operation
At 22:42 07/04/23, Hugh Dickins wrote: On Mon, 23 Apr 2007, Hisashi Hifumi wrote: No. The PG_lru flag bit is just one bit amongst many others: what of concurrent operations changing other bits in that same unsigned long e.g. trying to lock the page by setting PG_locked? There are some places where such micro-optimizations can be made (typically while first allocating the page); but in general, no. In i386 and x86_64, btsl is used to change a page flag. In this case, if btsl without the lock prefix sets the PG_locked and PG_lru flags concurrently, does only one operation succeed? That's right: on an SMP machine, without the lock prefix, the operation is no longer atomic: what's stored back may be missing the result of one or the other of the racing operations. In the case of changing the same bit concurrently, a lock prefix or other spinlock is needed. But, I think that a concurrent bit operation on different bits is just like an OR operation, so a lock prefix is not needed. The AMD instruction manual says about bts: Copies a bit, specified by bit index in a register or 8-bit immediate value (second operand), from a bit string (first operand), also called the bit base, to the carry flag (CF) of the rFLAGS register, and then sets the bit in the bit string to 1. The BTS instruction is a read-modify-write instruction at bit granularity. So a concurrent bit operation on different bits may be possible.
Re: [PATCH] lazy freeing of memory through MADV_FREE
This should fix the MADV_FREE code for PPC's hashed tlb. Signed-off-by: Rik van Riel [EMAIL PROTECTED] --- Nick Piggin wrote: Nick Piggin wrote: 3) because of this, we can treat any such accesses as happening simultaneously with the MADV_FREE and as illegal, aka undefined behaviour territory and we do not need to worry about them Yes, but I'm wondering if it is legal in all architectures. It's similar to trying to access memory during an munmap. You may be able to for a short time, but it'll come back to haunt you. The question is whether the architecture specific tlb flushing code will break or not. I guess we'll need to call tlb_remove_tlb_entry() inside the MADV_FREE code to keep powerpc happy. Thanks for pointing this one out. Even then we do. Each invocation of zap_pte_range() only touches one page table page, and it flushes the TLB before releasing the page table lock. What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS. Oh dear. I see it now... The tlb end things inside zap_pte_range() are actually noops and the actual tlb flush only happens inside zap_page_range(). I guess the fact that munmap gets the mmap_sem for writing should save us, though... -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.

--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 			}
 			ptep_test_and_clear_dirty(vma, addr, pte);
 			ptep_test_and_clear_young(vma, addr, pte);
+			tlb_remove_tlb_entry(tlb, pte, addr);
 			SetPageLazyFree(page);
 			if (PageActive(page))
 				deactivate_tail_page(page);
Re: [REPORT] cfs-v4 vs sd-0.44
On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote: I don't know if we've discussed this or not. Since both CFS and SD claim to be fair, I'd like to hear more opinions on the fairness aspect of these designs. In areas such as OS, networking, and real-time, fairness, and its more general form, proportional fairness, are well-defined terms. In fact, perfect fairness is not feasible since it requires all runnable threads to be running simultaneously and scheduled with infinitesimally small quanta (like a fluid system). So to evaluate if a Unfortunately, fairness is rather non-formal in this context and probably isn't strictly desirable given how hacky much of Linux userspace is. Until there's a method of doing directed yields (like what Will has described: a kind of allotment given to a thread doing work for another), a completely strict mechanism is probably problematic with regard to corner cases. X for example is largely non-thread-safe. Until they can get their xcb framework in place and additional thread infrastructure to do hand-off properly, it's going to be difficult to schedule for it. It's well known to be problematic. You announced your scheduler without CCing any of the relevant people here (and risk being completely ignored in lkml traffic): http://lkml.org/lkml/2007/4/20/286 What is your opinion of both CFS and SD? How can your work be useful to either scheduler mentioned, or to the Linux kernel on its own? I understand that via experiments we can show a design is reasonably fair in the common case, but IMHO, to claim that a design is fair, there needs to be some kind of formal analysis on the fairness bound, and this bound should be proven to be constant. Even if the bound is not constant, at least this analysis can help us better understand and predict the degree of fairness that users would experience (e.g., would the system be less fair if the number of threads increases?
What happens if a large number of threads dynamically join and leave the system?). Will has been thinking about this, but you have to also consider the practicalities of your approach versus Con's and Ingo's. I'm all for things like proportional scheduling and the extensions needed to do it properly. It would be highly relevant to some version of the -rt patch if not that patch directly. bill
Re: AppArmor FAQ
Crispin Cowan wrote: David Wagner wrote: James Morris wrote: [...] you can change the behavior of the application and then bypass policy entirely by utilizing any mechanism other than direct filesystem access: IPC, shared memory, Unix domain sockets, local IP networking, remote networking etc. [...] Just look at their code and their own description of AppArmor. My gosh, you're right. What the heck? With all due respect to the developers of AppArmor, I can't help thinking that that's pretty lame. I think this raises substantial questions about the value of AppArmor. What is the point of having a jail if it leaves gaping holes that malicious code could use to escape? And why isn't this documented clearly, with the implications fully explained? I would like to hear the AppArmor developers defend this design decision. It was a simplicity trade off at the time, when AppArmor was mostly aimed at servers, and there was no HAL or DBUS. Now it is definitely a limitation that we are addressing. We are working on a mediation system for what kind of IPC a confined process can do http://forge.novell.com/pipermail/apparmor-dev/2007-April/000503.html Except servers use IPC and need this access control as well. Without IPC and network restrictions you can't protect database servers, ldap servers, print servers, ssh agents, virus scanning servers, spam scanning servers, etc from attackers with knowledge of how to abuse the IPC. When our IPC mediation system is code instead of vapor, it will also appear here for review. Meanwhile, AppArmor does not make IPC security any worse, confined processes are still subject to the usual Linux IPC restrictions. AppArmor actually makes the IPC situation somewhat more secure than stock Linux, e.g. normal DBUS deployment can be controlled through file access permissions. But we are not claiming AppArmor to be an IPC security enhancement, yet. 
Without a security interface in DBUS similar to SELinux's, AppArmor won't be able to control who can talk to whom across DBUS, only who can connect to DBUS directly.
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
Benjamin Herrenschmidt [EMAIL PROTECTED] writes: The only reason for using threads here is to get the error recovery out of an interrupt context (where errors may be detected), and then, an hour later, decrement a counter (which is how we limit these to 6 per hour). Thread reaping is trivial, the thread just exits after an hour. In addition, it should be a thread and not done from within keventd because: - It can take a long time (well, relatively, but still too long for a work queue) - The driver callbacks might need to use keventd or do flush_workqueue to synchronize with their own workqueues when doing an internal recovery. Since these events are rare, I've no particular concern about performance or resource consumption. The current code seems to work just fine. :-) I think moving to kthreads is cleaner (just a wrapper around kernel threads that mostly simplifies dealing with reaping them) and I agree with Christoph that it would be nice to be able to fire off kthreads from interrupt context.. in many cases, we abuse work queues for things that should really be done from kthreads instead (basically anything that takes more than a couple hundred microsecs or so). On that note, does anyone have a problem if we manage the irq-spawning-safe kthreads the same way that we manage the work queue entries, i.e. by a structure allocated by the caller? Eric
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote: This should fix the MADV_FREE code for PPC's hashed tlb. Signed-off-by: Rik van Riel [EMAIL PROTECTED] --- Nick Piggin wrote: Nick Piggin wrote: 3) because of this, we can treat any such accesses as happening simultaneously with the MADV_FREE and as illegal, aka undefined behaviour territory, and we do not need to worry about them Yes, but I'm wondering if it is legal in all architectures. It's similar to trying to access memory during an munmap. You may be able to for a short time, but it'll come back to haunt you. The question is whether the architecture specific tlb flushing code will break or not. I guess we'll need to call tlb_remove_tlb_entry() inside the MADV_FREE code to keep powerpc happy. Thanks for pointing this one out. Even then we do. Each invocation of zap_pte_range() only touches one page table page, and it flushes the TLB before releasing the page table lock. What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS. Oh dear. I see it now... The tlb end things inside zap_pte_range() are actually noops and the actual tlb flush only happens inside zap_page_range(). I guess the fact that munmap gets the mmap_sem for writing should save us, though... What about an unmap_mapping_range, or another MADV_FREE or MADV_DONTNEED?

--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 		}
 		ptep_test_and_clear_dirty(vma, addr, pte);
 		ptep_test_and_clear_young(vma, addr, pte);
+		tlb_remove_tlb_entry(tlb, pte, addr);
 		SetPageLazyFree(page);
 		if (PageActive(page))
 			deactivate_tail_page(page);

-- SUSE Labs, Novell Inc.
Re: Remove open coded implementations of memclear_highpage flush
On 4/24/07, Christoph Lameter [EMAIL PROTECTED] wrote: There are a series of open coded reimplementations of memclear_highpage_flush all over the page cache code. Call memclear_highpage_flush in those locations. Consolidates code and eases maintenance. If I remember right, a very similar patchset was recently submitted that Andrew merged in -mm(?). It also renamed memclear_highpage_flush to something like zero_user_page (though I wonder how good a name that is, considering it takes an offset and not the whole page) and deprecated the old name.
Re: Remove open coded implementations of memclear_highpage flush
On Tue, 24 Apr 2007 07:49:45 +0530 Satyam Sharma [EMAIL PROTECTED] wrote: On 4/24/07, Christoph Lameter [EMAIL PROTECTED] wrote: There are a series of open coded reimplementations of memclear_highpage_flush all over the page cache code. Call memclear_highpage_flush in those locations. Consolidates code and eases maintenance. If I remember right, a very similar patchset was recently submitted that Andrew merged in -mm(?). yup. It also renamed memclear_highpage_flush to something like zero_user_page (though I wonder how good a name that is, considering it takes an offset and not the whole page) It's not a great name, but the fact that you must provide it with `offset' and `length' arguments rather clears up any confusion ;)
Re: Remove open coded implementations of memclear_highpage flush
On Tue, 24 Apr 2007, Satyam Sharma wrote: On 4/24/07, Christoph Lameter [EMAIL PROTECTED] wrote: There are a series of open coded reimplementations of memclear_highpage_flush all over the page cache code. Call memclear_highpage_flush in those locations. Consolidates code and eases maintenance. If I remember right, a very similar patchset was recently submitted that Andrew merged in -mm(?). It also renamed memclear_highpage_flush to something like zero_user_page (though I wonder how good a name that is, considering it takes an offset and not the whole page) and deprecated the old name. My latest tree from Andrew does not have any of this. URL of patch?
Re: [PATCH] mm: PageLRU can be non-atomic bit operation
On Tue, 24 Apr 2007 10:54:27 +0900 Hisashi Hifumi [EMAIL PROTECTED] wrote: In the case of changing the same bit concurrently, a lock prefix or other spinlock is needed. But, I think that a concurrent bit operation on different bits is just like an OR operation, so a lock prefix is not needed. The AMD instruction manual says about bts that: Copies a bit, specified by bit index in a register or 8-bit immediate value (second operand), from a bit string (first operand), also called the bit base, to the carry flag (CF) of the rFLAGS register, and then sets the bit in the bit string to 1. The BTS instruction is a read-modify-write instruction on a bit unit. So a concurrent bit operation on different bits may be possible. This is ia64's __set_bit() hehe..
==
static __inline__ void
__set_bit (int nr, volatile void *addr)
{
	*((__u32 *) addr + (nr >> 5)) |= (1 << (nr & 31));
}
==
Bye. -Kame
Re: [ANNOUNCE][PATCH] Kcli - Kernel command line interface.
Hi Matt, On 4/24/07, Matt Ranon [EMAIL PROTECTED] wrote: The obvious question is: what's _wrong_ with doing all this in some cut-down userspace environment like busybox? Why is this stuff better? Obviously some embedded developers have considered that option and have rejected it. But we do need to be told, at length, why that decision was made. There is nothing _wrong_ with doing it all in a cut-down userspace. It is a matter of personal preference, culture, and the application. That is what makes Linux so great, it is all about choice. We are developing devices that don't have a user space, and we don't see the point in including one just for debug purposes. We will not be offended if Kcli is not included into the kernel mainline, nor if Kcli compels people to call us stupid (as it already has) just because we are different and some people don't understand us. We are firm believers that the world, including the Linux kernel world, would be a nasty place if there was only _one_ way to do any given task. Additionally, we are almost certain that there will be others who think like we do, so we are reaching out to them. We also feel compelled to give _something_ back to the community that has given so much to us, and, for now, this is all we have. I'm afraid you might've misunderstood the (rather caustic, sometimes) general nature of comments on lkml :-) But I guess you only have everything to gain if you use features that have been developed (and are being *maintained* in the current kernel) that already do the kind of stuff you want done. 
You might have your reasons for being so anxious to avoid any userspace at all, but quoting famous words, continuing to maintain Kcli out-of-tree could soon turn out to be an act of self-flagellation for you :-)
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
On Mon, 2007-04-23 at 20:08 -0600, Eric W. Biederman wrote: Benjamin Herrenschmidt [EMAIL PROTECTED] writes: The only reason for using threads here is to get the error recovery out of an interrupt context (where errors may be detected), and then, an hour later, decrement a counter (which is how we limit these to 6 per hour). Thread reaping is trivial, the thread just exits after an hour. In addition, it should be a thread and not done from within keventd because: - It can take a long time (well, relatively, but still too long for a work queue) - The driver callbacks might need to use keventd or do flush_workqueue to synchronize with their own workqueues when doing an internal recovery. Since these events are rare, I've no particular concern about performance or resource consumption. The current code seems to work just fine. :-) I think moving to kthreads is cleaner (just a wrapper around kernel threads that mostly simplifies dealing with reaping them) and I agree with Christoph that it would be nice to be able to fire off kthreads from interrupt context.. in many cases, we abuse work queues for things that should really be done from kthreads instead (basically anything that takes more than a couple hundred microsecs or so). On that note, does anyone have a problem if we manage the irq-spawning-safe kthreads the same way that we manage the work queue entries, i.e. by a structure allocated by the caller? Not sure... I can see places where I might want to spawn an arbitrary number of these without having to preallocate structures... and if I allocate on the fly, then I need a way to free that structure when the kthread is reaped, which I don't think we have currently, do we? (In fact, I could use that for other things too now that I'm thinking of it ... I might have a go at providing optional kthread destructors). Ben. 
Re: [PATCH] change kernel threads to ignore signals instead of blocking them
On Fri, 13 Apr 2007 11:31:16 +0400 Oleg Nesterov [EMAIL PROTECTED] wrote: On top of Eric's kthread-dont-depend-on-work-queues-take-2.patch Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against signals. This doesn't prevent the signal delivery, this only blocks signal_wake_up(). Every killall -33 kthreadd means a struct siginfo leak. Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking them (make a new helper ignore_signals() for that). If the kernel thread needs some signal, it should use allow_signal() anyway, and in that case it should not use CLONE_SIGHAND. Note that we can't change daemonize() (should die!) in the same way, because it can be used along with CLONE_SIGHAND. This means that allow_signal() still should unblock the signal to work correctly with daemonize()ed threads. However, disallow_signal() doesn't block the signal any longer but ignores it. NOTE: with or without this patch the kernel threads are not protected from handle_stop_signal(), this seems harmless, but not good. I'm seeing 500 zombied instances of khelper (from udev startup). It only happens when the utrace patches are applied. Presumably an interaction between utrace and one of these kthread changes. I'll drop utrace for now. I don't think it's getting much help from being in -mm at present and it's getting increasingly painful to keep it merged against all the other stuff which is happening. Roland, I'll squirt all the extra utrace patches which I have in your direction. Please merge them or hang on to them for later on.
Re: [PATCH] mm: PageLRU can be non-atomic bit operation
Hisashi Hifumi wrote: At 22:42 07/04/23, Hugh Dickins wrote: On Mon, 23 Apr 2007, Hisashi Hifumi wrote: No. The PG_lru flag bit is just one bit amongst many others: what of concurrent operations changing other bits in that same unsigned long, e.g. trying to lock the page by setting PG_locked? There are some places where such micro-optimizations can be made (typically while first allocating the page); but in general, no. In i386 and x86_64, btsl is used to change page flags. In this case, if btsl without the lock prefix sets the PG_locked and PG_lru flags concurrently, does only one operation succeed? That's right: on an SMP machine, without the lock prefix, the operation is no longer atomic: what's stored back may be missing the result of one or the other of the racing operations. In the case of changing the same bit concurrently, a lock prefix or other spinlock is needed. But, I think that a concurrent bit operation on different bits is just like an OR operation, so a lock prefix is not needed. The AMD instruction manual says about bts that: Copies a bit, specified by bit index in a register or 8-bit immediate value (second operand), from a bit string (first operand), also called the bit base, to the carry flag (CF) of the rFLAGS register, and then sets the bit in the bit string to 1. The BTS instruction is a read-modify-write instruction on a bit unit. So a concurrent bit operation on different bits may be possible. No matter what actual instruction is used, the SetPageLRU operation (i.e. without the double underscore prefix) must be atomic, and the __SetPageLRU operation *can* be non-atomic if that would be faster. As Hugh points out, we must have atomic ops here, so changing the generic code to use the __ version is wrong. However if there is a faster way that i386 can perform the atomic variant, then doing so will speed up the generic code without breaking other architectures. -- SUSE Labs, Novell Inc. 
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote: What the tlb flush used to be able to assume is that the page has been removed from the pagetables when they are put in the tlb flush batch. I think this is still the case, to a degree. There should be no harm in removing the TLB entries after the page table has been unlocked, right? Or is something like the attached really needed? From what I can see, the page table lock should be enough synchronization between unmap_mapping_range, MADV_FREE and MADV_DONTNEED. I don't see why we need the attached, but in case you find a good reason, here's my signed-off-by line for Andrew :) Signed-off-by: Rik van Riel [EMAIL PROTECTED] -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.

--- linux-2.6.20.x86_64/mm/memory.c.flushme	2007-04-23 22:26:06.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 22:42:06.0 -0400
@@ -628,6 +628,7 @@ static unsigned long zap_pte_range(struc
 				long *zap_work, struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
+	unsigned long start_addr = addr;
 	pte_t *pte;
 	spinlock_t *ptl;
 	int file_rss = 0;
@@ -726,6 +727,11 @@ static unsigned long zap_pte_range(struc
 	add_mm_rss(mm, file_rss, anon_rss);
 	arch_leave_lazy_mmu_mode();
+	if (details && details->madv_free) {
+		/* Protect against MADV_DONTNEED or unmap_mapping_range */
+		tlb_finish_mmu(tlb, start_addr, addr);
+		tlb = tlb_gather_mmu(mm, 0);
+	}
 	pte_unmap_unlock(pte - 1, ptl);
 	return addr;
Re: Remove open coded implementations of memclear_highpage flush
On 4/24/07, Christoph Lameter [EMAIL PROTECTED] wrote: On Tue, 24 Apr 2007, Satyam Sharma wrote: If I remember right, a very similar patchset was recently submitted that Andrew merged in -mm(?). It also renamed memclear_highpage_flush to something like zero_user_page (though I wonder how good a name that is considering it takes an offset and not the whole page) and deprecated the old name. My latest tree from Andrew does not have any of this. URL of patch? fs-deprecate-memclear_highpage_flush.patch (and friends, search for zero_user_page) in ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-04-11-02-24.tar.gz
Re: [mmc] alternative TI FM MMC/SD driver for 2.6.21-rc7
I am not in any way arguing that your driver architecture is wrong or that you should change anything. My point was simple. [tifm_sd] can only work with [tifm_7xx1]. If you add support for, let's say, [tifm_8xx2] in the future, which would have port offsets different than [tifm_7xx1], you would also need completely new modules for slots (sd, ms, etc). Does this not constitute unbounded speculation? And then, what would you propose to do with adapters that have SD support disabled? There are quite a few of those in the wild, as of right now (SD support is provided by bundled SDHCI on such systems, if at all). A similar argument goes for other media types as well - many controllers have xD support disabled too (I think you have one of those - Sony really values its customers). After all, it is not healthy to have dead code in the kernel. On the other hand, if TI puts out a controller which is functionally identical, but has a different register map, it wouldn't be hard to refactor the code.
Re: [PATCH 10/10] mm: per device dirty threshold
On Friday April 20, [EMAIL PROTECTED] wrote: Scale writeback cache per backing device, proportional to its writeout speed. So it works like this: We account for writeout in full pages. When a page has the Writeback flag cleared, we account that as a successfully retired write for the relevant bdi. By using floating averages we keep track of how many writes each bdi has retired 'recently', where the unit of time in which we understand 'recently' is a single page written. We keep a floating average for each bdi, and a floating average for the total writeouts (that 'average' is, of course, 1.) Using these numbers we can calculate what fraction of 'recently' retired writes were retired by each bdi (get_writeout_scale). Multiplying this fraction by the system-wide number of pages that are allowed to be dirty before write-throttling, we get the number of pages that the bdi can have dirty before write-throttling the bdi. I note that the same fraction is *not* applied to background_thresh. Should it be? I guess not - there would be interesting starting transients, as a bdi which had done no writeout would not be allowed any dirty pages, so background writeout would start immediately, which isn't what you want... or is it? For each bdi we also track the number of (dirty, writeback, unstable) pages and do not allow this to exceed the limit set for this bdi. The calculations involving 'reserve' in get_dirty_limits are a little confusing. It looks like you are calculating how much total head-room there is for the bdi (pages that the system can still dirty - pages this bdi has dirty) and making sure the number returned in pbdi_dirty doesn't allow more than that to be used. This is probably a reasonable thing to do but it doesn't feel like the right place. I think get_dirty_limits should return the raw threshold, and balance_dirty_pages should do both tests - the bdi-local test and the system-wide test. 
Currently you have a rather odd situation where

+		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+			break;

might include numbers obtained with bdi_stat_sum being compared with numbers obtained with bdi_stat. With these patches, the VM still (I think) assumes that each BDI has a reasonable queue limit, so that writeback_inodes will block on a full queue. If a BDI has a very large queue, balance_dirty_pages will simply turn lots of DIRTY pages into WRITEBACK pages and then think We've done our duty without actually blocking at all. With the extra accounting that we now have, I would like to see balance_dirty_pages wait until RECLAIMABLE+WRITEBACK is actually less than 'threshold'. This would probably mean that we would need to support per-bdi background_writeout to smooth things out. Maybe that is fodder for another patch-set. You set:

+	vm_cycle_shift = 1 + ilog2(vm_total_pages);

Can you explain that? My experience is that scaling dirty limits with main memory isn't what we really want. When you get machines with very large memory, the amount that you want to be dirty is more a function of the speed of your IO devices, rather than the amount of memory, otherwise you can sometimes see large filesystem lags ('sync' taking minutes?) I wonder if it makes sense to try to limit the dirty data for a bdi to the amount that it can write out in some period of time - maybe 3 seconds. Probably configurable. You seem to have almost all the infrastructure in place to do that, and I think it could be a valuable feature. At least, I think vm_cycle_shift should be tied (loosely) to dirty_ratio * vm_total_pages ?? On the whole, looks good! Thanks, NeilBrown
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Mon, 23 Apr 2007 22:53:49 -0400 Rik van Riel [EMAIL PROTECTED] wrote: I don't see why we need the attached, but in case you find a good reason, here's my signed-off-by line for Andrew :) Andrew is in a defensive crouch trying to work his way through all the bugs he's been sent. After I've managed to release 2.6.21-rc7-mm1 (say, December) I expect I'll drop the MADV_FREE stuff and give you a run at creating a new patch series.
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
Benjamin Herrenschmidt [EMAIL PROTECTED] writes: Not sure... I can see places where I might want to spawn an arbitrary number of these without having to preallocate structures... and if I allocate on the fly, then I need a way to free that structure when the kthread is reaped, which I don't think we have currently, do we? (In fact, I could use that for other things too now that I'm thinking of it ... I might have a go at providing optional kthread destructors). Well, the basic problem is that for any piece of code that can be modular we need a way to ensure all threads it has running are shut down when we remove the module. Which means a fire-and-forget model, however simple, is unfortunately the wrong thing. Now we might be able to wrap this in some kind of manager construct, so you don't have to manage each thread individually, but we still have the problem of ensuring all of the threads exit when we terminate the module. Further, in general it doesn't make sense to grab a module reference and call that sufficient, because we would like to request that the module exits. Eric
Re: [PATCH 23/25] xen: Lockdep fixes for xen-netfront
Jeremy Fitzhardinge [EMAIL PROTECTED] wrote:

@@ -1212,10 +1212,10 @@ static int netif_poll(struct net_device
 	int pages_flipped = 0;
 	int err;
-	spin_lock(&np->rx_lock);
+	spin_lock_bh(&np->rx_lock);
 	if (unlikely(!netfront_carrier_ok(np))) {
-		spin_unlock(&np->rx_lock);
+		spin_unlock_bh(&np->rx_lock);

You don't need to disable BH in netif_poll since it's always called with BH disabled. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523
Jiri Kosina [EMAIL PROTECTED] wrote: Hmm, *sigh*. I guess the patch below fixes the problem, but it is a masterpiece in the field of ugliness. And I am not sure whether it is completely correct either. Are there any immediate ideas for a better solution with respect to how struct sock locking works? Please cc such patches to netdev. Thanks.

diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 71f5cfb..c5c93cd 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -656,7 +656,10 @@ static int hci_sock_dev_event(struct notifier_block *this, unsigned long event,
 	/* Detach sockets from device */
 	read_lock(&hci_sk_list.lock);
 	sk_for_each(sk, node, &hci_sk_list.head) {
-		lock_sock(sk);
+		if (in_atomic())
+			bh_lock_sock(sk);
+		else
+			lock_sock(sk);

This doesn't do what you think it does. bh_lock_sock can still succeed even with lock_sock held by someone else. Does this need to occur immediately when an event occurs? If not I'd suggest moving this into a workqueue. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [REPORT] cfs-v4 vs sd-0.44
Linus Torvalds wrote: On Mon, 23 Apr 2007, Ingo Molnar wrote: The give scheduler money transaction can be both an implicit transaction (for example when writing to UNIX domain sockets or blocking on a pipe, etc.), or it could be an explicit transaction: sched_yield_to(). This latter I've already implemented for CFS, but it's much less useful than the really significant implicit ones, the ones which will help X. Yes. It would be wonderful to get it working automatically, so please say something about the implementation.. The perfect situation would be that when somebody goes to sleep, any extra points it had could be given to whoever it woke up last. Note that for something like X, it means that the points are 100% ephemeral: it gets points when a client sends it a request, but it would *lose* the points again when it sends the reply! So it would only accumulate scheduling points while multiple clients are actively waiting for it, which actually sounds like exactly the right thing. However, I don't really see how to do it well, especially since the kernel cannot actually match up the client that gave some scheduling points to the reply that X sends back. There are subtle semantics with these kinds of things: especially if the scheduling points are only awarded when a process goes to sleep, if X is busy and continues to use the CPU (for another client), it wouldn't give any scheduling points back to clients and they really do accumulate with the server. Which again sounds like it would be exactly the right thing (both in the sense that the server that runs more gets more points, but also in the sense that we *only* give points at actual scheduling events). But how do you actually *give/track* points? A simple last woken up by this process thing that triggers when it goes to sleep? It might work, but on the other hand, especially with more complex things (and networking tends to be pretty complex) the actual wakeup may be done by a software irq. 
Do we just say it ran within the context of X, so we assume X was the one that caused it? It probably would work, but we've generally tried very hard to avoid accessing current from interrupt context, including bh's. Within reason, it's not the number of clients that X has that causes its CPU bandwidth use to skyrocket and cause problems. It's more to do with what type of clients they are. Most GUIs (even ones that are constantly updating visual data (e.g. gkrellm -- I can open quite a large number of these without increasing X's CPU usage very much)) cause very little load on the X server. The exceptions to this are the various terminal emulators (e.g. xterm, gnome-terminal, etc.) when being used to run output intensive command line programs, e.g. try ls -lR / in an xterm. The other way (that I've noticed) to make X's CPU bandwidth usage skyrocket is to grab a large window and wiggle it about a lot, and hopefully this doesn't happen a lot, so the problem that needs to be addressed is the one caused by text output on xterm and its ilk. So I think that an elaborate scheme for distributing points between X and its clients would be overkill. A good scheduler will make sure other tasks such as audio streamers get CPU when they need it with good responsiveness, even when X takes off, by giving them higher priority because their CPU bandwidth use is low. The one problem that might still be apparent in these cases is the mouse becoming jerky while X is working like crazy to spew out text too fast for anyone to read. But the only way to fix that is to give X more bandwidth, and if it's already running at about 95% of a CPU that's unlikely to help. To fix this you would probably need to modify X so that it knows re-rendering the cursor is more important than rendering text in an xterm. 
In normal circumstances, the re-rendering of the mouse happens quickly enough for the user to experience good responsiveness, because X's normal CPU use is low enough for it to be given high priority. Just because the O(1) scheduler tried this model and failed doesn't mean that the model is bad. O(1) was a flawed implementation of a good model. Peter PS Doing a kernel build in an xterm isn't an example of high enough output to cause a problem, as (on my system) it only raises X's consumption from 0-2% to 2-5%. The type of output that causes the problem is usually flying past too fast to read. -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Permanent Kgdb integration into the kernel - lets get with it. (Dave: How do FreeBSD folks maintain the KGDB stub?)
On Sat, 2007-04-21 at 11:48 +0200, Andi Kleen wrote: Lots of people want kgdb. One person is famously less keen on it, but we'll be able to talk him around, as long as the patches aren't daft. The big question is if the kgdb developers seriously want mainline. At least in the past this definitely wasn't the case. I haven't seen any email from kgdb developers saying they didn't want kgdb to be part of mainline. Do you happen to have any e-mail demonstrating that? It appears to me that: 1. Jason Wessel is putting a lot of effort into that right now. 2. Tom Rini worked hard at this just a few months ago. 3. George Anzinger was working hard at this a year or two ago with the -mm series and was likely disappointed when it wasn't put into the mainline. As I recall, the reason Linus gave was that there were two competing patches and he wanted that to be resolved before integrating it into the mainline. So George worked with Amit at SourceForge over the past year or two and it's now integrated. If they're not open to change requests from mainline reviewers we don't even need to bother to start the whole exercise. What issues are there, or have there been, that you're referring to? Once KGDB is part of KORG, can't its maintenance and support be a kernel-wide responsibility? If someone breaks kgdb, shouldn't that change be backed out until the KORG developers fix the problem? Centralizing the responsibility for KGDB seems like a mistake. I doubt the FreeBSD folks rip out the KGDB support if a kernel hacker breaks KGDB and then leave a group of KGDB developers to sort out the problem. Seems it should be caught as an -mm patch, with Andrew tossing out the patch if it breaks KGDB. Kgdb developers could try to give Andrew a heads-up if this occurs and he didn't notice it. Once KGDB is integrated, the maintenance should be minimal, and changes that break KGDB are likely best addressed by the developer that just broke it. At least that's what I'd think is an optimal approach.
Perhaps Dave O'Brien could tell us how the FreeBSD folks take care of KGDB. Just putting their stuff onto korg isn't enough. Yep, and once it's integrated into korg it should finally become a permanent part of the kernel and, I suspect, be maintained by all kernel developers. New KGDB features could be developed at SourceForge, but maintaining kernel coherence seems like a global responsibility, like running fault injection on your code before checking it in. Maybe I'm totally out to lunch on this; perhaps Dave O'Brien can straighten me out if I'm wrong or if the Linux kernel core responsibility paradigm is incompatible with this. I'd prefer Linux being just as good as NetBSD with debugging support; current presentations like: http://foss.in/2005/slides/netbsd-linux.pdf show our current support as being much worse. Let's fix it. You developed a kgdb proxy for Keith Owens' kdb, and I suspect you would like to have KGDB be part of the kernel mainline as long as it's done well. I doubt anyone would argue with that. Perhaps it's possible to eventually set up KGDB so it can be debugged with kdb. Once KGDB is mainline, there are plenty of issues that can be addressed; for example, taking a kernel core dump after dropping into kgdb and having the registers show up correctly in Dave Anderson's crash utility. -piet -Andi -- Piet Delaney  Phone: (408) 200-5256  Blue Lane Technologies  Fax: (408) 200-5299  10450 Bubb Rd.  Cupertino, Ca. 95014  Email: [EMAIL PROTECTED]
Re: BUG: Null pointer dereference in fs/open.c
This bug occurs in linux-2.6.20 and 2.6.21-rc7-git5, and does not occur in linux-2.6.19-git22. After running pktsetup 0 /dev/hdd, I get (timestamps removed):

pktcdvd: pkt_get_last_written failed
BUG: unable to handle kernel NULL pointer dereference at virtual address 000e
 printing eip:
c0173f69
*pde =
Oops: [#1]
PREEMPT
Modules linked in: snd_ca0106 snd_ac97_codec ac97_bus 8139cp 8139too iTCO_wdt
CPU:    0
EIP:    0060:[c0173f69]    Not tainted VLI
EFLAGS: 00010203   (2.6.21-rc7-git5 #22)
EIP is at do_sys_open+0x59/0xd0
eax: 0002   ebx: 4020   ecx: 0001   edx: 0002
esi: df1e3000   edi: 0003   ebp: de17bfa4   esp: de17bf84
ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0068
Process vol_id (pid: 4273, ti=de17b000 task=df4143f0 task.ti=de17b000)
Stack: c013d2a5 ff9c 0002 c059cea3 bfb6bf64 8000 b7f60ff4 de17bfb0 c017401c
       de17b000 c01041c6 bfb6bf64 8000 8000 b7f60ff4 bfb6a798 0005 007b 007b 0005
Call Trace:
 [c010521a] show_trace_log_lvl+0x1a/0x30
 [c01052d9] show_stack_log_lvl+0xa9/0xd0
 [c010551c] show_registers+0x21c/0x3a0
 [c01057a4] die+0x104/0x260
 [c04c5947] do_page_fault+0x277/0x610
 [c04c408c] error_code+0x74/0x7c
 [c017401c] sys_open+0x1c/0x20
 [c01041c6] sysenter_past_esp+0x5f/0x99
 ===
Code: ff 85 c0 89 c7 78 77 8b 45 08 89 d9 89 f2 89 04 24 8b 45 e8 e8 69 ff ff ff 3d 00 f0 ff ff 89 45 ec 77 71 8b 55 ec bb 20 00 00 40 8b 42 0c 8b 48 30 89 4d f0 0f b7 51 66 81 e2 00 f0 00 00 81 fa
EIP: [c0173f69] do_sys_open+0x59/0xd0 SS:ESP 0068:de17bf84

From fs/open.c, comments added:

/* do_sys_open is consistently called with dfd=0xff9c,
 * filename=/dev/.tmp-254-0, flags=0x8000, mode=0 */
long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
{
	char *tmp = getname(filename);
	int fd = PTR_ERR(tmp);

	if (!IS_ERR(tmp)) {
		fd = get_unused_fd();
		if (fd >= 0) {
			/* do_filp_open consistently returns 2 in this case */
			struct file *f = do_filp_open(dfd, tmp, flags, mode);
			/* IS_ERR always returns 0 for this value */
			if (IS_ERR(f)) {
				put_unused_fd(fd);
				fd = PTR_ERR(f);
			} else {
				/* null pointer dereference occurs here */
				fsnotify_open(f->f_path.dentry);
				fd_install(fd, f);
			}
		}
		putname(tmp);
	}
	return fd;
}

I was able to work around this by testing whether do_filp_open was returning 2 or not, but obviously this is a very temporary workaround for a very specific circumstance. If there is any more information I can provide, let me know.

William Heimbigner [EMAIL PROTECTED]