[PATCH] x86/platform/uv: Include clocksource.h for clocksource_touch_watchdog()
* Andy Lutomirski <l...@kernel.org> wrote: > diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h > index f80d70009ff8..6d7d0e52ed5a 100644 > --- a/arch/x86/include/asm/fixmap.h > +++ b/arch/x86/include/asm/fixmap.h > @@ -19,7 +19,6 @@ > #include > #include > #include > -#include > #ifdef CONFIG_X86_32 > #include > #include So this change triggered a build failure on 64-bit allmodconfig - fixed via the patch below. Your change unearthed a latent bug, a missing header inclusion. Thanks, Ingo > >From d51953b0873358d13b189996e6976dfa12a9b59d Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mi...@kernel.org> Date: Fri, 11 Dec 2015 09:01:30 +0100 Subject: [PATCH] x86/platform/uv: Include clocksource.h for clocksource_touch_watchdog() This build failure triggers on 64-bit allmodconfig: arch/x86/platform/uv/uv_nmi.c:493:2: error: implicit declaration of function ‘clocksource_touch_watchdog’ [-Werror=implicit-function-declaration] which is caused by recent changes exposing a missing clocksource.h include in uv_nmi.c: cc1e24fdb064 x86/vdso: Remove pvclock fixmap machinery this file got clocksource.h indirectly via fixmap.h - that stealth route of header inclusion is now gone. Cc: Borislav Petkov <b...@alien8.de> Cc: H. Peter Anvin <h...@zytor.com> Cc: Linus Torvalds <torva...@linux-foundation.org> Cc: Thomas Gleixner <t...@linutronix.de> Signed-off-by: Ingo Molnar <mi...@kernel.org> --- arch/x86/platform/uv/uv_nmi.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/platform/uv/uv_nmi.c b/arch/x86/platform/uv/uv_nmi.c index 327f21c3bde1..8dd80050d705 100644 --- a/arch/x86/platform/uv/uv_nmi.c +++ b/arch/x86/platform/uv/uv_nmi.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
* Paolo Bonziniwrote: > > > On 10/12/2015 00:12, Andy Lutomirski wrote: > > From: Andy Lutomirski > > > > The pvclock vdso code was too abstracted to understand easily and > > excessively paranoid. Simplify it for a huge speedup. > > > > This opens the door for additional simplifications, as the vdso no > > longer accesses the pvti for any vcpu other than vcpu 0. > > > > Before, vclock_gettime using kvm-clock took about 45ns on my machine. > > With this change, it takes 29ns, which is almost as fast as the pure TSC > > implementation. > > > > Signed-off-by: Andy Lutomirski > > --- > > arch/x86/entry/vdso/vclock_gettime.c | 81 > > > > 1 file changed, 46 insertions(+), 35 deletions(-) > > > > diff --git a/arch/x86/entry/vdso/vclock_gettime.c > > b/arch/x86/entry/vdso/vclock_gettime.c > > index ca94fa649251..c325ba1bdddf 100644 > > --- a/arch/x86/entry/vdso/vclock_gettime.c > > +++ b/arch/x86/entry/vdso/vclock_gettime.c > > @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info > > *get_pvti(int cpu) > > > > static notrace cycle_t vread_pvclock(int *mode) > > { > > - const struct pvclock_vsyscall_time_info *pvti; > > + const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti; > > cycle_t ret; > > - u64 last; > > - u32 version; > > - u8 flags; > > - unsigned cpu, cpu1; > > - > > + u64 tsc, pvti_tsc; > > + u64 last, delta, pvti_system_time; > > + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift; > > > > /* > > -* Note: hypervisor must guarantee that: > > -* 1. cpu ID number maps 1:1 to per-CPU pvclock time info. > > -* 2. that per-CPU pvclock time info is updated if the > > -*underlying CPU changes. > > -* 3. that version is increased whenever underlying CPU > > -*changes. > > +* Note: The kernel and hypervisor must guarantee that cpu ID > > +* number maps 1:1 to per-CPU pvclock time info. > > +* > > +* Because the hypervisor is entirely unaware of guest userspace > > +* preemption, it cannot guarantee that per-CPU pvclock time > > +* info is updated if the underlying CPU changes or that that > > +* version is increased whenever underlying CPU changes. > > * > > +* On KVM, we are guaranteed that pvti updates for any vCPU are > > +* atomic as seen by *all* vCPUs. This is an even stronger > > +* guarantee than we get with a normal seqlock. > > +* > > +* On Xen, we don't appear to have that guarantee, but Xen still > > +* supplies a valid seqlock using the version field. > > + > > +* We only do pvclock vdso timing at all if > > +* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to > > +* mean that all vCPUs have matching pvti and that the TSC is > > +* synced, so we can just look at vCPU 0's pvti. > > */ > > - do { > > - cpu = __getcpu() & VGETCPU_CPU_MASK; > > - /* TODO: We can put vcpu id into higher bits of pvti.version. > > -* This will save a couple of cycles by getting rid of > > -* __getcpu() calls (Gleb). > > -*/ > > - > > - pvti = get_pvti(cpu); > > - > > - version = __pvclock_read_cycles(>pvti, , ); > > - > > - /* > > -* Test we're still on the cpu as well as the version. > > -* We could have been migrated just after the first > > -* vgetcpu but before fetching the version, so we > > -* wouldn't notice a version change. > > -*/ > > - cpu1 = __getcpu() & VGETCPU_CPU_MASK; > > - } while (unlikely(cpu != cpu1 || > > - (pvti->pvti.version & 1) || > > - pvti->pvti.version != version)); > > - > > - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) > > + > > + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) { > > *mode = VCLOCK_NONE; > > + return 0; > > + } > > + > > + do { > > + version = pvti->version; > > + > > + /* This is also a read barrier, so we'll read version first. */ > > + tsc = rdtsc_ordered(); > > + > > + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul; > > + pvti_tsc_shift = pvti->tsc_shift; > > + pvti_system_time = pvti->system_time; > > + pvti_tsc = pvti->tsc_timestamp; > > + > > + /* Make sure that the version double-check is last. */ > > + smp_rmb(); > > + } while (unlikely((version & 1) || version != pvti->version)); > > + > > + delta = tsc - pvti_tsc; > > + ret = pvti_system_time + > > + pvclock_scale_delta(delta, pvti_tsc_to_system_mul, > > + pvti_tsc_shift); > > > > /* refer to tsc.c read_tsc() comment for rationale */ > > last = gtod->cycle_last; > > > > Reviewed-by: Paolo Bonzini Thanks. I've
Re: [GIT PULL 0/5] perf/core improvements and fixes
* Arnaldo Carvalho de Melowrote: > Hi Ingo, > > Please consider pulling, > > - Arnaldo > > The following changes since commit 43e41adc9e8c36545888d78fed2ef8d102a938dc: > > perf record: Add ability to sample call branches (2015-10-20 10:30:55 +0200) > > are available in the git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git > tags/perf-core-for-mingo > > for you to fetch changes up to e3d006ce8180a0c025ce66bdc89bbc125f85be57: > > perf annotate: Add debug message for out of bounds sample (2015-10-21 > 18:12:37 -0300) > > > perf/core improvements and fixes: > > User visible: > > - Print branch filter state with verbose mode (Andi Kleen) > > - Fix core dump caused by per-socket/core system-wide stat (Kan Liang) > > - Update libtraceevent KVM plugin (Paolo Bonzini) > > Developer stuff: > > - Add fixdep to 'tools/build' .gitignore (Yunlong Song) > > Signed-off-by: Arnaldo Carvalho de Melo > > > Andi Kleen (1): > perf evsel: Print branch filter state with -vv > > Arnaldo Carvalho de Melo (1): > perf annotate: Add debug message for out of bounds sample > > Kan Liang (1): > perf cpu_map: Fix core dump caused by per-socket/core system-wide stat > > Paolo Bonzini (1): > tools lib traceevent: update KVM plugin > > Yunlong Song (1): > perf build: Add fixdep to .gitignore > > tools/build/.gitignore| 1 + > tools/lib/traceevent/plugin_kvm.c | 25 + > tools/perf/util/annotate.c| 5 - > tools/perf/util/cpumap.c | 2 +- > tools/perf/util/evsel.c | 1 + > 5 files changed, 24 insertions(+), 10 deletions(-) > create mode 100644 tools/build/.gitignore Pulled, thanks a lot Arnaldo! Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
* Andy Lutomirskiwrote: > > These could still be open coded in an inlined fashion, like the scheduler > > usage. > > We could have a raw_rdmsr for those. > > OTOH, I'm still not 100% convinced that this warn-but-don't-die behavior is > worth the effort. This isn't a frequent source of bugs to my knowledge, and > we > don't try to recover from incorrect cr writes, out-of-bounds MMIO, etc, so do > we > really gain much by rigging a recovery mechanism for rdmsr and wrmsr failures > for code that doesn't use the _safe variants? It's just the general principle really: don't crash the kernel on bootup. There's few things more user hostile than that. Also, this would maintain the status quo: since we now (accidentally) don't crash the kernel on distro kernels (but silently and unsafely ignore the faulting instruction), we should not regress that behavior (by adding the chance to crash again), but improve upon it. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
* Peter Zijlstra <pet...@infradead.org> wrote: > On Mon, Sep 21, 2015 at 09:36:15AM -0700, Linus Torvalds wrote: > > On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mi...@kernel.org> wrote: > > > > > > Linus, what's your preference? > > > > So quite frankly, is there any reason we don't just implement > > native_read_msr() as just > > > >unsigned long long native_read_msr(unsigned int msr) > >{ > > int err; > > unsigned long long val; > > > > val = native_read_msr_safe(msr, ); > > WARN_ON_ONCE(err); > > return val; > >} > > > > Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be > > done with it. I don't see the downside. > > > > How many msr reads are so critical that the function call > > overhead would matter? Get rid of the inline version of the _safe() > > thing too, and put that thing there too. > > There are a few in the perf code, and esp. on cores without a stack engine > the > call overhead is noticeable. Also note that the perf MSRs are generally > optimized MSRs and less slow (we cannot say fast, they're still MSRs) than > regular MSRs. These could still be open coded in an inlined fashion, like the scheduler usage. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
* Denys Vlasenko <dvlas...@redhat.com> wrote: > On 09/28/2015 09:58 AM, Ingo Molnar wrote: > > > > * Denys Vlasenko <dvlas...@redhat.com> wrote: > > > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote: > >>> NAK. We really should map the GDT read-only on all 64 bit systems, > >>> since we can't hide the address from SLDT. Same with the IDT. > >> > >> Sorry, I don't understand your point. > > > > So the problem is that right now the SGDT instruction (which is > > unprivileged) > > leaks the real address of the kernel image: > > > > fomalhaut:~> ./sgdt > > SGDT: 88303fd89000 / 007f > > > > that '88303fd89000' is a kernel address. > > Thank you. > I do know that SGDT and friends are unprivileged on x86 > and thus they allow userspace (and guest kernels in paravirt) > learn things they don't need to know. > > I don't see how making GDT page-aligned and page-sized > changes anything in this regard. SGDT will still work, > and still leak GDT address. Well, as I try to explain it in the other part of my mail, doing so enables us to remap the GDT to a less security sensitive virtual address that does not leak the kernel's randomized address: > > Your observation in the changelog and your patch: > > > >>>> It is page-sized because of paravirt. [...] > > > > ... conflicts with the intention to mark (remap) the primary GDT address > > read-only > > on native kernels as well. > > > > So what we should do instead is to use the page alignment properly and > > remap the > > GDT to a read-only location, and load that one. > > If we'd have a small GDT (i.e. what my patch does), we still can remap the > entire page which contains small GDT, and simply don't care that some other > data > is also visible through that RO page. That's generally considered fragile: suppose an attacker has a limited information leak that can read absolute addresses with system privilege but he doesn't know the kernel's randomized base offset. With a 'partial page' mapping there could be function pointers near the GDT, part of the page the GDT happens to be on, that leak this information. (Same goes for crypto keys or other critical information (like canary information, salts, etc.) accidentally ending up nearby.) Arguably it's a bit tenuous, but when playing remapping games it's generally considered good to be page aligned and page sized, with zero padding. > > This would have a couple of advantages: > > > > - This would give kernel address randomization more teeth on x86. > > > > - An additional advantage would be that rootkits overwriting the GDT would > > have > >a bit more work to do. > > > > - A third advantage would be that for NUMA systems we could 'mirror' the > > GDT into > >node-local memory and load those. This makes GDT load cache-misses a bit > > less > >expensive. > > GDT is per-cpu. Isn't per-cpu memory already NUMA-local? Indeed it is: fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; done SGDT: 88103fa09000 / 007f SGDT: 88103fa29000 / 007f SGDT: 88103fa29000 / 007f SGDT: 88103fa49000 / 007f SGDT: 88103fa49000 / 007f SGDT: 88103fa49000 / 007f SGDT: 88103fa29000 / 007f SGDT: 88103fa69000 / 007f I confused it with the IDT, which is still global. This also means that the GDT in itself does not leak kernel addresses at the moment, except it leaks the layout of the percpu area. So my suggestion would be to: - make the GDT unconditionally page aligned and sized, then remap it to a read-only address unconditionally as well, like we do it for the IDT. - make the IDT per CPU as well, for performance reasons. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
* Denys Vlasenkowrote: > On 09/26/2015 09:50 PM, H. Peter Anvin wrote: > > NAK. We really should map the GDT read-only on all 64 bit systems, > > since we can't hide the address from SLDT. Same with the IDT. > > Sorry, I don't understand your point. So the problem is that right now the SGDT instruction (which is unprivileged) leaks the real address of the kernel image: fomalhaut:~> ./sgdt SGDT: 88303fd89000 / 007f that '88303fd89000' is a kernel address. fomalhaut:~> cat sgdt.c #include #include int main(void) { struct gdt_desc { unsigned short limit; unsigned long addr; } __attribute__((packed)) gdt_desc = { -1, -1 }; asm volatile("sgdt %0": "=m" (gdt_desc)); printf("SGDT: %016lx / %04x\n", gdt_desc.addr, gdt_desc.limit); return 0; } Your observation in the changelog and your patch: > >> It is page-sized because of paravirt. [...] ... conflicts with the intention to mark (remap) the primary GDT address read-only on native kernels as well. So what we should do instead is to use the page alignment properly and remap the GDT to a read-only location, and load that one. This would have a couple of advantages: - This would give kernel address randomization more teeth on x86. - An additional advantage would be that rootkits overwriting the GDT would have a bit more work to do. - A third advantage would be that for NUMA systems we could 'mirror' the GDT into node-local memory and load those. This makes GDT load cache-misses a bit less expensive. The IDT is already remapped: fomalhaut:~> ./sidt Sidt: ff57b000 / 0fff fomalhaut:~> cat sidt.c #include #include int main(void) { struct idt_desc { unsigned short limit; unsigned long addr; } __attribute__((packed)) idt_desc = { -1, -1 }; asm volatile("sidt %0": "=m" (idt_desc)); printf("Sidt: %016lx / %04x\n", idt_desc.addr, idt_desc.limit); return 0; } Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
* Linus Torvalds <torva...@linux-foundation.org> wrote: > On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mi...@kernel.org> wrote: > > > > Linus, what's your preference? > > So quite frankly, is there any reason we don't just implement > native_read_msr() as just > >unsigned long long native_read_msr(unsigned int msr) >{ > int err; > unsigned long long val; > > val = native_read_msr_safe(msr, ); > WARN_ON_ONCE(err); > return val; >} > > Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be > done with it. I don't see the downside. Absolutely! > How many msr reads are so critical that the function call overhead > would > matter? Get rid of the inline version of the _safe() thing too, and put that > thing there too. Only a very low number of them is performance critical (because even hw-accelerated MSR accesses are generally slow so we try to avoid MSR accesses in fast paths as much as possible, via shadowing, etc.) - and in the few cases where we have to access an MSR in a fast path we can do those separately. I'm only worried about the 'default' APIs, i.e. rdmsr() that is used throughout arch/x86/ over a hundred times, not about performance critical code paths that get enough testing and enough attention in general. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
* Andy Lutomirskiwrote: > On Sep 20, 2015 5:15 PM, "Linus Torvalds" > wrote: > > > > On Sun, Sep 20, 2015 at 5:02 PM, Andy Lutomirski wrote: > > > This demotes an OOPS and likely panic due to a failed non-"safe" MSR > > > access to a WARN_ON_ONCE and a return of zero (in the RDMSR case). > > > We still write a pr_info entry unconditionally for debugging. > > > > No, this is wrong. > > > > If you really want to do something like this, then just make all MSR reads > > safe. So the only difference between "safe" and "unsafe" is that the unsafe > > version just doesn't check the return value, and silently just returns zero > > for reads (or writes nothing). > > > > To quote Obi-Wan: "Use the exception table, Luke". > > > > Because decoding instructions is just too ugly. We'll do it for CPU errata > > where we might have to do it for user space code too (ie the AMD prefetch > > mess), but for code that _we_ control? Hell no. > > > > So NAK on this. > > My personal preference is to just not do this at all. A couple people > disagree. > If we make the unsafe variants not oops, then I think we want to have the > nice > loud warning, since these issues are bugs if they happen. > > We could certainly use the exception table for this, but it'll result in > bigger > core, since each MSR access will need an exception table entry and an > associated > fixup to call some helper that warns and sets the result to zero. > > I'd be happy to implement that, but only if it'll be applied. Otherwise I'd > rather just drop this patch and keep the rest of the series. Linus, what's your preference? Due to the bug mentioned earlier in this thread all MSR reads are currently 'safe' on all the major Linux distros (which all have CONFIG_PARAVIRT=y), i.e. by 'fixing' them we'd reintroduce random crashes into various fragile pieces of code... To add insult to injury, the current 'silently safe by accident' MSR code isn't so safe: because it leaves the result of the read uninitialized... To fix this all I'd really like to have: - safe MSR reads by default (i.e. never boot crash the kernel on some rare condition - which to most users is either a silent boot hang or an instant restart). Historicaly we had a stream of 'silly boot crashes' due to MSR reads that generate a #GPF. They make Linux less usable around the edges, especially in the x86 non-server (desktop) space where most hardware vendors are either openly Linux hostile, or, at best, Linux oblivious. - proper result-zeroing behavior on exceptions - and we should also generate _some_ sort of warning when MSR exceptions happen in an 'unintended' fashion. Maybe the warning could be put under a (default-enabled) config option for the size conscious. Or we could extend exception table entry encoding to include a 'warning bit', to not bloat the kernel. If the exception handler code encounters such an exception it would generate a one-time warning for that entry, but otherwise not crash the kernel and continue execution with an all-zeroes result for the MSR read. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
* Andy Lutomirskiwrote: > This demotes an OOPS and likely panic due to a failed non-"safe" MSR > access to a WARN_ON_ONCE and a return of poisoned values (in the > RDMSR case). We still write a pr_info entry unconditionally for > debugging. > > To be clear, this type of failure should *not* happen. This patch > exists to minimize the chance of nasty undebuggable failures due on > systems that used to work due to a now-fixed CONFIG_PARAVIRT=y bug. > + if (opcode == 0x320f) { > + /* RDMSR */ > + pr_info("bad kernel RDMSR from non-existent MSR 0x%x", > + (unsigned int)regs->cx); > + if (!panic_on_oops) { > + WARN_ON_ONCE(true); > + > + /* Patch it up with deterministic poison. */ > + regs->ax = 0x5aadc0de; > + regs->dx = 0x8badf00d; > + regs->ip += 2; > + return true; IMHO this should really not poison the result, but use zero as the result. The poison might randomly indicate 'present' feature in various registers that might be accessed in a buggy way. Don't send the code further down into la-la-land by giving it a 'success'. And yes, zero can mean success too, but we have to pick a side here ... The warning will be enough to fix these ups, people (and in particular distro testing people) will be watching out for them. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops
* Andy Lutomirskiwrote: > Setting CONFIG_PARAVIRT=y has an unintended side effect: it silently > turns all rdmsr and wrmsr operations into the safe variants without > any checks that the operations actually succeed. > > This is IMO awful: it papers over bugs. In particular, KVM gueests > might be unwittingly depending on this behavior because > CONFIG_KVM_GUEST currently depends on CONFIG_PARAVIRT. I'm not > aware of any such problems, but applying this series would be a good > way to shake them out. > > Fix it so that the MSR operations work the same on CONFIG_PARAVIRT=n > and CONFIG_PARAVIRT=y as long as Xen isn't being used. The Xen > maintainers are welcome to make a similar change on top of this. > > Since there's plenty of time before the next merge window, I think > we should apply and fix anything that breaks. No, I think we should at most generate a warning instead, and not crash the kernel via rdmsr()! Most big distro kernels on bare metal have CONFIG_PARAVIRT=y (I checked Ubuntu and Fedora), so we are potentially exposing a lot of users to problems. Crashing the bootup on an unknown MSR is bad. Many MSR reads and writes are non-critical and returning the 'safe' result is much better than crashing or hanging the bootup. ( We should double check that rdmsr()/wrmsr() results are never left uninitialized, but are set to zero or so, for cases where the return code is not checked. ) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops
* Andy Lutomirski <l...@amacapital.net> wrote: > On Thu, Sep 17, 2015 at 12:19 AM, Ingo Molnar <mi...@kernel.org> wrote: > > > > * Andy Lutomirski <l...@kernel.org> wrote: > > > >> Setting CONFIG_PARAVIRT=y has an unintended side effect: it silently > >> turns all rdmsr and wrmsr operations into the safe variants without > >> any checks that the operations actually succeed. > >> > >> This is IMO awful: it papers over bugs. In particular, KVM gueests > >> might be unwittingly depending on this behavior because > >> CONFIG_KVM_GUEST currently depends on CONFIG_PARAVIRT. I'm not > >> aware of any such problems, but applying this series would be a good > >> way to shake them out. > >> > >> Fix it so that the MSR operations work the same on CONFIG_PARAVIRT=n > >> and CONFIG_PARAVIRT=y as long as Xen isn't being used. The Xen > >> maintainers are welcome to make a similar change on top of this. > >> > >> Since there's plenty of time before the next merge window, I think > >> we should apply and fix anything that breaks. > > > > No, I think we should at most generate a warning instead, and not crash the > > kernel > > via rdmsr()! > > > > Most big distro kernels on bare metal have CONFIG_PARAVIRT=y (I checked > > Ubuntu and > > Fedora), so we are potentially exposing a lot of users to problems. > > > > Crashing the bootup on an unknown MSR is bad. Many MSR reads and writes are > > non-critical and returning the 'safe' result is much better than crashing or > > hanging the bootup. > > > > Should we do that for CONFIG_PARAVIRT=n, too? Absolutely. PARAVIRT=n should not behave differently from PARAVIRT=y on bare metal. > It would be straightforward to rig this up (temporarily?) on top of these > patches. To keep bloat down, we might want to implement it in > do_general_protection rather than sticking it in native_read_msr. Fair enough. > wrmsr is a different beast, since we can fail due to writing the wrong value > to > an otherwise valid MSR. Given that MSR screwups can very easily be security > holes, I'm not sure that warning and blindly continuing on an unchecked > failed > wrmsr is a good idea. So the fact is that right now we are silently ignoring failures there, and have been doing that for some time. The right first step is to live with that and start generating low-key, once-per-bootup warnings at most, and see how frequent (and how serious) they are. We could add a (default disabled) CONFIG_PANIC_ON_UNKNOWN_MSR=y option if that's really a serious concern. > In any event, I think it's nuts that CONFIG_PARAVIRT changes this > behavior. We should pick something sane and stick with it. Absolutely - and as it happens, the 'does not crash the kernel' PARAVIRT=y accidental behavior is actually quite close to what we wanted for a long time, so let's make it official - and add a warning to make sure we are aware of problems. But don't turn 'potential problems' into showstopper bugs such as a hard to debug early boot crash, which to most Linux users means a black screen on bootup! > > ( We should double check that rdmsr()/wrmsr() results are never left > > uninitialized, but are set to zero or so, for cases where the return code > > is not > > checked. ) > > It sure looks like native_read_msr_safe doesn't clear the output if the rdmsr > fails. Yeah, that should be fixed, to make such soft failures deterministic. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[tip:x86/asm] x86/asm/tsc: Add rdtscll() merge helper
Commit-ID: 99770737ca7e3ebc14e66460a69b7032de9421e1 Gitweb: http://git.kernel.org/tip/99770737ca7e3ebc14e66460a69b7032de9421e1 Author: Ingo Molnar mi...@kernel.org AuthorDate: Fri, 21 Aug 2015 08:33:53 +0200 Committer: Ingo Molnar mi...@kernel.org CommitDate: Fri, 21 Aug 2015 08:35:42 +0200 x86/asm/tsc: Add rdtscll() merge helper Some in-flight code makes use of the old rdtscll() (now removed), provide a wrapper for a kernel cycle to smooth the transition to rdtsc(). ( We use the safest variant, rdtsc_ordered(), which has barriers - this adds another incentive to remove the wrapper in the future. ) Cc: Andy Lutomirski l...@kernel.org Cc: Borislav Petkov b...@suse.de Cc: Andy Lutomirski l...@amacapital.net Cc: Borislav Petkov b...@alien8.de Cc: Brian Gerst brge...@gmail.com Cc: Denys Vlasenko dvlas...@redhat.com Cc: H. Peter Anvin h...@zytor.com Cc: Huang Rui ray.hu...@amd.com Cc: John Stultz john.stu...@linaro.org Cc: Len Brown l...@kernel.org Cc: Linus Torvalds torva...@linux-foundation.org Cc: Peter Zijlstra pet...@infradead.org Cc: Ralf Baechle r...@linux-mips.org Cc: Thomas Gleixner t...@linutronix.de Cc: kvm ML kvm@vger.kernel.org Link: http://lkml.kernel.org/r/dddbf98a2af53312e9aa73a5a2b1622fe5d6f52b.1434501121.git.l...@kernel.org Signed-off-by: Ingo Molnar mi...@kernel.org --- arch/x86/include/asm/msr.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index 131eec2..54e9f08 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -152,6 +152,9 @@ static __always_inline unsigned long long rdtsc_ordered(void) return rdtsc(); } +/* Deprecated, keep it for a cycle for easier merging: */ +#define rdtscll(now) do { (now) = rdtsc_ordered(); } while (0) + static inline unsigned long long native_read_pmc(int counter) { DECLARE_ARGS(val, low, high); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm:Return -ENOMEM directly for the function, kvm_create_lapic
* Nicholas Krause xerofo...@gmail.com wrote: In order to make code paths easier to read in the function, kvm_create_lapic we return -ENOMEM when unable to allocate memory for a kvm_lapic structure pointer directly. This makes the code easier to read and cleaner then jumping to a goto label at the end of the function's body for returning just the error code, -ENOMEM. Signed-off-by: Nicholas Krause xerofo...@gmail.com --- arch/x86/kvm/lapic.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 629af0f..88d0cce 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1687,7 +1687,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu) apic = kzalloc(sizeof(*apic), GFP_KERNEL); if (!apic) - goto nomem; + return -ENOMEM; vcpu-arch.apic = apic; @@ -1718,7 +1718,6 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu) return 0; nomem_free_apic: kfree(apic); -nomem: return -ENOMEM; } NAK! You just half destroyed the nice error handling cascade of labels. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc
* Hemant Kumar hem...@linux.vnet.ibm.com wrote: # perf kvm stat report -p 60515 Analyze events for pid(s) 60515, all VCPUs: VM-EXITSamples Samples% Time%Min Time Max Time Avg time H_DATA_STORAGE 500635.30% 0.13% 1.94us 49.46us 12.37us ( +- 0.52% ) HV_DECREMENTER 445731.43% 0.02% 0.72us 16.14us 1.91us ( +- 0.96% ) SYSCALL 269018.97% 0.10% 2.84us528.24us 18.29us ( +- 3.75% ) RETURN_TO_HOST 178912.61%99.76% 1.58us 672791.91us 27470.23us ( +- 3.00% ) EXTERNAL240 1.69% 0.00%0.69us 10.67us 1.33us ( +- 5.34% ) Where is the last line misaligned? Copy paste error or does perf kvm produce it in such a way? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for stable] x86/spinlocks/paravirt: Fix memory corruption on unlock
* Christian Borntraeger borntrae...@de.ibm.com wrote: By all means! You'll first need to cherry-pick these commits: 927609d622a3 kernel: tighten rules for ACCESS ONCE c5b19946eb76 kernel: Fix sparse warning for ACCESS_ONCE dd36929720f4 kernel: make READ_ONCE() valid on const arguments If you go before 3.19, you will also need 230fa253df63 kernel: Provide READ_ONCE and ASSIGN_ONCE 43239cbe79fc kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val) The affected spinlock code went over several iterations post v3.18, which I think makes the spinlock change too risky and complex to backport so far back. So it's not necessay to backport these READ_ONCE() changes. That's the minimum set you will need for backporting, due to overlapping changes to the ACCESS_ONCE() definition. and then apply this commit: d6abfdb20223 x86/spinlocks/paravirt: Fix memory corruption on unlock the alternative might be to replace READ_ONCE with ACCESS_ONCE when doing the backport. Doing changes to patches when doing a backport is a big no-no IMHO. Either there is a clean sequence of upstream commit IDs to cherry-pick, or it should not be backported in most cases. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for stable] x86/spinlocks/paravirt: Fix memory corruption on unlock
* Greg KH gre...@linuxfoundation.org wrote: It's: d6abfdb20223 x86/spinlocks/paravirt: Fix memory corruption on unlock Yes, This is the original patch. Please note I have taken out the READ_ONCE changes from the original patch to avoid build warnings mentioned below. (Those READ_ONCE changes were cosmetic and was not present in the previous versions) You'll also need this fix from Linus to avoid (harmless) build warnings: dd36929720f4 kernel: make READ_ONCE() valid on const arguments So this may not be absolutely necessary with the current patch. I'd prefer to be as close as possible to the upstream patch. So if applying both of these patches will work, I'd much rather do that. Changing patches when backporting them to stable for no good reason than to clean things up, just confuses everyone involved. Let's keep our messy history :) By all means! You'll first need to cherry-pick these commits: 927609d622a3 kernel: tighten rules for ACCESS ONCE c5b19946eb76 kernel: Fix sparse warning for ACCESS_ONCE dd36929720f4 kernel: make READ_ONCE() valid on const arguments That's the minimum set you will need for backporting, due to overlapping changes to the ACCESS_ONCE() definition. and then apply this commit: d6abfdb20223 x86/spinlocks/paravirt: Fix memory corruption on unlock I've double checked that these commits will cherry-pick fine on top of v3.19, in that order, and that an x86-64 defconfig+kvmconfig+PARAVIRT_SPINLOCK=y kernel builds fine without warnings. I've not boot tested the changes, so if anything breaks it's all your fault - while if it works just fine then I'll be glad to take credit for that. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for stable] x86/spinlocks/paravirt: Fix memory corruption on unlock
* Greg KH gre...@linuxfoundation.org wrote: On Tue, Feb 24, 2015 at 02:54:59PM +0530, Raghavendra K T wrote: Paravirt spinlock clears slowpath flag after doing unlock. As explained by Linus currently it does: prev = *lock; add_smp(lock-tickets.head, TICKET_LOCK_INC); /* add_smp() is a full mb() */ if (unlikely(lock-tickets.tail TICKET_SLOWPATH_FLAG)) __ticket_unlock_slowpath(lock, prev); which is *exactly* the kind of things you cannot do with spinlocks, because after you've done the add_smp() and released the spinlock for the fast-path, you can't access the spinlock any more. Exactly because a fast-path lock might come in, and release the whole data structure. Linus suggested that we should not do any writes to lock after unlock(), and we can move slowpath clearing to fastpath lock. So this patch implements the fix with: 1. Moving slowpath flag to head (Oleg): Unlocked locks don't care about the slowpath flag; therefore we can keep it set after the last unlock, and clear it again on the first (try)lock. -- this removes the write after unlock. note that keeping slowpath flag would result in unnecessary kicks. By moving the slowpath flag from the tail to the head ticket we also avoid the need to access both the head and tail tickets on unlock. 2. use xadd to avoid read/write after unlock that checks the need for unlock_kick (Linus): We further avoid the need for a read-after-release by using xadd; the prev head value will include the slowpath flag and indicate if we need to do PV kicking of suspended spinners -- on modern chips xadd isn't (much) more expensive than an add + load. Result: setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest) benchmark overcommit %improve kernbench 1x -0.13 kernbench 2x0.02 dbench 1x -1.77 dbench 2x -0.63 [Jeremy: hinted missing TICKET_LOCK_INC for kick] [Oleg: Moving slowpath flag to head, ticket_equals idea] [PeterZ: Detailed changelog] Reported-by: Sasha Levin sasha.le...@oracle.com Suggested-by: Linus Torvalds torva...@linux-foundation.org Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com Reviewed-by: Oleg Nesterov o...@redhat.com Acked-by: David Vrabel david.vra...@citrix.com --- arch/x86/include/asm/spinlock.h | 94 - arch/x86/kernel/kvm.c | 7 ++- arch/x86/xen/spinlock.c | 7 ++- 3 files changed, 58 insertions(+), 50 deletions(-) Changes for stable: - Don't replace the ACCESS_ONCE to READ_ONCE which would cause horraneous Compiler warnings (Linus, David Vbriel, PeterZ, Ingo) What is the git commit id of this in Linus's tree? What stable tree(s) do you want this applied to? It's: d6abfdb20223 x86/spinlocks/paravirt: Fix memory corruption on unlock You'll also need this fix from Linus to avoid (harmless) build warnings: dd36929720f4 kernel: make READ_ONCE() valid on const arguments Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: typo of the comment
* john_gong john_g...@yeah.net wrote: hi Paolo, i find a typo of the comment. From 09d5df31f0930e8e3eb10ad60a3debc53d6ce992 Mon Sep 17 00:00:00 2001 From: john_gong john_g...@yeah.net Date: Fri, 7 Nov 2014 07:32:17 +0800 Subject: [PATCH] modify a typo of the comment Signed-off-by: john_gong john_g...@yeah.net --- arch/x86/kvm/x86.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ef432f8..42bb2c8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -214,7 +214,7 @@ void kvm_define_shared_msr(unsigned slot, u32 msr) if (slot = shared_msrs_global.nr) shared_msrs_global.nr = slot + 1; shared_msrs_global.msrs[slot] = msr; -/* we need ensured the shared_msr_global have been updated */ +/* we need ensured the shared_msrs_global have been updated */ smp_wmb(); That's not the only typo in that sentence though. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] x86: structs for cpuid info in x86
* Nadav Amit nadav.a...@gmail.com wrote: On 9/16/14 4:22 PM, Ingo Molnar wrote: * Nadav Amit na...@cs.technion.ac.il wrote: The code that deals with x86 cpuid fields is hard to follow since it performs many bit operations and does not refer to cpuid field explicitly. To eliminate the need of openning a spec whenever dealing with cpuid fields, this patch-set introduces structs that reflect the various cpuid functions. Thanks for reviewing the patch-set. Nadav Amit (3): x86: Adding structs to reflect cpuid fields x86: Use new cpuid structs in cpuid functions KVM: x86: Using cpuid structs in KVM arch/x86/include/asm/cpuid_def.h | 163 +++ arch/x86/kernel/cpu/common.c | 56 -- arch/x86/kvm/cpuid.c | 36 + 3 files changed, 219 insertions(+), 36 deletions(-) create mode 100644 arch/x86/include/asm/cpuid_def.h I personally like bitfields in theory (they provide type clarity and abstract robustness, compared to open-coded bitmask numeric literals that are often used in cpuid using code, obfuscating cpuid usage), with the big caveat that for many years I didn't like bitfields in practice: older versions of GCC did a really poor job of optimizing them. So such a series would only be acceptable if it's demonstrated that both 'latest' and 'reasonably old' GCC versions do a good job in that department, compared to the old open-coded bitmask ops ... Comparing the 'size vmlinux' output of before/after kernels would probably be a good start in seeing the impact of such a change. If those results are positive then this technique could be propagated to all cpuid using code in arch/x86/, of which there's plenty. Thanks for the quick response. I was not aware GCC behaves this way. I made some small experiments with GCC-4.8 and GCC-4.4 and in brief my conclusions are: 1. The assembled code of bitmask and bitfields is indeed different. 2. GCC-4.8 and GCC-4.4 behave pretty much the same, yet GCC-4.8 appears to make better instructions reordering. 3. Loading/storing a single bitfield seems to be pretty much optimized (marginal advantage from code size point-of-view for bitmask, same number of instructions). 4. Loading/storing multiple bitfields seems to be somewhat under-optimized - multiple accesses to the original value result in ~30% more instructions and code-size. That's better than what I remembered. So you are correct - bitfields are less optimized. Nonetheless, since cpuid data is mostly used during startup, and otherwise a single bitfield is usually accessed in each function - I wonder whether it worth keeping the optimized but obfuscate code. Obviously, I can guess your answer to this question... So with the condition that you are actively watching out for performance critical code paths, I think the type clarity (i.e. bitfields) is a win. If hpa, tglx or Linus objects I'll yield to that objection though. Opinions, objections? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] x86: structs for cpuid info in x86
* Nadav Amit na...@cs.technion.ac.il wrote: The code that deals with x86 cpuid fields is hard to follow since it performs many bit operations and does not refer to cpuid field explicitly. To eliminate the need of openning a spec whenever dealing with cpuid fields, this patch-set introduces structs that reflect the various cpuid functions. Thanks for reviewing the patch-set. Nadav Amit (3): x86: Adding structs to reflect cpuid fields x86: Use new cpuid structs in cpuid functions KVM: x86: Using cpuid structs in KVM arch/x86/include/asm/cpuid_def.h | 163 +++ arch/x86/kernel/cpu/common.c | 56 -- arch/x86/kvm/cpuid.c | 36 + 3 files changed, 219 insertions(+), 36 deletions(-) create mode 100644 arch/x86/include/asm/cpuid_def.h I personally like bitfields in theory (they provide type clarity and abstract robustness, compared to open-coded bitmask numeric literals that are often used in cpuid using code, obfuscating cpuid usage), with the big caveat that for many years I didn't like bitfields in practice: older versions of GCC did a really poor job of optimizing them. So such a series would only be acceptable if it's demonstrated that both 'latest' and 'reasonably old' GCC versions do a good job in that department, compared to the old open-coded bitmask ops ... Comparing the 'size vmlinux' output of before/after kernels would probably be a good start in seeing the impact of such a change. If those results are positive then this technique could be propagated to all cpuid using code in arch/x86/, of which there's plenty. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] softlockup: make detector be aware of task switch of processes hogging cpu
* Don Zickus dzic...@redhat.com wrote: From: chai wen chaiw.f...@cn.fujitsu.com For now, soft lockup detector warns once for each case of process softlockup. But the thread 'watchdog/n' may not always get the cpu at the time slot between the task switch of two processes hogging that cpu to reset soft_watchdog_warn. An example would be two processes hogging the cpu. Process A causes the softlockup warning and is killed manually by a user. Process B immediately becomes the new process hogging the cpu preventing the softlockup code from resetting the soft_watchdog_warn variable. This case is a false negative of warn only once for a process, as there may be a different process that is going to hog the cpu. Resolve this by saving/checking the pid of the hogging process and use that to reset soft_watchdog_warn too. Signed-off-by: chai wen chaiw.f...@cn.fujitsu.com [modified the comment and changelog to be more specific] Signed-off-by: Don Zickus dzic...@redhat.com --- kernel/watchdog.c | 20 ++-- 1 files changed, 18 insertions(+), 2 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 4c2e11c..6d0a891 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync); static DEFINE_PER_CPU(bool, soft_watchdog_warn); static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt); +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved); #ifdef CONFIG_HARDLOCKUP_DETECTOR static DEFINE_PER_CPU(bool, hard_watchdog_warn); static DEFINE_PER_CPU(bool, watchdog_nmi_touch); @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) */ duration = is_softlockup(touch_ts); if (unlikely(duration)) { + pid_t pid = task_pid_nr(current); + /* * If a virtual machine is stopped by the host it can look to * the watchdog like a soft lockup, check to see if the host @@ -326,8 +329,20 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) return HRTIMER_RESTART; /* only warn once */ - if (__this_cpu_read(soft_watchdog_warn) == true) + if (__this_cpu_read(soft_watchdog_warn) == true) { + + /* + * Handle the case where multiple processes are + * causing softlockups but the duration is small + * enough, the softlockup detector can not reset + * itself in time. Use pids to detect this. + */ + if (__this_cpu_read(softlockup_warn_pid_saved) != pid) { So I agree with the motivation of this improvement, but is this implementation namespace-safe? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/5] watchdog: fix print-once on enable
* Don Zickus dzic...@redhat.com wrote: --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -522,6 +522,9 @@ static void watchdog_nmi_disable(unsigned int cpu) /* should be in cleanup, but blocks oprofile */ perf_event_release_kernel(event); } + if (cpu == 0) + /* watchdog_nmi_enable() expects this to be zero initially. */ + cpu0_err = 0; return; } While at it I also removed the stray 'return;'. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] watchdog: control hard lockup detection default
* Don Zickus dzic...@redhat.com wrote: The running kernel still has the ability to enable/disable at any time with /proc/sys/kernel/nmi_watchdog us usual. However even when the default has been overridden /proc/sys/kernel/nmi_watchdog will initially show '1'. To truly turn it on one must disable/enable it, i.e. echo 0 /proc/sys/kernel/nmi_watchdog echo 1 /proc/sys/kernel/nmi_watchdog This looks like a bug, why is this so? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/5] watchdog: fix print-once on enable
* Don Zickus dzic...@redhat.com wrote: From: Ulrich Obergfell uober...@redhat.com This patch avoids printing the message 'enabled on all CPUs, ...' multiple times. For example, the issue can occur in the following scenario: 1) watchdog_nmi_enable() fails to enable PMU counters and sets cpu0_err. 2) 'echo [0|1] /proc/sys/kernel/nmi_watchdog' is executed to disable and re-enable the watchdog mechanism 'on the fly'. 3) If watchdog_nmi_enable() succeeds to enable PMU counters, each CPU will print the message because step1 left behind a non-zero cpu0_err. if (!IS_ERR(event)) { if (cpu == 0 || cpu0_err) pr_info(enabled on all CPUs, ...) The patch avoids this by clearing cpu0_err in watchdog_nmi_disable(). Signed-off-by: Ulrich Obergfell uober...@redhat.com Signed-off-by: Andrew Jones drjo...@redhat.com Signed-off-by: Don Zickus dzic...@redhat.com --- kernel/watchdog.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 6d0a891..0838685 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -522,6 +522,9 @@ static void watchdog_nmi_disable(unsigned int cpu) /* should be in cleanup, but blocks oprofile */ perf_event_release_kernel(event); } + if (cpu == 0) + /* watchdog_nmi_enable() expects this to be zero initially. */ + cpu0_err = 0; Looks good except a small detail: two-line blocks need curly braces too, even if it's just a single C statement. I've fixed this up in the commit. thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] watchdog: control hard lockup detection default
* Don Zickus dzic...@redhat.com wrote: From: Ulrich Obergfell uober...@redhat.com In some cases we don't want hard lockup detection enabled by default. An example is when running as a guest. Introduce watchdog_enable_hardlockup_detector(bool) So, the name watchdog_enable_hardlockup_detector_enable(false) is both too long and also really confusing (because first it suggests enablement, then disables it), so I renamed it to hardlockup_detector_set(), which allows two natural variants: hardlockup_detector_set(false); ... hardlockup_detector_set(true); Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] softlockup: make detector be aware of task switch of processes hogging cpu
* Don Zickus dzic...@redhat.com wrote: On Mon, Aug 18, 2014 at 11:03:19AM +0200, Ingo Molnar wrote: * Don Zickus dzic...@redhat.com wrote: From: chai wen chaiw.f...@cn.fujitsu.com For now, soft lockup detector warns once for each case of process softlockup. But the thread 'watchdog/n' may not always get the cpu at the time slot between the task switch of two processes hogging that cpu to reset soft_watchdog_warn. An example would be two processes hogging the cpu. Process A causes the softlockup warning and is killed manually by a user. Process B immediately becomes the new process hogging the cpu preventing the softlockup code from resetting the soft_watchdog_warn variable. This case is a false negative of warn only once for a process, as there may be a different process that is going to hog the cpu. Resolve this by saving/checking the pid of the hogging process and use that to reset soft_watchdog_warn too. Signed-off-by: chai wen chaiw.f...@cn.fujitsu.com [modified the comment and changelog to be more specific] Signed-off-by: Don Zickus dzic...@redhat.com --- kernel/watchdog.c | 20 ++-- 1 files changed, 18 insertions(+), 2 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 4c2e11c..6d0a891 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync); static DEFINE_PER_CPU(bool, soft_watchdog_warn); static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt); +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved); #ifdef CONFIG_HARDLOCKUP_DETECTOR static DEFINE_PER_CPU(bool, hard_watchdog_warn); static DEFINE_PER_CPU(bool, watchdog_nmi_touch); @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) */ duration = is_softlockup(touch_ts); if (unlikely(duration)) { + pid_t pid = task_pid_nr(current); + /* * If a virtual machine is stopped by the host it can look to * the watchdog like a soft lockup, check to see if the host @@ -326,8 +329,20 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) return HRTIMER_RESTART; /* only warn once */ - if (__this_cpu_read(soft_watchdog_warn) == true) + if (__this_cpu_read(soft_watchdog_warn) == true) { + + /* + * Handle the case where multiple processes are + * causing softlockups but the duration is small + * enough, the softlockup detector can not reset + * itself in time. Use pids to detect this. + */ + if (__this_cpu_read(softlockup_warn_pid_saved) != pid) { So I agree with the motivation of this improvement, but is this implementation namespace-safe? What namespace are you worried about colliding with? I thought softlockup_ would provide the safety?? Maybe I am missing something obvious. :-( I meant PID namespaces - a PID in itself isn't guaranteed to be unique across the system. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] watchdog: control hard lockup detection default
* Don Zickus dzic...@redhat.com wrote: On Mon, Aug 18, 2014 at 11:16:44AM +0200, Ingo Molnar wrote: * Don Zickus dzic...@redhat.com wrote: The running kernel still has the ability to enable/disable at any time with /proc/sys/kernel/nmi_watchdog us usual. However even when the default has been overridden /proc/sys/kernel/nmi_watchdog will initially show '1'. To truly turn it on one must disable/enable it, i.e. echo 0 /proc/sys/kernel/nmi_watchdog echo 1 /proc/sys/kernel/nmi_watchdog This looks like a bug, why is this so? It is, but it always has been there in the case of the PMU not being able to provide a resource for the hardlockup. This change just exposes it more. There seems to be two issues: 1) When it's impossible to enable the hardlockup detector, it should default to -1 or so, and attempts to set it should return a -EINVAL or so. Bootup messages should also indicate when it's not possible to enable it but a user requests it. 2) The softlockup and hardlockup detection control variables should be in separate flags, inside and outside the kernel - they (should) not relate to each other. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] watchdog: control hard lockup detection default
* Don Zickus dzic...@redhat.com wrote: 2) The softlockup and hardlockup detection control variables should be in separate flags, inside and outside the kernel - they (should) not relate to each other. They did because years ago I thought we wanted to keep them as one entity instead of two. I would have to re-work the code to do this (and add more knobs). I presume you would want those changes done before taking this patchset? Yeah, fixing/cleaning up things would be nice before spreading the pain via new features/control mechanisms. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] softlockup: make detector be aware of task switch of processes hogging cpu
* Don Zickus dzic...@redhat.com wrote: So I agree with the motivation of this improvement, but is this implementation namespace-safe? What namespace are you worried about colliding with? I thought softlockup_ would provide the safety?? Maybe I am missing something obvious. :-( I meant PID namespaces - a PID in itself isn't guaranteed to be unique across the system. Ah, I don't think we thought about that. Is there a better way to do this? Is there a domain id or something that can be OR'd with the pid? What is always unique is the task pointer itself. We use pids when we interface with user-space - but we don't really do that here, right? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v9 05/19] qspinlock: Optimize for smaller NR_CPUS
* Waiman Long waiman.l...@hp.com wrote: On 04/18/2014 03:46 AM, Ingo Molnar wrote: * Waiman Longwaiman.l...@hp.com wrote: On 04/17/2014 11:58 AM, Peter Zijlstra wrote: On Thu, Apr 17, 2014 at 11:03:57AM -0400, Waiman Long wrote: +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + struct __qspinlock *l = (void *)lock; + + ACCESS_ONCE(l-locked_pending) = 1; +} @@ -157,8 +251,13 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) * we're pending, wait for the owner to go away. * * *,1,1 - *,1,0 + * + * this wait loop must be a load-acquire such that we match the + * store-release that clears the locked bit and create lock + * sequentiality; this because not all try_clear_pending_set_locked() + * implementations imply full barriers. You renamed the function referred in the above comment. Sorry, will fix the comments. I suggest not renaming the function instead. try_clear_pending_set_locked() tells the intent in a clearer fashion. Thanks, Ingo I usually use the word try if there is a possibility of failure. However, the function will always succeed, albeit by waiting a bit in some cases. That is why I remove try from the name. Fair enough! Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v9 03/19] qspinlock: Add pending bit
* Waiman Long waiman.l...@hp.com wrote: Because the qspinlock needs to touch a second cacheline; add a pending bit and allow a single in-word spinner before we punt to the second cacheline. Signed-off-by: Peter Zijlstra pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com This patch should have a From: Peter in it as well, right? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v9 05/19] qspinlock: Optimize for smaller NR_CPUS
* Waiman Long waiman.l...@hp.com wrote: On 04/17/2014 11:58 AM, Peter Zijlstra wrote: On Thu, Apr 17, 2014 at 11:03:57AM -0400, Waiman Long wrote: +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + struct __qspinlock *l = (void *)lock; + + ACCESS_ONCE(l-locked_pending) = 1; +} @@ -157,8 +251,13 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) * we're pending, wait for the owner to go away. * * *,1,1 - *,1,0 +* +* this wait loop must be a load-acquire such that we match the +* store-release that clears the locked bit and create lock +* sequentiality; this because not all try_clear_pending_set_locked() +* implementations imply full barriers. You renamed the function referred in the above comment. Sorry, will fix the comments. I suggest not renaming the function instead. try_clear_pending_set_locked() tells the intent in a clearer fashion. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation
* Waiman Long waiman.l...@hp.com wrote: On 04/04/2014 09:00 AM, Peter Zijlstra wrote: So I'm just not ever going to pick up this patch; I spend a week trying to reverse engineer this; I posted a 7 patch series creating the equivalent, but in a gradual and readable fashion: http://lkml.kernel.org/r/20140310154236.038181...@infradead.org You keep on ignoring that; I'll keep on ignoring your patches. I might at some point rewrite some of your pv stuff on top to get this moving again, but I'm not really motivated to work with you atm. Peter, I am sorry that I have focused recently on making the qspinlock patch works with virtualization and it is easier for me to based off on my patch initially. Now the PV part is almost done, I will apply them on top of your patch. Hopefully, I will get a new patch out sometime next week. Note that it's not a patch that PeterZ posted, but a series of 7 finegrained patches, each properly documented and commented. Please preserve that work, build on top of it, and don't just ignore it! Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation
* Waiman Long waiman.l...@hp.com wrote: On 04/04/2014 12:57 PM, Konrad Rzeszutek Wilk wrote: On Fri, Apr 04, 2014 at 03:00:12PM +0200, Peter Zijlstra wrote: So I'm just not ever going to pick up this patch; I spend a week trying to reverse engineer this; I posted a 7 patch series creating the equivalent, but in a gradual and readable fashion: http://lkml.kernel.org/r/20140310154236.038181...@infradead.org You keep on ignoring that; I'll keep on ignoring your patches. I might at some point rewrite some of your pv stuff on top to get this moving again, but I'm not really motivated to work with you atm. Uh? Did you CC also xen-de...@lists.xenproject.org on your patches Peter? I hadn't had a chance to see or comment on them :-( Peter's patch is a rewrite of my patches 1-4, there is no PV or unfair lock support in there. It is a fine grained split-up, which does one thing at a time, so it all becomes reviewable and mergable (and the claimed effects become testable!). Please use that as a base. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] tree-wide: clean up no longer required #include linux/init.h
* Paul Gortmaker paul.gortma...@windriver.com wrote: On Feb 4, 2014 3:52 PM, Paul Gortmaker paul.gortma...@windriver.com wrote: We've had this in linux-next for 2+ weeks (thanks Stephen!) as a linux-stable like queue of patches, and as can be seen here: https://git.kernel.org/pub/scm/linux/kernel/git/paulg/init.git Argh, above link is meant for cloning, not viewing. This should be better... https://git.kernel.org/cgit/linux/kernel/git/paulg/init.git/ So, if you meant Linus to pull it, you probably want to cite a real Git URI along the lines of: git://git.kernel.org/pub/scm/linux/kernel/git/paulg/init.git ( Otherwise your pull request might be ignored or worse, you might get an honest reply, due to the https transport being considered evil that no free man outside of corporate firewalls should ever consider and all that. ) Nice cleanups btw. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] tree-wide: clean up no longer required #include linux/init.h
* Stephen Rothwell s...@canb.auug.org.au wrote: Hi Ingo, On Wed, 5 Feb 2014 07:06:33 +0100 Ingo Molnar mi...@kernel.org wrote: So, if you meant Linus to pull it, you probably want to cite a real Git URI along the lines of: git://git.kernel.org/pub/scm/linux/kernel/git/paulg/init.git Paul provided the proper git url further down in the mail along with the usual pull request message (I guess he should have put that bit at the top). Yeah, indeed, and it even comes with a signed tag, which is an extra nice touch: git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git tags/init-cleanup (I guess the https was mentioned first to lower expectations.) Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RESEND V13 14/14] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor
* H. Peter Anvin h...@zytor.com wrote: Raghavendra... Even with this latest patch this branch is broken: :(.discard+0x6108): multiple definition of `__pcpu_unique_lock_waiting' arch/x86/xen/built-in.o:(.discard+0x23): first defined here CC drivers/firmware/google/gsmi.o arch/x86/kernel/built-in.o:(.discard+0x6108): multiple definition of `__pcpu_unique_lock_waiting' arch/x86/xen/built-in.o:(.discard+0x23): first defined here CC sound/core/seq/oss/seq_oss_init.o This is trivially reproducible by doing a build with make allyesconfig. Please fix and *verify* it is fixed before resubmitting. I will be away so Ingo will have to handle the resubmission. Would be nice to have a delta fix patch against tip:x86/spinlocks, which I'll then backmerge into that series via rebasing it. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor
* Gleb Natapov g...@redhat.com wrote: On Fri, Aug 02, 2013 at 11:25:39AM +0200, Ingo Molnar wrote: Ingo, Do you have any concerns reg this series? please let me know if this looks good now to you. I'm inclined to NAK it for excessive quotation - who knows how many people left the discussion in disgust? Was it done to drive away as many reviewers as possible? Anyway, see my other reply, the measurement results seem hard to interpret and inconclusive at the moment. That result was only for patch 18 of the series, not pvspinlock in general. Okay - I've re-read the performance numbers and they are impressive, so no objections from me. The x86 impact seems to be a straightforward API change, with most of the changes on the virtualization side. So: Acked-by: Ingo Molnar mi...@kernel.org I guess you'd want to carry this in the KVM tree or so - maybe in a separate branch because it changes Xen as well? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor
* Gleb Natapov g...@redhat.com wrote: On Mon, Aug 05, 2013 at 11:46:03AM +0200, Ingo Molnar wrote: Acked-by: Ingo Molnar mi...@kernel.org I guess you'd want to carry this in the KVM tree or so - maybe in a separate branch because it changes Xen as well? It changes KVM host and guest side, XEN and common x86 spinlock code. I think it would be best to merge common x86 spinlock bits and guest side KVM/XEN bits through tip tree and host KVM part will go through KVM tree. If this is OK with you, Ingo, and XEN folks Raghavendra can send two separate patch series one for the tip and one for KVM host side. Sure, that's fine - if the initial series works fine in isolation as well (i.e. won't break anything). Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor
* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: On 08/01/2013 02:34 PM, Raghavendra K T wrote: On 08/01/2013 01:15 PM, Gleb Natapov wrote: Shall I consider this as an ack for kvm part? For everything except 18/18. For that I still want to see numbers. But 18/18 is pretty independent from the reset of the series so it should not stop the reset from going in. Yes. agreed. I am going to evaluate patch 18 separately and come with results for that. Now we can consider only 1-17 patches. Gleb, 32 core machine with HT off 32 vcpu guests. base = 3.11-rc + patch 1 -17 pvspinlock_v11 patched = base + patch 18 +---+---+---++---+ dbench (Throughput in MB/sec higher is better) +---+---+---++---+ base stdev patchedstdev %improvement +---+---+---++---+ 1x 14584.3800 146.9074 14705.1000 163.1060 0.82773 2x 1713.730032.87501717.320045.5979 0.20948 3x 967.821242.0257 971.885518.8532 0.41994 4x 685.276425.7150 694.5881 8.3907 1.35882 +---+---+---++---+ Please list stddev in percentage as well ... a blind stab gave me these figures: base stdev patchedstdev %improvement 3x 967.82124.3% 971.8855 1.8% 0.4 That makes the improvement an order of magnitude smaller than the noise of the measurement ... i.e. totally inconclusive. Also please cut the excessive decimal points: with 2-4% noise what point is there in 5 decimal point results?? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor
* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: On 07/31/2013 11:54 AM, Gleb Natapov wrote: On Tue, Jul 30, 2013 at 10:13:12PM +0530, Raghavendra K T wrote: On 07/25/2013 03:08 PM, Raghavendra K T wrote: On 07/25/2013 02:45 PM, Gleb Natapov wrote: On Thu, Jul 25, 2013 at 02:47:37PM +0530, Raghavendra K T wrote: On 07/24/2013 06:06 PM, Raghavendra K T wrote: On 07/24/2013 05:36 PM, Gleb Natapov wrote: On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote: On 07/24/2013 04:09 PM, Gleb Natapov wrote: On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote: On 07/23/2013 08:37 PM, Gleb Natapov wrote: On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote: +static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want) [...] [ a few hundred lines of unnecessary quotation snipped. ] Ingo, Do you have any concerns reg this series? please let me know if this looks good now to you. I'm inclined to NAK it for excessive quotation - who knows how many people left the discussion in disgust? Was it done to drive away as many reviewers as possible? Anyway, see my other reply, the measurement results seem hard to interpret and inconclusive at the moment. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] x86: Add a Kconfig shortcut for kvm guest kernel
* Borislav Petkov b...@alien8.de wrote: From: Borislav Petkov b...@suse.de Date: Tue, 16 Apr 2013 18:24:34 +0200 Subject: [PATCH -v2] x86: Add a Kconfig shortcut for kvm guest kernel This is pretty useful for the case where people want to boot the resulting kernel in qemu/kvm. Instead of going and searching for each required option through the Kconfig maze, this single option should simply enable everything required/good to have to boot the resulting kernel in the guest. Please mention: ' This patch is based on a similar utility patch of the external lkvm tree. ' Cc: Fengguang Wu fengguang...@intel.com Originally-by: Pekka Enberg penb...@kernel.org Originally-by: Sasha Levin levinsasha...@gmail.com Signed-off-by: Borislav Petkov b...@suse.de --- Here's v2 which should be addressing all review comments so far. arch/x86/Kconfig | 38 ++ 1 file changed, 38 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5651374d179f..76a95ffa959a 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -680,6 +680,44 @@ config KVM_GUEST underlying device model, the host provides the guest with timing infrastructure such as time of day, and system time +config KVM_GUEST_COMMODITY_OPTIONS Call this KVM_GUEST_COMMON_OPTIONS? + bool Enable commodity options for a standalone KVM guest + depends on KVM_GUEST + select NET + select NETDEVICES + select BLOCK + select BLK_DEV + select NETWORK_FILESYSTEMS + select INET + select EXPERIMENTAL + select TTY + select SERIAL_8250 + select SERIAL_8250_CONSOLE + select IP_PNP + select IP_PNP_DHCP + select BINFMT_ELF + select PCI_MSI + select HAVE_ARCH_KGDB + select DEBUG_KERNEL + select KGDB + select KGDB_SERIAL_CONSOLE + select VIRTUALIZATION + select VIRTIO + select VIRTIO_RING + select VIRTIO_PCI + select VIRTIO_BLK + select VIRTIO_CONSOLE + select VIRTIO_NET + select 9P_FS + select NET_9P + select NET_9P_VIRTIO + ---help--- + Select guest kernel functionality which facilitates booting the + kernel as a guest in qemu/kvm. This entails basic stuff like s/qemu/qemu or lkvm + serial support, kgdb, virtio and other so that you can be able to + have commodity functionality like serial output from the guest, + networking, etc. And seemless host file system integration into guest context. (that is what the 9P options are about) Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v8 2/7] KVM: VMX: Register a new IPI for posted interrupt
* Gleb Natapov g...@redhat.com wrote: On Mon, Apr 08, 2013 at 10:23:17PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Posted Interrupt feature requires a special IPI to deliver posted interrupt to guest. And it should has a high priority so the interrupt will not be blocked by others. Normally, the posted interrupt will be consumed by vcpu if target vcpu is running and transparent to OS. But in some cases, the interrupt will arrive when target vcpu is scheduled out. And host will see it. So we need to register a dump handler to handle it. Ingo can I add your ACK to this one? In the past you agreed to the approach. Yeah, it's fine to me - assuming it's tested and does not break anything. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 2/5] KVM: VMX: Register a new IPI for posted interrupt
* Yang Zhang yang.z.zh...@intel.com wrote: diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index 6e03b0d..2329a54 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -205,6 +205,10 @@ static void __init apic_intr_init(void) /* IPI for X86 platform specific use */ alloc_intr_gate(X86_PLATFORM_IPI_VECTOR, x86_platform_ipi); +#ifdef CONFIG_HAVE_KVM + /* IPI for KVM to deliver posted interrupt */ + alloc_intr_gate(POSTED_INTR_VECTOR, kvm_posted_intr_ipi); +#endif Please avoid wasting an IDT entry by reusing x86_platform_ipi. A KVM guest is in essence one type of 'x86 platform', and this callback is used by hardware platforms, so collision is not an issue AFAICS. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 2/5] KVM: VMX: Register a new IPI for posted interrupt
* Gleb Natapov g...@redhat.com wrote: On Fri, Mar 08, 2013 at 02:26:25PM +0100, Ingo Molnar wrote: * Yang Zhang yang.z.zh...@intel.com wrote: diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index 6e03b0d..2329a54 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -205,6 +205,10 @@ static void __init apic_intr_init(void) /* IPI for X86 platform specific use */ alloc_intr_gate(X86_PLATFORM_IPI_VECTOR, x86_platform_ipi); +#ifdef CONFIG_HAVE_KVM + /* IPI for KVM to deliver posted interrupt */ + alloc_intr_gate(POSTED_INTR_VECTOR, kvm_posted_intr_ipi); +#endif Please avoid wasting an IDT entry by reusing x86_platform_ipi. A KVM guest is in essence one type of 'x86 platform', and this callback is used by hardware platforms, so collision is not an issue AFAICS. This is IPI send by a host though. But received on the guest side, right? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 2/5] KVM: VMX: Register a new IPI for posted interrupt
* Gleb Natapov g...@redhat.com wrote: On Fri, Mar 08, 2013 at 03:05:45PM +0100, Ingo Molnar wrote: * Gleb Natapov g...@redhat.com wrote: On Fri, Mar 08, 2013 at 02:26:25PM +0100, Ingo Molnar wrote: * Yang Zhang yang.z.zh...@intel.com wrote: diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index 6e03b0d..2329a54 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -205,6 +205,10 @@ static void __init apic_intr_init(void) /* IPI for X86 platform specific use */ alloc_intr_gate(X86_PLATFORM_IPI_VECTOR, x86_platform_ipi); +#ifdef CONFIG_HAVE_KVM + /* IPI for KVM to deliver posted interrupt */ + alloc_intr_gate(POSTED_INTR_VECTOR, kvm_posted_intr_ipi); +#endif Please avoid wasting an IDT entry by reusing x86_platform_ipi. A KVM guest is in essence one type of 'x86 platform', and this callback is used by hardware platforms, so collision is not an issue AFAICS. This is IPI send by a host though. But received on the guest side, right? Not directly. If CPU that receives it happens to run in a guest mode it makes VMX to re-evaluate pending interrupt and inject one if possible without vmexit. If CPU is not in a guest mode the handler for the IPI is called in a host mode and does nothing. Guest code is unaware of the existence of that IPI. Ok, I guess a separate IPI is fine (and better) in this case then. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: * Ingo Molnar mi...@kernel.org [2013-01-24 11:32:13]: * Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: From: Peter Zijlstra pet...@infradead.org In case of undercomitted scenarios, especially in large guests yield_to overhead is significantly high. when run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of PLE handler. (History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core. Signed-off-by: Peter Zijlstra pet...@infradead.org Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com Acked-by: Andrew Jones drjo...@redhat.com Tested-by: Chegu Vinod chegu_vi...@hp.com --- kernel/sched/core.c | 25 +++-- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d8927f..fc219a5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield); * It's the caller's job to ensure that the target task struct * can't go away on us before we can do any checks. * - * Returns true if we indeed boosted the target task. + * Returns: + * true (0) if we indeed boosted the target task. + * false (0) if we failed to boost the target. + * -ESRCH if there's no task to yield to. */ bool __sched yield_to(struct task_struct *p, bool preempt) { @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + /* + * If we're the only runnable task on the rq and target rq also + * has only one task, there's absolutely no point in yielding. + */ + if (rq-nr_running == 1 p_rq-nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } Looks good to me in principle. Would be nice to get more consistent benchmark numbers. Once those are unambiguously showing that this is a win: Acked-by: Ingo Molnar mi...@kernel.org I ran the test with kernbench and sysbench again on 32 core mx3850 machine with 32 vcpu guests. Results shows definite improvements. ebizzy and dbench show similar improvement for 1x overcommit (note that stdev for 1x in dbench is lesser improvemet is now seen at only 20%) [ all the experiments are taken out of 8 run averages ]. The patches benefit large guest undercommit scenarios, so I believe with large guest performance improvemnt is even significant. [ Chegu Vinod results show performance near to no ple cases ]. Unfortunately I do not have a machine to test larger guest (32). Ingo, Please let me know if this is okay to you. base kernel = 3.8.0-rc4 +---+---+---++---+ kernbench (time in sec lower is better) +---+---+---++---+ basestdevpatchedstdev %improve +---+---+---++---+ 1x 46.6028 1.8672 42.4494 1.1390 8.91234 2x 99.9074 9.1859 90.4050 2.6131 9.51121 +---+---+---++---+ +---+---+---++---+ sysbench (time in sec lower is better) +---+---+---++---+ basestdevpatchedstdev %improve +---+---+---++---+ 1x 18.7402 0.3764 17.7431 0.3589 5.32065 2x 13.2238 0.1935 13.0096 0.3152 1.61981 +---+---+---++---+ +---+---+---++---+ ebizzy (records/sec higher is better) +---+---+---++---+ basestdevpatchedstdev %improve +---+---+---++---+ 1x 2421.900019.18015883.1000 112.7243 142.91259 +---+---+---++---+ +---+---+---++---+ dbench (throughput MB/sec higher is better) +---+---+---++---+ basestdevpatchedstdev %improve +---+---+---++---+ 1x 11675.9900 857.4154
Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: On 01/25/2013 04:17 PM, Ingo Molnar wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: * Ingo Molnar mi...@kernel.org [2013-01-24 11:32:13]: * Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: From: Peter Zijlstra pet...@infradead.org In case of undercomitted scenarios, especially in large guests yield_to overhead is significantly high. when run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of PLE handler. (History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core. Signed-off-by: Peter Zijlstra pet...@infradead.org Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com Acked-by: Andrew Jones drjo...@redhat.com Tested-by: Chegu Vinod chegu_vi...@hp.com --- kernel/sched/core.c | 25 +++-- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d8927f..fc219a5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield); * It's the caller's job to ensure that the target task struct * can't go away on us before we can do any checks. * - * Returns true if we indeed boosted the target task. + * Returns: + * true (0) if we indeed boosted the target task. + * false (0) if we failed to boost the target. + * -ESRCH if there's no task to yield to. */ bool __sched yield_to(struct task_struct *p, bool preempt) { @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + /* + * If we're the only runnable task on the rq and target rq also + * has only one task, there's absolutely no point in yielding. + */ + if (rq-nr_running == 1 p_rq-nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } Looks good to me in principle. Would be nice to get more consistent benchmark numbers. Once those are unambiguously showing that this is a win: Acked-by: Ingo Molnar mi...@kernel.org I ran the test with kernbench and sysbench again on 32 core mx3850 machine with 32 vcpu guests. Results shows definite improvements. ebizzy and dbench show similar improvement for 1x overcommit (note that stdev for 1x in dbench is lesser improvemet is now seen at only 20%) [ all the experiments are taken out of 8 run averages ]. The patches benefit large guest undercommit scenarios, so I believe with large guest performance improvemnt is even significant. [ Chegu Vinod results show performance near to no ple cases ]. Unfortunately I do not have a machine to test larger guest (32). Ingo, Please let me know if this is okay to you. base kernel = 3.8.0-rc4 +---+---+---++---+ kernbench (time in sec lower is better) +---+---+---++---+ basestdevpatchedstdev %improve +---+---+---++---+ 1x 46.6028 1.8672 42.4494 1.1390 8.91234 2x 99.9074 9.1859 90.4050 2.6131 9.51121 +---+---+---++---+ +---+---+---++---+ sysbench (time in sec lower is better) +---+---+---++---+ basestdevpatchedstdev %improve +---+---+---++---+ 1x 18.7402 0.3764 17.7431 0.3589 5.32065 2x 13.2238 0.1935 13.0096 0.3152 1.61981 +---+---+---++---+ +---+---+---++---+ ebizzy (records/sec higher is better) +---+---+---++---+ basestdevpatchedstdev %improve +---+---+---++---+ 1x 2421.900019.1801 5883.1000 112.7243 142.91259 +---+---+---++---+ +---+---+---++---+ dbench (throughput MB/sec higher is better) +---+---+---++---+ basestdevpatchedstdev %improve +---+---+---++---+ 1x 11675.9900
Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: From: Peter Zijlstra pet...@infradead.org In case of undercomitted scenarios, especially in large guests yield_to overhead is significantly high. when run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of PLE handler. (History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core. Signed-off-by: Peter Zijlstra pet...@infradead.org Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com Acked-by: Andrew Jones drjo...@redhat.com Tested-by: Chegu Vinod chegu_vi...@hp.com --- kernel/sched/core.c | 25 +++-- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d8927f..fc219a5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield); * It's the caller's job to ensure that the target task struct * can't go away on us before we can do any checks. * - * Returns true if we indeed boosted the target task. + * Returns: + * true (0) if we indeed boosted the target task. + * false (0) if we failed to boost the target. + * -ESRCH if there's no task to yield to. */ bool __sched yield_to(struct task_struct *p, bool preempt) { @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + /* + * If we're the only runnable task on the rq and target rq also + * has only one task, there's absolutely no point in yielding. + */ + if (rq-nr_running == 1 p_rq-nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } Looks good to me in principle. Would be nice to get more consistent benchmark numbers. Once those are unambiguously showing that this is a win: Acked-by: Ingo Molnar mi...@kernel.org Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86, add hypervisor name to dump_stack() [v2]
* Prarit Bhargava pra...@redhat.com wrote: Debugging crash, panics, stack trace WARN_ONs, etc., from both virtual and bare-metal boots can get difficult very quickly. While there are ways to decipher the output and determine if the output is from a virtual guest, the in-kernel hypervisors now have a single registration point and set x86_hyper. We can use this to output additional debug information during a panic/oops/stack trace. Signed-off-by: Prarit Bhargava pra...@redhat.com Cc: Avi Kivity a...@redhat.com Cc: Gleb Natapov g...@redhat.com Cc: Alex Williamson alex.william...@redhat.com Cc: Marcelo Tostatti mtosa...@redhat.com Cc: Ingo Molnar mi...@redhat.com Cc: kvm@vger.kernel.org Cc: x...@kernel.org [v2]: Modifications suggested by Ingo and added changes for similar output from process.c --- arch/x86/kernel/dumpstack.c | 11 ++- arch/x86/kernel/process.c | 12 +++- 2 files changed, 21 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c index ae42418b..5dd680f 100644 --- a/arch/x86/kernel/dumpstack.c +++ b/arch/x86/kernel/dumpstack.c @@ -16,6 +16,7 @@ #include linux/nmi.h #include linux/sysfs.h +#include asm/hypervisor.h #include asm/stacktrace.h @@ -186,9 +187,17 @@ void dump_stack(void) { unsigned long bp; unsigned long stack; + const char *machine_name = x86; + const char *kernel_type = native; + + if (x86_hyper) { + machine_name = x86_hyper-name; + kernel_type = guest; + } bp = stack_frame(current, NULL); - printk(Pid: %d, comm: %.20s %s %s %.*s\n, + printk([%s %s kernel] Pid: %d, comm: %.20s %s %s %.*s\n, + machine_name, kernel_type, I'd put the kernel info at the end of the line. It's all very exciting I know, because we are working on this printout right now and all that - but to users and developers the PID/comm output plus the backtrace is far more important. current-pid, current-comm, print_tainted(), init_utsname()-release, (int)strcspn(init_utsname()-version, ), diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index b644e1c..14bd064 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -28,6 +28,7 @@ #include asm/fpu-internal.h #include asm/debugreg.h #include asm/nmi.h +#include asm/hypervisor.h /* * per-CPU TSS segments. Threads are completely 'soft' on Linux, @@ -124,6 +125,13 @@ void exit_thread(void) void show_regs_common(void) { const char *vendor, *product, *board; + const char *machine_name = x86; + const char *kernel_type = native; + + if (x86_hyper) { + machine_name = x86_hyper-name; + kernel_type = guest; + } vendor = dmi_get_system_info(DMI_SYS_VENDOR); if (!vendor) @@ -135,7 +143,9 @@ void show_regs_common(void) /* Board Name is optional */ board = dmi_get_system_info(DMI_BOARD_NAME); - printk(KERN_DEFAULT Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s\n, + printk(KERN_DEFAULT +[%s %s kernel] Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s\n, +machine_name, kernel_type, current-pid, current-comm, print_tainted(), init_utsname()-release, (int)strcspn(init_utsname()-version, ), Ha, duplicate code doing almost the same thing! I suspect you know what my next suggestion would be? :-) Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/3] lockdep: be nice about compiling from userspace
* Sasha Levin sasha.le...@oracle.com wrote: We can rather easily make lockdep work from userspace, although 3 issues remain which I'm not sure about: - Kernel naming - we can just wrap init_utsname() to return kvmtool related utsname, is that what we want though? - static_obj() - I don't have a better idea than calling mprobe(), which sounds wrong as well. - debug_show_all_locks() - we don't actually call it from userspace yet, but I think we might want to, so I'm not sure how to make it pretty using existing kernel code. Signed-off-by: Sasha Levin sasha.le...@oracle.com --- kernel/lockdep.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/kernel/lockdep.c b/kernel/lockdep.c index 7981e5b..fdd3670 100644 --- a/kernel/lockdep.c +++ b/kernel/lockdep.c @@ -567,10 +567,12 @@ static void lockdep_print_held_locks(struct task_struct *curr) static void print_kernel_ident(void) { +#ifdef __KERNEL__ printk(%s %.*s %s\n, init_utsname()-release, (int)strcspn(init_utsname()-version, ), init_utsname()-version, print_tainted()); +#endif I guess wrapping init_utsname() is not worth it. Although kvmtool could provide the host system's utsname - kernel identity is useful for debugging info. You could generate a Git hash version string like tools/perf/ does (see PERF_VERSION and tools/perf/util/PERF-VERSION-GEN), and put that into the -version field. -release could be the kvmtool version, and print_tainted() could return an empty string. That way you could provide init_utsname() and could remove this #ifdef. } static int very_verbose(struct lock_class *class) @@ -586,6 +588,7 @@ static int very_verbose(struct lock_class *class) */ static int static_obj(void *obj) { +#ifdef __KERNEL__ unsigned long start = (unsigned long) _stext, end = (unsigned long) _end, addr = (unsigned long) obj; @@ -609,6 +612,8 @@ static int static_obj(void *obj) * module static or percpu var? */ return is_module_address(addr) || is_module_percpu_address(addr); +#endif + return 1; Could you put an: #ifndef static_obj around it? Then kvmtool could define its own trivial version of static_obj(): #define static_obj(x) 1U or so. @@ -4108,7 +4113,7 @@ void debug_check_no_locks_held(struct task_struct *task) if (unlikely(task-lockdep_depth 0)) print_held_locks_bug(task); } - +#ifdef __KERNEL__ void debug_show_all_locks(void) { struct task_struct *g, *p; I guess a show-all-locks functionality would be useful to kvmtool as well? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/3] lockdep: be nice about compiling from userspace
* Sasha Levin sasha.le...@oracle.com wrote: On 10/25/2012 04:05 AM, Ingo Molnar wrote: * Sasha Levin sasha.le...@oracle.com wrote: We can rather easily make lockdep work from userspace, although 3 issues remain which I'm not sure about: - Kernel naming - we can just wrap init_utsname() to return kvmtool related utsname, is that what we want though? - static_obj() - I don't have a better idea than calling mprobe(), which sounds wrong as well. - debug_show_all_locks() - we don't actually call it from userspace yet, but I think we might want to, so I'm not sure how to make it pretty using existing kernel code. Signed-off-by: Sasha Levin sasha.le...@oracle.com --- kernel/lockdep.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/kernel/lockdep.c b/kernel/lockdep.c index 7981e5b..fdd3670 100644 --- a/kernel/lockdep.c +++ b/kernel/lockdep.c @@ -567,10 +567,12 @@ static void lockdep_print_held_locks(struct task_struct *curr) static void print_kernel_ident(void) { +#ifdef __KERNEL__ printk(%s %.*s %s\n, init_utsname()-release, (int)strcspn(init_utsname()-version, ), init_utsname()-version, print_tainted()); +#endif I guess wrapping init_utsname() is not worth it. Although kvmtool could provide the host system's utsname - kernel identity is useful for debugging info. You could generate a Git hash version string like tools/perf/ does (see PERF_VERSION and tools/perf/util/PERF-VERSION-GEN), and put that into the -version field. -release could be the kvmtool version, and print_tainted() could return an empty string. That way you could provide init_utsname() and could remove this #ifdef. Yeah, we already generate the version string for 'lkvm version' anyways, so I guess I'll just add init_utsname(). } static int very_verbose(struct lock_class *class) @@ -586,6 +588,7 @@ static int very_verbose(struct lock_class *class) */ static int static_obj(void *obj) { +#ifdef __KERNEL__ unsigned long start = (unsigned long) _stext, end = (unsigned long) _end, addr = (unsigned long) obj; @@ -609,6 +612,8 @@ static int static_obj(void *obj) * module static or percpu var? */ return is_module_address(addr) || is_module_percpu_address(addr); +#endif + return 1; Could you put an: #ifndef static_obj around it? Then kvmtool could define its own trivial version of static_obj(): #define static_obj(x) 1U or so. @@ -4108,7 +4113,7 @@ void debug_check_no_locks_held(struct task_struct *task) if (unlikely(task-lockdep_depth 0)) print_held_locks_bug(task); } - +#ifdef __KERNEL__ void debug_show_all_locks(void) { struct task_struct *g, *p; I guess a show-all-locks functionality would be useful to kvmtool as well? Regarding the above two, Yes, we can wrap both static_obj() and debug_show_all_locks() with #ifndefs and let kvmtool provide it's own version of those two. Only static_obj() - I see no immediate reason why you shouldn't be able to utilize debug_show_all_locks(). 'vm debug -a' already lists all backtraces on all vcpus - so 'vm debug lockdep -a' could list all current locks and indicate which one is held and by whom. The question is here more of a would lockdep maintainers be ok with adding that considering there's no in-kernel justification for those? Yeah, such a simple patch would be acceptable to me, being nice isn't against the law. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86, add hypervisor name to dump_stack()
* Prarit Bhargava pra...@redhat.com wrote: Debugging crash, panics, stack trace WARN_ONs, etc., from both virtual and bare-metal boots can get difficult very quickly. While there are ways to decipher the output and determine if the output is from a virtual guest, the in-kernel hypervisors now have a single registration point and set x86_hyper. We can use this to output a single extra line on virtual machines that indicates the hypervisor type. Signed-off-by: Prarit Bhargava pra...@redhat.com Cc: Avi Kivity a...@redhat.com Cc: Gleb Natapov g...@redhat.com Cc: Alex Williamson alex.william...@redhat.com Cc: Marcelo Tostatti mtosa...@redhat.com Cc: Ingo Molnar mi...@redhat.com Cc: kvm@vger.kernel.org Cc: x...@kernel.org --- arch/x86/kernel/dumpstack.c |3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c index ae42418b..75a635e 100644 --- a/arch/x86/kernel/dumpstack.c +++ b/arch/x86/kernel/dumpstack.c @@ -17,6 +17,7 @@ #include linux/sysfs.h #include asm/stacktrace.h +#include asm/hypervisor.h int panic_on_unrecovered_nmi; @@ -193,6 +194,8 @@ void dump_stack(void) init_utsname()-release, (int)strcspn(init_utsname()-version, ), init_utsname()-version); + if (x86_hyper x86_hyper-name) + printk(Hypervisor: %s\n, x86_hyper-name); show_trace(NULL, NULL, stack, bp); Looks useful, but please don't waste a full new line on it but embedd it in the already existing status line that prints details like release and version. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] kvm tools: Export DISPLAY ENV as our default host ip address
* Pekka Enberg penb...@kernel.org wrote: On 08/24/2012 02:29 PM, Asias He wrote: It is useful to run a X program in guest and display it on host. 1) Make host's x server listen to localhost:6000 host_shell$ socat -d -d TCP-LISTEN:6000,fork,bind=localhost \ UNIX-CONNECT:/tmp/.X11-unix/X0 2) Start the guest and run X program host_shell$ lkvm run -k /boot/bzImage guest_shell$ xlogo On Tue, Sep 4, 2012 at 4:07 PM, Avi Kivity a...@redhat.com wrote: Note, this is insecure, don't do this with untrusted guests. Asias, can we add a command line argument that enables this? It'd be safer to keep it disabled by default. It might also be prudent to name the option in a way that signals that the user of it understands the security implications: --X11-trusted-guest 1 or so. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/7] kvm tools: enable LTO
* Pekka Enberg penb...@kernel.org wrote: On Thu, Aug 30, 2012 at 10:36 AM, Sasha Levin levinsasha...@gmail.com wrote: Build with -flto set, which should enable link-time-optimizations. I'm not sure if it provides a significant performance increase, but it's probably just worth it for catching issues which it may cause. Signed-off-by: Sasha Levin levinsasha...@gmail.com Ingo, any objections to this? No objections if you can live with a 2x-4x increase in build time - at worst it might cause funnies with the BIOS linker script and such. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] apic: add apic_set_eoi_write for PV use
* Michael S. Tsirkin m...@redhat.com wrote: KVM PV EOI optimization overrides eoi_write apic op with its own version. Add an API for this to avoid meddling with core x86 apic driver data structures directly. For KVM use, we don't need any guarantees about when the switch to the new op will take place, so it could in theory use this API after SMP init, but it currently doesn't, and restricting callers to early init makes it clear that it's safe as it won't race with actual APIC driver use. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- arch/x86/include/asm/apic.h | 3 +++ arch/x86/kernel/apic/apic.c | 17 + 2 files changed, 20 insertions(+) diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index aa5b2ee..ff8dff6 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -469,6 +469,8 @@ static inline u32 safe_apic_wait_icr_idle(void) return apic-safe_wait_icr_idle(); } +extern void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v)); + #else /* CONFIG_X86_LOCAL_APIC */ static inline u32 apic_read(u32 reg) { return 0; } @@ -478,6 +480,7 @@ static inline u64 apic_icr_read(void) { return 0; } static inline void apic_icr_write(u32 low, u32 high) { } static inline void apic_wait_icr_idle(void) { } static inline u32 safe_apic_wait_icr_idle(void) { return 0; } +static inline void apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v)) {} #endif /* CONFIG_X86_LOCAL_APIC */ diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 39a222e..c7520b6 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -2124,6 +2124,23 @@ void default_init_apic_ldr(void) } /* + * Override the generic EOI implementation with an optimized version. + * Only called during early boot when only one CPU is active and with + * interrupts disabled, so we know this does not race with actual APIC driver + * use. + */ +void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v)) +{ + struct apic **drv; + + for (drv = __apicdrivers; drv __apicdrivers_end; drv++) { + /* Should happen once for each apic */ + WARN_ON((*drv)-eoi_write == eoi_write); + (*drv)-eoi_write = eoi_write; + } +} + ok, it's better this way. Acked-by: Ingo Molnar mi...@kernel.org Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V2 2/2] kvm PLE handler: Choose better candidate for directed yield
* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1595,6 +1595,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me) continue; if (waitqueue_active(vcpu-wq)) continue; + if (!kvm_arch_vcpu_check_and_update_eligible(vcpu)) { + continue; + } Saw this fly by for a second time and now it annoyed me enough to mention it: those curly braces are superfluous and should be dropped. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] apic: fix kvm build on UP without IOAPIC
* Michael S. Tsirkin m...@redhat.com wrote: On Fri, Jul 06, 2012 at 04:12:23PM +0200, Ingo Molnar wrote: * Marcelo Tosatti mtosa...@redhat.com wrote: On Fri, Jul 06, 2012 at 01:13:14PM +0200, Ingo Molnar wrote: * H. Peter Anvin h...@zytor.com wrote: On 07/01/2012 08:05 AM, Michael S. Tsirkin wrote: On UP i386, when APIC is disabled # CONFIG_X86_UP_APIC is not set # CONFIG_PCI_IOAPIC is not set code looking at apicdrivers never has any effect but it still gets compiled in. In particular, this causes build failures with kvm, but it generally bloats the kernel unnecessarily. Fix by defining both __apicdrivers and __apicdrivers_end to be NULL when CONFIG_X86_LOCAL_APIC is unset: I verified that as the result any loop scanning __apicdrivers gets optimized out by the compiler. Warning: a .config with apic disabled doesn't seem to boot for me (even without this patch). Still verifying why, meanwhile this patch is compile-tested only. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Note: if this patch makes sense, can x86 maintainers please ACK applying it through the kvm tree, since that is where we see the issue that it addresses? Avi, Marcelo, maybe you can carry this in kvm/linux-next as a temporary measure so that linux-next builds? I'm not happy about that as a workflow, but since you guys have an immediate problem I guess we can do that. I'm rather unhappy about this workflow - we've got quite a few apic bits in the x86 tree this cycle as well and need extra external interaction, not. Which KVM tree commit caused this, could someone please give a lkml link or quote it here? It's not referenced in the fix patch either. Thanks, Ingo This tree (kvm.git next): http://git.kernel.org/?p=virt/kvm/kvm.git;a=shortlog;h=refs/heads/next Introduced by this commit: http://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=ab9cf4996bb989983e73da894b8dd0239aa2c3c2 This bit: + if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) { + struct apic **drv; + + for (drv = __apicdrivers; drv __apicdrivers_end; drv++) { + /* Should happen once for each apic */ + WARN_ON((*drv)-eoi_write == kvm_guest_apic_eoi_write); + (*drv)-eoi_write = kvm_guest_apic_eoi_write; + } + } + is rather disgusting I have to say. WTH is the KVM code meddling with core x86 apic driver data structures directly? At minimum factor this out and create a proper apic.c function which is EXPORT_SYMBOL_GPL() exported or so... Thanks, Ingo OK, so apic_set_eoi_write()? Yes, with a changelog comment analyzing the design decisions and locking here - what happens if actual APIC driver use races with this update on SMP, why is it all safe, etc? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] apic: fix kvm build on UP without IOAPIC
* H. Peter Anvin h...@zytor.com wrote: On 07/01/2012 08:05 AM, Michael S. Tsirkin wrote: On UP i386, when APIC is disabled # CONFIG_X86_UP_APIC is not set # CONFIG_PCI_IOAPIC is not set code looking at apicdrivers never has any effect but it still gets compiled in. In particular, this causes build failures with kvm, but it generally bloats the kernel unnecessarily. Fix by defining both __apicdrivers and __apicdrivers_end to be NULL when CONFIG_X86_LOCAL_APIC is unset: I verified that as the result any loop scanning __apicdrivers gets optimized out by the compiler. Warning: a .config with apic disabled doesn't seem to boot for me (even without this patch). Still verifying why, meanwhile this patch is compile-tested only. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Note: if this patch makes sense, can x86 maintainers please ACK applying it through the kvm tree, since that is where we see the issue that it addresses? Avi, Marcelo, maybe you can carry this in kvm/linux-next as a temporary measure so that linux-next builds? I'm not happy about that as a workflow, but since you guys have an immediate problem I guess we can do that. I'm rather unhappy about this workflow - we've got quite a few apic bits in the x86 tree this cycle as well and need extra external interaction, not. Which KVM tree commit caused this, could someone please give a lkml link or quote it here? It's not referenced in the fix patch either. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] apic: fix kvm build on UP without IOAPIC
* Marcelo Tosatti mtosa...@redhat.com wrote: On Fri, Jul 06, 2012 at 01:13:14PM +0200, Ingo Molnar wrote: * H. Peter Anvin h...@zytor.com wrote: On 07/01/2012 08:05 AM, Michael S. Tsirkin wrote: On UP i386, when APIC is disabled # CONFIG_X86_UP_APIC is not set # CONFIG_PCI_IOAPIC is not set code looking at apicdrivers never has any effect but it still gets compiled in. In particular, this causes build failures with kvm, but it generally bloats the kernel unnecessarily. Fix by defining both __apicdrivers and __apicdrivers_end to be NULL when CONFIG_X86_LOCAL_APIC is unset: I verified that as the result any loop scanning __apicdrivers gets optimized out by the compiler. Warning: a .config with apic disabled doesn't seem to boot for me (even without this patch). Still verifying why, meanwhile this patch is compile-tested only. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Note: if this patch makes sense, can x86 maintainers please ACK applying it through the kvm tree, since that is where we see the issue that it addresses? Avi, Marcelo, maybe you can carry this in kvm/linux-next as a temporary measure so that linux-next builds? I'm not happy about that as a workflow, but since you guys have an immediate problem I guess we can do that. I'm rather unhappy about this workflow - we've got quite a few apic bits in the x86 tree this cycle as well and need extra external interaction, not. Which KVM tree commit caused this, could someone please give a lkml link or quote it here? It's not referenced in the fix patch either. Thanks, Ingo This tree (kvm.git next): http://git.kernel.org/?p=virt/kvm/kvm.git;a=shortlog;h=refs/heads/next Introduced by this commit: http://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=ab9cf4996bb989983e73da894b8dd0239aa2c3c2 This bit: + if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) { + struct apic **drv; + + for (drv = __apicdrivers; drv __apicdrivers_end; drv++) { + /* Should happen once for each apic */ + WARN_ON((*drv)-eoi_write == kvm_guest_apic_eoi_write); + (*drv)-eoi_write = kvm_guest_apic_eoi_write; + } + } + is rather disgusting I have to say. WTH is the KVM code meddling with core x86 apic driver data structures directly? At minimum factor this out and create a proper apic.c function which is EXPORT_SYMBOL_GPL() exported or so... Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: optimize ISR lookups
* Avi Kivity a...@redhat.com wrote: On 05/22/2012 02:01 AM, Thomas Gleixner wrote: Others are not my fault :) Seriously, if Avi/Marcelo want to rewrite the ISR emulation Interesting POV, really. Did you ever notice that the kernel is a collaborative effort and not controlled by Avi/Marcelo? Did you ever notice that arch/x86/kvm is part of arch/x86? This is silly. Most of the time the kernel is advanced by incremental patches. Sometimes it is advanced by minor or major refactoring. It is never advanced by personal attacks on contributors. Thomas wasn't so much doing a personal attack, it was pointing out stupidity and then it was mocking the repeated stupidity. He very politely explained his point of view (with which I agree), and then you guys pressed the issue and there's just so many hours in the merge window, so you asked to be flamed ... Avi, if you cannot be brought to properly reject incomplete patches going in the wrong direction then others maintainers interested in the code will do it. If you start to consistently require from KVM contributors incremental updates in the right direction, not piling crap on crap, then such incidents won't happen. This isn't the first such incident but there's hope that it might be the last one. The rule in arch/x86/ (and many other subsystems) is very simple: we don't speed up crappy code. If you want to speed it up then make it clean first, *then* is it suited for speedups. Crappy code is fragile and bound to introduce bugs, and crappy code leads to continued increased maintenance overhead, so crappy code is basically under a perpetual code freeze until it's uncrapped. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] sched: make callers check lock contention for cond_resched_lock()
* Takuya Yoshikawa takuya.yoshik...@gmail.com wrote: Replaced Ingo's address with kernel.org one, On Thu, 03 May 2012 17:47:30 +0200 Peter Zijlstra pet...@infradead.org wrote: On Thu, 2012-05-03 at 22:00 +0900, Takuya Yoshikawa wrote: But as I could not see why spin_needbreak() was differently implemented depending on CONFIG_PREEMPT, I wanted to understand the meaning. Its been that way since before voluntary preemption was introduced, so its possible Ingo simply missed that spot and nobody noticed until now. Ingo, do you have any recollections from back when? ping I'm not sure we had a usable spin_is_contended() back then, nor was the !PREEMPT case in my mind really. ( The patch looks ugly though, in 99% of the lines it just does something that cond_resched_lock() itself could do. ) Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementation for both Xen and KVM.(targeted for 3.5 window) Note: This needs debugfs changes patch that should be in Xen / linux-next https://lkml.org/lkml/2012/3/30/687 Changes in V8: - Reabsed patches to 3.4-rc4 - Combined the KVM changes with ticketlock + Xen changes (Ingo) - Removed CAP_PV_UNHALT since it is redundant (Avi). But note that we need newer qemu which uses KVM_GET_SUPPORTED_CPUID ioctl. - Rewrite GET_MP_STATE condition (Avi) - Make pv_unhalt = bool (Avi) - Move out reset pv_unhalt code to vcpu_run from vcpu_block (Gleb) - Documentation changes (Rob Landley) - Have a printk to recognize that paravirt spinlock is enabled (Nikunj) - Move out kick hypercall out of CONFIG_PARAVIRT_SPINLOCK now so that it can be used for other optimizations such as flush_tlb_ipi_others etc. (Nikunj) Ticket locks have an inherent problem in a virtualized case, because the vCPUs are scheduled rather than running concurrently (ignoring gang scheduled vCPUs). This can result in catastrophic performance collapses when the vCPU scheduler doesn't schedule the correct next vCPU, and ends up scheduling a vCPU which burns its entire timeslice spinning. (Note that this is not the same problem as lock-holder preemption, which this series also addresses; that's also a problem, but not catastrophic). (See Thomas Friebel's talk Prevent Guests from Spinning Around http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.) Currently we deal with this by having PV spinlocks, which adds a layer of indirection in front of all the spinlock functions, and defining a completely new implementation for Xen (and for other pvops users, but there are none at present). PV ticketlocks keeps the existing ticketlock implemenentation (fastpath) as-is, but adds a couple of pvops for the slow paths: - If a CPU has been waiting for a spinlock for SPIN_THRESHOLD iterations, then call out to the __ticket_lock_spinning() pvop, which allows a backend to block the vCPU rather than spinning. This pvop can set the lock into slowpath state. - When releasing a lock, if it is in slowpath state, the call __ticket_unlock_kick() to kick the next vCPU in line awake. If the lock is no longer in contention, it also clears the slowpath flag. The slowpath state is stored in the LSB of the within the lock tail ticket. This has the effect of reducing the max number of CPUs by half (so, a small ticket can deal with 128 CPUs, and large ticket 32768). For KVM, one hypercall is introduced in hypervisor,that allows a vcpu to kick another vcpu out of halt state. The blocking of vcpu is done using halt() in (lock_spinning) slowpath. Overall, it results in a large reduction in code, it makes the native and virtualized cases closer, and it removes a layer of indirection around all the spinlock functions. The fast path (taking an uncontended lock which isn't in slowpath state) is optimal, identical to the non-paravirtualized case. The inner part of ticket lock code becomes: inc = xadd(lock-tickets, inc); inc.tail = ~TICKET_SLOWPATH_FLAG; if (likely(inc.head == inc.tail)) goto out; for (;;) { unsigned count = SPIN_THRESHOLD; do { if (ACCESS_ONCE(lock-tickets.head) == inc.tail) goto out; cpu_relax(); } while (--count); __ticket_lock_spinning(lock, inc.tail); } out: barrier(); which results in: push %rbp mov%rsp,%rbp mov$0x200,%eax lock xadd %ax,(%rdi) movzbl %ah,%edx cmp%al,%dl jne1f # Slowpath if lock in contention pop%rbp retq ### SLOWPATH START 1:and$-2,%edx movzbl %dl,%esi 2:mov$0x800,%eax jmp4f 3:pause sub$0x1,%eax je 5f 4:movzbl (%rdi),%ecx cmp%cl,%dl jne3b pop%rbp retq 5:callq *__ticket_lock_spinning jmp2b ### SLOWPATH END with CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where the fastpath case is straight through (taking the lock without contention), and the spin loop is out of line: push %rbp mov%rsp,%rbp mov$0x100,%eax lock xadd %ax,(%rdi) movzbl %ah,%edx cmp%al,%dl jne1f pop%rbp retq ### SLOWPATH START 1:pause movzbl (%rdi),%eax cmp%dl,%al jne1b pop%rbp retq ### SLOWPATH END The unlock code is complicated by the need to
Re: [PATCH RFC 0/5] apic: eoi optimization support
* Michael S. Tsirkin m...@redhat.com wrote: I'm looking at reducing the interrupt overhead for virtualized guests: some workloads spend a large part of their time processing interrupts. This patchset supplies infrastructure to reduce the IRQ ack overhead on x86: the idea is to add an eoi_write callback that we can then optimize without touching other apic functionality. The main user will be kvm: on kvm, an EOI write from the guest causes an expensive exit to host; we can avoid this using shared memory as the last patch in the series demonstrates. But I also wrote a micro-optimized version for the regular x2apic: this shaves off a branch and about 9 instructions from EOI when x2apic is used, and a comment in ack_APIC_irq implies that someone counted instructions there, at some point. Also included in the patchset are a couple of trivial macro fixes. The patches work fine on my boxes and I did look at the objdump output to verify that the generated code for the micro-optimization patch looks right and actually is shorter. Some benchmark results below (not sure what kind of testing is the most appropriate) show a tiny but measureable improvement. The tests were run on an AMD box with 24 cpus. - A clean kernel build after reboot shows a tiny but measureable improvement in system time which means lower CPU overhead (though not measureable in total time - that is dominated by user time and fluctuates too much): linux# reboot -f ... linux# make clean linux# time make -j 64 LOCALVERSION= 21 /dev/null Before: real2m52.244s user35m53.833s sys 6m7.194s After: real2m52.827s user35m48.916s sys 6m2.305s - perf micro-benchmarks seem to consistently show a tiny improvement in total time as well but it's below the confidence level of 3 std deviations: # ./tools/perf/perf stat --sync --repeat 100 --null perf bench sched messaging ... 0.414666797 seconds time elapsed ( +- 1.29% ) Performance counter stats for 'perf bench sched messaging' (100 runs): 0.395370891 seconds time elapsed ( +- 1.04% ) # ./tools/perf/perf stat --sync --repeat 100 --null perf bench sched pipe -l 1 0.307019664 seconds time elapsed ( +- 0.10% ) 0.304738024 seconds time elapsed ( +- 0.08% ) The patches are against 3.4-rc3 - let me know if I need to rebase. I think patches 1-2 are definitely a good idea, and patches 3-4 might be a good idea. Please review, and consider patches 1-4 for linux 3.5. Thanks, MST Michael S. Tsirkin (5): apic: fix typo EIO_ACK - EOI_ACK and document apic: use symbolic APIC_EOI_ACK x86: add apic-eoi_write callback x86: eoi micro-optimization kvm_para: guest side for eoi avoidance arch/x86/include/asm/apic.h| 22 -- arch/x86/include/asm/apicdef.h |2 +- arch/x86/include/asm/bitops.h |6 ++- arch/x86/include/asm/kvm_para.h|2 + arch/x86/kernel/apic/apic_flat_64.c|2 + arch/x86/kernel/apic/apic_noop.c |1 + arch/x86/kernel/apic/apic_numachip.c |1 + arch/x86/kernel/apic/bigsmp_32.c |1 + arch/x86/kernel/apic/es7000_32.c |2 + arch/x86/kernel/apic/numaq_32.c|1 + arch/x86/kernel/apic/probe_32.c|1 + arch/x86/kernel/apic/summit_32.c |1 + arch/x86/kernel/apic/x2apic_cluster.c |1 + arch/x86/kernel/apic/x2apic_phys.c |1 + arch/x86/kernel/apic/x2apic_uv_x.c |1 + arch/x86/kernel/kvm.c | 51 ++-- arch/x86/platform/visws/visws_quirks.c |2 +- 17 files changed, 88 insertions(+), 10 deletions(-) No objections from the x86 side. In terms of advantages, could you please create perf stat runs that counts the number of MMIOs or so? That should show a pretty obvious improvement - and that is enough as proof, no need to try to reproduce the performance win in a noisy benchmark. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/5] apic: eoi optimization support
* Michael S. Tsirkin m...@redhat.com wrote: On Mon, May 07, 2012 at 12:35:12PM +0200, Ingo Molnar wrote: Michael S. Tsirkin (5): apic: fix typo EIO_ACK - EOI_ACK and document apic: use symbolic APIC_EOI_ACK x86: add apic-eoi_write callback x86: eoi micro-optimization kvm_para: guest side for eoi avoidance arch/x86/include/asm/apic.h| 22 -- arch/x86/include/asm/apicdef.h |2 +- arch/x86/include/asm/bitops.h |6 ++- arch/x86/include/asm/kvm_para.h|2 + arch/x86/kernel/apic/apic_flat_64.c|2 + arch/x86/kernel/apic/apic_noop.c |1 + arch/x86/kernel/apic/apic_numachip.c |1 + arch/x86/kernel/apic/bigsmp_32.c |1 + arch/x86/kernel/apic/es7000_32.c |2 + arch/x86/kernel/apic/numaq_32.c|1 + arch/x86/kernel/apic/probe_32.c|1 + arch/x86/kernel/apic/summit_32.c |1 + arch/x86/kernel/apic/x2apic_cluster.c |1 + arch/x86/kernel/apic/x2apic_phys.c |1 + arch/x86/kernel/apic/x2apic_uv_x.c |1 + arch/x86/kernel/kvm.c | 51 ++-- arch/x86/platform/visws/visws_quirks.c |2 +- 17 files changed, 88 insertions(+), 10 deletions(-) No objections from the x86 side. Is kvm.git a good tree to merge this through? Fine to me, but I haven't checked how widely it conflicts with existing bits: by the looks of it most of the linecount is on the core x86 side, while the kvm change is well concentrated. In terms of advantages, could you please create perf stat runs that counts the number of MMIOs or so? That should show a pretty obvious improvement - and that is enough as proof, no need to try to reproduce the performance win in a noisy benchmark. You mean with kvm PV, right? On real hardware the micro-optimization removes branches and maybe cache-misses but I don't see why would it reduce MMIOs. Yeah, on KVM. On real hw I doubt it's measurable. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/5] apic: eoi optimization support
* Avi Kivity a...@redhat.com wrote: On 05/07/2012 02:40 PM, Ingo Molnar wrote: No objections from the x86 side. Is kvm.git a good tree to merge this through? Fine to me, but I haven't checked how widely it conflicts with existing bits: by the looks of it most of the linecount is on the core x86 side, while the kvm change is well concentrated. I don't see a problem with merging though tip.git - we're close to the next merge window, and the guest side rarely causes conflicts. But please don't apply the last patch yet, I want to review it more closely (esp. with the host side). That last patch was marked don't apply yet, so I definitely planned on another iteration that incorporates all the feedback that has been given. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Avi Kivity a...@redhat.com wrote: PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit. I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile. I'll hold off on the whole thing - frankly, we don't want this kind of Xen-only complexity. If KVM can make use of PLE then Xen ought to be able to do it as well. If both Xen and KVM makes good use of it then that's a different matter. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
* Andi Kleen a...@firstfloor.org wrote: Care to back that up with numbers and proper trace evidence instead of handwaving? E.g. my plumbers presentations on lock and mm scalability from last year has some graphs that show this very clearly, plus some additional data on the mutexes. This compares to the glibc futex locks, which perform much better than the kernel mutex locks on larger systems under higher contention If you mean these draft slides: http://www.halobates.de/plumbers-fork-locks_v2.pdf it has very little verifiable information in it. It just cryptically says lock hold time microbenchmark, which might or might not be a valid measurement. You could have been honest and straightforward in your first mail: I ran workload X on machine Y, and got results Z. Instead you are *hindering* the discussion: Given your tone I will not supply an URL. [...] If you meant the above URL then it's not the proper numbers Thomas asked for, just some vague slides. If you meant something else then put up or shut up. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] KVM updates for the 3.4 merge window
* Avi Kivity a...@redhat.com wrote: Say a fix comes in which needs to be mainlined during -rc. So I put it in some other branch, to be sent off to Linus in a few days after maturing a little. Meanwhile developers see an incomplete tree, since that patch is not in it. Once Linus pulls, I can merge it back (or even before, if I'm reasonably certain it's not going to change), but it leaves a history of unneeded merges. Or we can do throwaway merges like tip.git. We don't do throwaway merges in the -tip development branches themselves, i.e. in tip:sched/core, tip:perf/core, tip:timers/core, etc. When a fix goes into tip:sched/urgent then until Linus merges it it's not in tip:sched/core. 99% of the fixes don't *have to* go into sched/core straight away. In the odd case where there's some dependency, we can manually merge it into tip:sched/core ahead of Linus pulling into an -rc. Those rare merges are not a problem, and I explain the reason in the merge commit itself. If you look at: gll v3.2..v3.3 | grep -E '/urgent.*/core' you'll see that I only had to do it once in the previous cycle: d6c1c49de577 Merge branch 'perf/urgent' into perf/core and the changelog explains the background: Merge reason: Add these cherry-picked commits so that future changes on perf/core don't conflict. it was a rare, oddball situation where we cherry-picked perf/core changes into perf/urgent. Extra merges are perfectly fine in that case. The 'throwaway' tip:master branch you are probably referring to is basically just a testing branch, a convenient merged tree of the one dozen maintainer trees that are hosted in -tip. Since we don't want to force Linus's hand of him being able to reject individual trees we don't merge them properly - hence the integrated tree is a throwaway tree in theory. In practice I tend to throw it away only once per cycle, around -rc1, once all pending trees went to Linus. tip:master is not used for any Git based contribution work - it's for testing, it's for people who want to work with patches - the commits themselves always go into persistent non-rebasing, append-only Git trees. If we mess up bisectability we do a delta fix. When choosing between somewhat better bisectability and a proper history that others can rely on then proper history wins hands down. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 0/3] KVM: perf: kvm events analysis tool
* Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote: The output example is following: #./perf kvm-events report --event mmio --vcpu 3 So we already have 'perf kvm': usage: perf kvm [options] {top|record|report|diff|buildid-list} which is a sub-namespace for all things KVM instrumentation goodies. [ Arguably there should be a 'perf kvm trace' as well, but I digress. ] So, your new tool has a similar workflow to: perf kvm record perf kvm report but differs from it in terms of events used and in terms of reported output. To me it appears that your tool is basically pretty similar to 'perf stat', adapted to KVM, right? So, could your new tool's workflow be simplified like this: perf kvm stat .. ? To automatically stat all vcpus in the system, the well-known -a/--all-cpus system-wide method could be used: perf kvm stat -a ... with stat output following immediately after it has finished. It should also be possible to use those new events in a recording fashion - a new, rather logical command sub-space could be used for that: perf kvm stat record ... perf kvm stat report ... [ This could be expanded to regular 'perf stat' as well: 'perf stat record' and 'perf stat report' would be useful - but I suspect that's outside the scope of your patches. ] Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 0/3] KVM: perf: kvm events analysis tool
* Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote: Thanks for your review, Ingo! On 03/06/2012 05:07 PM, Ingo Molnar wrote: So, your new tool has a similar workflow to: perf kvm record perf kvm report but differs from it in terms of events used and in terms of reported output. To me it appears that your tool is basically pretty similar to 'perf stat', adapted to KVM, right? So, could your new tool's workflow be simplified like this: perf kvm stat .. ? To automatically stat all vcpus in the system, the well-known -a/--all-cpus system-wide method could be used: perf kvm stat -a ... with stat output following immediately after it has finished. Actually, the stat information has already been included in the report. It should also be possible to use those new events in a recording fashion - a new, rather logical command sub-space could be used for that: perf kvm stat record ... perf kvm stat report ... [ This could be expanded to regular 'perf stat' as well: 'perf stat record' and 'perf stat report' would be useful - but I suspect that's outside the scope of your patches. ] I totally agree with you except i prefer 'perf kvm events' to 'perf kvm stat' :) : it records some specified kvm events and smartly analyze it. I think it matches its doing better. Well, my problem is that *all* of the 'perf kvm' functionality deals with events :-) So the 'events' does not say much - you just add a variation to the theme without really distinguishing it. Since 'perf stat' is not yet used and the output of your tool looks quite perf stat alike, why not use that name and standardize all these workflows? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[VT-d reboot problems] Re: [PATCH] x86 / reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot
(added KVM folks to the Cc:) * Bastien ROUCARIES roucaries.bast...@gmail.com wrote: Ping^2 Bastien On Mon, Jan 23, 2012 at 11:28 AM, Bastien ROUCARIES roucaries.bast...@gmail.com wrote: On Mon, Jan 16, 2012 at 8:21 PM, H. Peter Anvin h...@zytor.com wrote: On 01/16/2012 03:27 AM, Bastien ROUCARIES wrote: Does it work if you disable VT-d in the firmware? If so, then adding it to the reboot method blacklist is the wrong fix - we need to figure out why VT-d interferes with Dell's reboot code. Yes it work This is particularly so since we are very close to having a full Dell model catalogue in the kernel... Ping ? Do you need some dump ? testing ? So disabling VT-d in the BIOS fixes the reboot problem and Matthew Garrett suggests we should figure out why and how VT-d on this Dell box interferes with the reboot method. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Use assert() helper to check a variable value
* Cyrill Gorcunov gorcu...@gmail.com wrote: On Mon, Dec 19, 2011 at 09:13:28AM +0200, Pekka Enberg wrote: - BUILD_BUG_ON(i E820_X_MAX); + assert(i = E820_X_MAX); We should use BUG_ON() like tools/perf does. We dont have it yet. So I'll introduce this helper later, but note that we will have to cover _all_ assert() calls then, so it's better to make in a separate patch. [...] There's only about a dozen of them, should be easy. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Use assert() helper to check a variable value
* Cyrill Gorcunov gorcu...@gmail.com wrote: +# +#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)])) +#define BUG_ON(condition)assert(!(condition)) Just a sidenote, the patch is fine but the above will result in weird double negated assertion messages. it's better to just do our own __BUG() function and use it. That way it can also use the kernel standard 'BUG: ...' message format. To make such BUG()s easier to debug via GDB, do this from the __BUG() handler: raise(SIGABRT); GDB will catch that signal. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Use assert() helper to check a variable value
* Cyrill Gorcunov gorcu...@gmail.com wrote: On Mon, Dec 19, 2011 at 11:40:09AM +0100, Ingo Molnar wrote: GDB will catch that signal. Yeah, good point! Pekka, drop this patch please, I'll make new one at evening. Patch is good as-is IMO - assert() was used before so it's not a regression per se. Acked-by: Ingo Molnar mi...@elte.hu The tool-specific BUG() implementation can be added as a delta on top of that. It's in fact better to keep those two steps separate. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Fix 'make kvmconfig'
* Sasha Levin levinsasha...@gmail.com wrote: Set features which depend on the config we select to make 'make kvmconfig' non-interactive. Signed-off-by: Sasha Levin levinsasha...@gmail.com Acked-by: Ingo Molnar mi...@elte.hu Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: mmap guest kernel instead of reading it into memory
* Sasha Levin levinsasha...@gmail.com wrote: + kvm-bz_start = mmap(NULL, kvm-bz_len, PROT_READ | PROT_WRITE, + MAP_PRIVATE, kvm-bz_fd, setup_end); + /* NOP everything before the kernel start */ + memset(kvm-bz_start, 0x90, setup_size - setup_end); You should really, really think about the case where mmap() fails. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Fix serial port probing
* Pekka Enberg penb...@kernel.org wrote: dev-ier = ioport__read8(data) 0x3f; + kvm__irq_line(kvm, dev-irq, dev-ier?1:0); break; case UART_LCR: dev-lcr = ioport__read8(data); Applied, thanks! Ingo, does this fix the occasional slowdowns you are seeing? Indeed it does! I still see things like 'top' refreshes progressing - i.e. taking something like 100-200 msecs to finish, but it's a *lot* faster now. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/28] kvm tools: 64-bit tidy; use PRIx64 when printf'ing u64s and link appropriately
* Matt Evans m...@ozlabs.org wrote: [...] I haven't looked closely at Matt's patches, but it should be possible to use [un]signed long long for the u64/s64 types, I would think. In tools/kvm/ we are using our own u64/s64 definitions, not glibc's, so i think it should be fine - as long as we don't pick up int-l64.h accidentally via the arch/powerpc/include/asm/types.h exception for user-space. That's what's happening here; we're __powerpc64__ and !__KERNEL__, tools/kvm/include/linux/types.h includes asm/types.h so gets the int-l64.h definition of __u64, as above. :/ builtin-run.c:389: error: format `%llx' expects type `long long unsigned int', but argument 2 has type `u64' So either define __KERNEL__ or patch a __NEW_USERSPACE__ define into power/asm/types.h and use it - if the PowerPC folks agree with that approach. Sane userspace should not be prevented from using the same sane types the kernel is already using :-) Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Add 'kvm nmi' command
* Sasha Levin levinsasha...@gmail.com wrote: This patch adds a 'kvm nmi' command which can be used to trigger NMIs in guests. Mostly useful for debugging purposes. Could we make this something like 'kvm debug nmi'? We already have some debug functionality, right? Those should be consolidated under 'kvm debug'. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/28] kvm tools: 64-bit tidy; use PRIx64 when printf'ing u64s and link appropriately
* Matt Evans m...@ozlabs.org wrote: On 08/12/11 04:14, Pekka Enberg wrote: On Wed, 7 Dec 2011, Ingo Molnar wrote: * Matt Evans m...@ozlabs.org wrote: [...] I haven't looked closely at Matt's patches, but it should be possible to use [un]signed long long for the u64/s64 types, I would think. In tools/kvm/ we are using our own u64/s64 definitions, not glibc's, so i think it should be fine - as long as we don't pick up int-l64.h accidentally via the arch/powerpc/include/asm/types.h exception for user-space. That's what's happening here; we're __powerpc64__ and !__KERNEL__, tools/kvm/include/linux/types.h includes asm/types.h so gets the int-l64.h definition of __u64, as above. :/ builtin-run.c:389: error: format `%llx' expects type `long long unsigned int', but argument 2 has type `u64' So either define __KERNEL__ or patch a __NEW_USERSPACE__ define into power/asm/types.h and use it - if the PowerPC folks agree with that approach. Sane userspace should not be prevented from using the same sane types the kernel is already using :-) How does perf handle this? I'm sure it has the exact same issue, doesn't it? It does; ironically it uses PRIblah, so I had followed its example. Sadly it regressed lately in that area - it was certainly PRIblah-less in the early stages :-) Pekka, how do the headers react if we define __KERNEL__? Alternatively, allowing an override in powerpc/types.h beyond __KERNEL__ looks sensible as well. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/28] kvm tools: 64-bit tidy; use PRIx64 when printf'ing u64s and link appropriately
* Matt Evans m...@ozlabs.org wrote: Since tools/kvm/include/linux/types.h only requires __u32, __u64 et al from asm/types.h, wouldn't it be most straightforward to just #include asm-generic/int-ll64.h? This avoids #define __KERNEL__ breaking other includes brought into userland, avoids changing system headers/distros, and includes the file we're really interested in on both x86 PPC. No system headers need to be changed: you can just patch powerpc/types.h with the extra __SANE_USERSPACE__ define (or any suitable name) and the KVM tool should work as-is. The beauty of tools/kvm/ being integrated into the kernel tree. Really, if that works and if the PowerPC folks agree then we have an immediate, fully adequate, instantly applicable solution. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/28] kvm tools: 64-bit tidy; use PRIx64 when printf'ing u64s and link appropriately
* Matt Evans m...@ozlabs.org wrote: [...] I haven't looked closely at Matt's patches, but it should be possible to use [un]signed long long for the u64/s64 types, I would think. In tools/kvm/ we are using our own u64/s64 definitions, not glibc's, so i think it should be fine - as long as we don't pick up int-l64.h accidentally via the arch/powerpc/include/asm/types.h exception for user-space. That's what's happening here; we're __powerpc64__ and !__KERNEL__, tools/kvm/include/linux/types.h includes asm/types.h so gets the int-l64.h definition of __u64, as above. :/ builtin-run.c:389: error: format `%llx' expects type `long long unsigned int', but argument 2 has type `u64' So either define __KERNEL__ or patch a __NEW_USERSPACE__ define into power/asm/types.h and use it - if the PowerPC folks agree with that approach. Sane userspace should not be prevented from using the same sane types the kernel is already using :-) Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/28] kvm tools: 64-bit tidy; use PRIx64 when printf'ing u64s and link appropriately
* Matt Evans m...@ozlabs.org wrote: On 08/12/11 04:14, Pekka Enberg wrote: On Wed, 7 Dec 2011, Ingo Molnar wrote: * Matt Evans m...@ozlabs.org wrote: [...] I haven't looked closely at Matt's patches, but it should be possible to use [un]signed long long for the u64/s64 types, I would think. In tools/kvm/ we are using our own u64/s64 definitions, not glibc's, so i think it should be fine - as long as we don't pick up int-l64.h accidentally via the arch/powerpc/include/asm/types.h exception for user-space. That's what's happening here; we're __powerpc64__ and !__KERNEL__, tools/kvm/include/linux/types.h includes asm/types.h so gets the int-l64.h definition of __u64, as above. :/ builtin-run.c:389: error: format `%llx' expects type `long long unsigned int', but argument 2 has type `u64' So either define __KERNEL__ or patch a __NEW_USERSPACE__ define into power/asm/types.h and use it - if the PowerPC folks agree with that approach. Sane userspace should not be prevented from using the same sane types the kernel is already using :-) How does perf handle this? I'm sure it has the exact same issue, doesn't it? It does; ironically it uses PRIblah, so I had followed its example. Sadly it regressed lately in that area - it was certainly PRIblah-less in the early stages :-) Pekka, how do the headers react if we define __KERNEL__? Alternatively, allowing an override in powerpc/types.h beyond __KERNEL__ looks sensible as well. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/28] kvm tools: 64-bit tidy; use PRIx64 when printf'ing u64s and link appropriately
* Matt Evans m...@ozlabs.org wrote: Since tools/kvm/include/linux/types.h only requires __u32, __u64 et al from asm/types.h, wouldn't it be most straightforward to just #include asm-generic/int-ll64.h? This avoids #define __KERNEL__ breaking other includes brought into userland, avoids changing system headers/distros, and includes the file we're really interested in on both x86 PPC. No system headers need to be changed: you can just patch powerpc/types.h with the extra __SANE_USERSPACE__ define (or any suitable name) and the KVM tool should work as-is. The beauty of tools/kvm/ being integrated into the kernel tree. Really, if that works and if the PowerPC folks agree then we have an immediate, fully adequate, instantly applicable solution. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/28] kvm tools: 64-bit tidy; use PRIx64 when printf'ing u64s and link appropriately
* Sasha Levin levinsasha...@gmail.com wrote: Ingo actually got us to remove all the PRI* specifiers, but that was back when we only did x86 :) Ingo, does it make sense to use them now when we support different architectures? Not at all - ppc should use a sane u64/s64 definition, i.e. int-ll64.h instead of the int-l64.h crap. The powerpc maintainers indicated that they'd fix that, a couple of years ago, but nothing appears to have come out of that. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/28] kvm tools: 64-bit tidy; use PRIx64 when printf'ing u64s and link appropriately
* Paul Mackerras pau...@samba.org wrote: On Tue, Dec 06, 2011 at 09:28:27AM +0100, Ingo Molnar wrote: * Sasha Levin levinsasha...@gmail.com wrote: Ingo actually got us to remove all the PRI* specifiers, but that was back when we only did x86 :) Ingo, does it make sense to use them now when we support different architectures? Not at all - ppc should use a sane u64/s64 definition, i.e. int-ll64.h instead of the int-l64.h crap. The powerpc maintainers indicated that they'd fix that, a couple of years ago, but nothing appears to have come out of that. We changed it for the kernel, but not for userspace (i.e. glibc) because of concerns about possibly breaking existing userspace programs. [...] Indeed, you do: #if defined(__powerpc64__) !defined(__KERNEL__) # include asm-generic/int-l64.h #else # include asm-generic/int-ll64.h #endif which correctly uses int-ll64.h everywhere except 64-bit userspace inclusions. So i take back my 'nothing appears to have come out of that' comment - it's all nicely fixed! [...] I haven't looked closely at Matt's patches, but it should be possible to use [un]signed long long for the u64/s64 types, I would think. In tools/kvm/ we are using our own u64/s64 definitions, not glibc's, so i think it should be fine - as long as we don't pick up int-l64.h accidentally via the arch/powerpc/include/asm/types.h exception for user-space. Stray uses of int64 should be converted to u64 (i see some in tools/kvm/virtio/9p-pdu.c) but otherwise we should be ok - unless i'm missing some complication. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] kvm tools: Zero out event before calling epoll_ctl()
* Sasha Levin levinsasha...@gmail.com wrote: Zero out struct epoll_event before calling epoll_ctl(). This isn't really required, but is done to avoid warnings. Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/kvm-ipc.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/tools/kvm/kvm-ipc.c b/tools/kvm/kvm-ipc.c index 40ab457..68c2565 100644 --- a/tools/kvm/kvm-ipc.c +++ b/tools/kvm/kvm-ipc.c @@ -132,7 +132,7 @@ static void *kvm_ipc__thread(void *param) int kvm_ipc__start(int sock) { - struct epoll_event ev; + struct epoll_event ev = {0}; server_fd = sock; While it is true that epoll_ctl() currently ignores certain fields so we could leave them uninitialized but generally it's very good practice to only pass in well-defined values to kernel syscalls, so the zeroing is a quality of implementation issue as well, not just a Valgrind warning fix. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/28] kvm tools: 64-bit tidy; use PRIx64 when printf'ing u64s and link appropriately
* Sasha Levin levinsasha...@gmail.com wrote: Ingo actually got us to remove all the PRI* specifiers, but that was back when we only did x86 :) Ingo, does it make sense to use them now when we support different architectures? Not at all - ppc should use a sane u64/s64 definition, i.e. int-ll64.h instead of the int-l64.h crap. The powerpc maintainers indicated that they'd fix that, a couple of years ago, but nothing appears to have come out of that. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/28] kvm tools: 64-bit tidy; use PRIx64 when printf'ing u64s and link appropriately
* Paul Mackerras pau...@samba.org wrote: On Tue, Dec 06, 2011 at 09:28:27AM +0100, Ingo Molnar wrote: * Sasha Levin levinsasha...@gmail.com wrote: Ingo actually got us to remove all the PRI* specifiers, but that was back when we only did x86 :) Ingo, does it make sense to use them now when we support different architectures? Not at all - ppc should use a sane u64/s64 definition, i.e. int-ll64.h instead of the int-l64.h crap. The powerpc maintainers indicated that they'd fix that, a couple of years ago, but nothing appears to have come out of that. We changed it for the kernel, but not for userspace (i.e. glibc) because of concerns about possibly breaking existing userspace programs. [...] Indeed, you do: #if defined(__powerpc64__) !defined(__KERNEL__) # include asm-generic/int-l64.h #else # include asm-generic/int-ll64.h #endif which correctly uses int-ll64.h everywhere except 64-bit userspace inclusions. So i take back my 'nothing appears to have come out of that' comment - it's all nicely fixed! [...] I haven't looked closely at Matt's patches, but it should be possible to use [un]signed long long for the u64/s64 types, I would think. In tools/kvm/ we are using our own u64/s64 definitions, not glibc's, so i think it should be fine - as long as we don't pick up int-l64.h accidentally via the arch/powerpc/include/asm/types.h exception for user-space. Stray uses of int64 should be converted to u64 (i see some in tools/kvm/virtio/9p-pdu.c) but otherwise we should be ok - unless i'm missing some complication. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [F.A.Q.] the advantages of a shared tool/kernel Git repository, tools/perf/ and tools/kvm/
* Anca Emanuel anca.eman...@gmail.com wrote: I'd even argue that that C library is obviously something the kernelshould offer as well - so klibc is the way to go and would help usfurther streamline this and keep Linux quality high. I think there is code to share. Why not ? The biggest downside of libc integration into the kernel would be that the libc ABI is *vastly* larger than the kernel ABI, and i'm not sure the kernel community is good enough to handle that. It's roughly 3000 ABI components compared to the 300 ABI functions the kernel has today - so at least an order of magnitude larger... The biggest upside of libc integration into the kernel would be that we could push Linux kernel improvements into the C library - and thus to apps - immediately, along a much larger ABI surface. The 'specialization' resolution of the libc ABI is an order of magnitude larger than that of the kernel's, giving many more opportunities for good, workload specific optimizations and unique solutions. Today the latency of getting a kernel improvement to applications via a change in the C library is above a year, so most kernel people don't actually try to improve the C library but try to find improvements on the kernel level which gets to a distro within a couple of months. If the kernel offers a /proc/libc.so.6 library then the kernel will always be 'in sync' with the library (there's no library to install on-disk - it would be offered by the kernel) and we could use integration techniques like the vDSO uses today. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [F.A.Q.] the advantages of a shared tool/kernel Git repository, tools/perf/ and tools/kvm/
* Alexander Graf ag...@suse.de wrote: [...] Outside of the kernel tree, you can do your own decisions. If someone thinks it's a great idea to write device emulation in python (I would love that!), he could go in and implement it without having to worry about Linus possibly rejecting it because it's out of scope for a Linux kernel testing tool. If you want to create the greatest GUI for virtualization the world has ever seen, you can just do it! Nothing holds you back. We actually recently added Python bindings to event tracing in perf: earth5:~/tip find tools/perf/ -name '*.py' tools/perf/python/twatch.py tools/perf/util/setup.py tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Core.py tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/SchedGui.py tools/perf/scripts/python/syscall-counts.py tools/perf/scripts/python/sctop.py tools/perf/scripts/python/sched-migration.py tools/perf/scripts/python/check-perf-trace.py tools/perf/scripts/python/futex-contention.py tools/perf/scripts/python/failed-syscalls-by-pid.py tools/perf/scripts/python/net_dropmonitor.py tools/perf/scripts/python/syscall-counts-by-pid.py tools/perf/scripts/python/netdev-times.py ... and Linus did not object (so far ;-) - nor does he IMHO have many reasons to object as long as the code is sane and useful. Nor did Linus object when perf extended its scope from profiling to tracing, system monitoring, etc. While i don't talk for Linus, the only 'hard boundary' that Linus enforces and expects all maintainers to enforce that i'm aware of is don't do crazy crap. Everything else is possible as long as it's high quality and reasonable, with a good upside story that is relevant to the kernel - you can let your imagination run wild, there's no artificial barriers that i'm aware of. Anyway, i have outlined the rough consequences of a user-space project being inside the kernel repo in this post: http://lkml.org/lkml/2011/11/10/86 ... and they are definitely not trivial and easy to meet. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [F.A.Q.] the advantages of a shared tool/kernel Git repository, tools/perf/ and tools/kvm/
* Ted Ts'o ty...@mit.edu wrote: On Tue, Nov 08, 2011 at 01:55:09PM +0100, Ingo Molnar wrote: I guess you can do well with a split project as well - my main claim is that good compatibility comes *naturally* with integration. Here I have to disagree; my main worry is that integration makes it *naturally* easy for people to skip the hard work needed to keep a stable kernel/userspace interface. There's two observations i have: Firstly, how come that this has not actually happened in practice in the case of perf? Looks like the (random) version compatibility experiment i conducted yesterday should have failed spectacularly. Secondly, within the kernel we don't have a stable ABI - we don't even have stable APIs, and still it's a 15 MLOC project that is thriving. I argue that it is thriving in large part *BECAUSE* we don't have a stable API of any sort: if stuff is broken and the whole world needs to be fixed then we fix the whole world. One could even make the argument that in the special case of deeply kernel integrated tools a stable kernel/userspace interface for those special, Linux-specific ABIs is *too expensive* and results in an inferior end result. I'd really love it if people started thinking outside the box a bit. Why do people assume that *all* of the kernel project's code *has* to run in kernel mode? It's not a valid technical restriction *at all*. It has been done like this for 30 years is not a valid technical restriction. Splitting deeply kernel related tools away from the kernel was a valid decision 15 years ago due to kernel image size and similar resource considerations. Today it's less and less true and we are *actively hurting* from tools being split away from the kernel proper. Graphics, storage and user-space suspend are good examples i think of separation gone bad: and the resulting mess has cost Linux distros *the desktop market*. Think about it, the price we pay for this inferior end result is huge. ext4tools is an example of separation gone good. I think it's the exception that strengthens the rule. Why was the 2.4 to 2.6 migration so difficult? I can tell you the distro side story: mainly because the release took too long and tools broke left and right which created stop-ship situations. We had a much larger ABI cross section than we could sanely handle with the testing power we had. So we got into a negative feedback loop: the reduction in 2.3 testers further delayed the release, which moved the (independently evolving ...) tools further away from the to-be-2.6 kernel, which further reduced the effective testing. It was not a sustainable. We addressed many of the problems by shortening the release cycle to 3 months, but IMHO we have not addressed the underlying problem of lack of integration. Responsible release engineering is actually *easier* if you don't have a moving target and if you have the ability to fix stuff that breaks without being bound to an external project. Deeply kernel integrated tools could come in the initrd and could be offered by the kernel, statically linked images made available via /proc/sbin or such. We could even swap them out on demand so there's no RAM overhead. There's no technical barrier. I'd even argue that that C library is obviously something the kernel should offer as well - so klibc is the way to go and would help us further streamline this and keep Linux quality high. We could actually keep the kernel and such tools tightly integrated, reducing the compatibility matrix. The kernel would upgrade with these tools but it *already* upgrades with some user-space components like the vdso so it's not a true technical barrier. The other worry which I've mentioned, but which I haven't seen addressed, is that the even if you can use a perf from a newer kernel with an older kernel, this causes distributions a huge amount of pain, since they have to package two different kernel source packages, and only compile perf from the newer kernel source package. This leads to all sorts of confusion from a distribution packaging point of view. For example, assume that RHEL 5, which is using 2.6.32 or something like that, wants to use a newer e2fsck that does a better job fixing file system corruptions. [...] Firstly, it's not a big issue: if a tool comes with the kernel package then it's part of the regular backporting flow: if you backport a new tool to an old kernel then you do the same as if you backported a new kernel feature to an older enterprise kernel. Happens all the time, it's a technological problem with technological solutions. Enterprise distros explicitly do not support cross-distro-version package installs, so backporting will be done anyway. Secondly, i actually think that the obsession with using obsolete kernel versions is silly technologically - and it has evolved that way partly *BECAUSE* we are not integrated enough and distros fear kernel
Re: [F.A.Q.] the advantages of a shared tool/kernel Git repository, tools/perf/ and tools/kvm/
* Ted Ts'o ty...@mit.edu wrote: On Tue, Nov 08, 2011 at 07:14:57PM +0200, Anca Emanuel wrote: @Ten Ts'o: you are sponsored by something like microsoft (joking) ? Stop trolling. If you are not familiar with perf, or other tools, save your time and do some useful things. I am quite familiar with perf. A disagreement with how things are done is not trolling. Anca, Ted is not trolling me in any fashion. He is a (very successful) tool space and kernel developer and his opinion and experience about how tools should interact with the kernel project is of utmost importance. Clearly Ted thinks that filesystem tools should stay separate from the kernel repo. I agree with him that the case for filesystem tool integration is weaker than for deeply kernel integrated tools such as perf or kvm and calling him a troll is not a way to settle that honest disagreement in any case. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [F.A.Q.] the advantages of a shared tool/kernel Git repository, tools/perf/ and tools/kvm/
* John Kacur jka...@redhat.com wrote: On Tue, 8 Nov 2011, Ted Ts'o wrote: On Tue, Nov 08, 2011 at 01:55:09PM +0100, Ingo Molnar wrote: I guess you can do well with a split project as well - my main claim is that good compatibility comes *naturally* with integration. Here I have to disagree; my main worry is that integration makes it *naturally* easy for people to skip the hard work needed to keep a stable kernel/userspace interface. The other worry which I've mentioned, but which I haven't seen addressed, is that the even if you can use a perf from a newer kernel with an older kernel, this causes distributions a huge amount of pain, since they have to package two different kernel source packages, and only compile perf from the newer kernel source package. This leads to all sorts of confusion from a distribution packaging point of view. For example, assume that RHEL 5, which is using 2.6.32 or something like that, wants to use a newer e2fsck that does a better job fixing file system corruptions. If it were bundled with the kernel, then they would have to package up the v3.1 kernel sources, and have a source RPM that isn't used for building kernel sources, but just to build a newer version of e2fsck. Fortunately, they don't have to do that. They just pull down a newer version of e2fsprogs, and package, build, test, and ship that. In addition, suppose Red Hat ships a security bug fix which means a new kernel-image RPM has to be shipped. Does that mean that Red Hat has to ship new binary RPM's for any and all tools/* programs that they have packaged as separate RPM's? Or should installing a new kernel RPM also imply dropping new binaries in /usr/bin/perf, et. al? There are all sorts of packaging questions that are raised integration, and from where I sit I don't think they've been adequately solved yet. This in practice is not a big deal. There are many approaches for how the RPM can be built, but basically getting the perf source is just a matter of make perf-tar-src-pkg or friends such as make perf-tarbz2-src-pkg which will create perf-3.2.0-rc1.tar, and perf-3.2.0-rc1.tar.bz2 respectively which can be used for the src rpms. This tar ball can be used as a separate package or subpackage. Great - the 'perf is impossible for distros' was a common counter argument early in the perf project's lifetime - i'm glad it turned out to be bogus in practice. Would it further simplify distro side life if all utilities deeply related to the kernel got built together and came in a single well working package? kutils-3.2.0-rc1.rpm or such. They would always upgrade together with the kernel so there would never be any forced backporting or separate errata pressure, beyond the existing flow of -stable fixes. We do -stable fixes for tools/perf/ as well, for stability/security fixes, naturally - other tools would have to follow the regular kernel maintenance process to manage high priority fixes. Basically distros could rely on the kernel and its utilities being a coherent whole, which is expected to work together, which is maintained and built together and which, if it regresses, is handled by the regular -stable kernel regressions process with high priority. I expect it would grow one by one - it's not like we can or want to force utilities to go into the kernel proper. I'd also expect that new tools would be added initially - not existing ones moved. My question to you would rather be, would it make the life of distro release engineers gradually easier if this space grew gradually over the years, adding more and more critical tool functionality? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [F.A.Q.] the advantages of a shared tool/kernel Git repository, tools/perf/ and tools/kvm/
* Gerd Hoffmann kra...@redhat.com wrote: For reference, the default set of colors now is (from tools/perf/util/ui/browser.c): static struct ui_browser__colorset { const char *name, *fg, *bg; int colorset; } ui_browser__colorsets[] = { { .colorset = HE_COLORSET_TOP, .name = top, .fg = red, .bg = default, Bad idea IMO. Setting only one of foreground+background gives pretty much unpredictable results. My xterms have different background colors, the ones with a root shell happen to have a (dark) red background. Which results in red-on-dark-red text. Not good. I'd strongly suggest to either set both background and foreground to default or to set both to a specific color. When doing the latter make sure the colors have enougth contrast so they are readable. Indeed. What we want to have is to have a set of distinctive colors - just two (background, foreground) colors are not enough - we also need colors to highlight certain information - we need 5-6 colors for the output to be maximally expressive. Is there a canonical way to handle that while still adapting to user preferences automatically by taking background/foreground color scheme of the xterm into account? I suspect to fix the worst of the fallout we could add some logic to detect low contrast combinations (too low color distance) and fall back to the foreground/background colors in that case. Plus allowing full .perfconfig configurability of all the relevant colors, for those with special taste. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [F.A.Q.] the advantages of a shared tool/kernel Git repository, tools/perf/ and tools/kvm/
* Arnaldo Carvalho de Melo a...@redhat.com wrote: sure the colors have enougth contrast so they are readable. Problem is figuring out something that is considered a good default :-\ There will always be somebody that will complain. When doing the coding to allow using the default xterm colors I tried several of the gnome-terminal xterm profiles and all looked kinda sane for the top (hottest functions, with most hits) and medium lines, where we combine some chosen foreground color (red and green). Laziest solution would be: If the user customizes that much, could the user please customize this as well? :-) I don't think it's acceptable to output unreadable color combinations (red on dark red, etc.) in any case, so we should add some safety mechanism that detects bad color combinations and a fallback, static color scheme. I like the current way how perf top/report adapts to the xterm color scheme. I use it both on dark and white backgrounds and it's easy to mistake it for --stdio output - which is good, a good TUI should blend into the console's color scheme. So i think we should keep that and just detect the few cases where it results in something unreadable. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [F.A.Q.] the advantages of a shared tool/kernel Git repository, tools/perf/ and tools/kvm/
* Steven Rostedt rost...@goodmis.org wrote: On Tue, Nov 08, 2011 at 10:32:25AM +0100, Ingo Molnar wrote: None of the perf developers with whom i'm working complained about the shared repo so far - publicly or privately. By all means they are enjoying it and if you look at the stats and results you'll agree that they are highly productive working in that environment. Just because you brought it up. I personally find it awkward to work in the linux tools directory. Maybe this is the reason that I haven't been such a big contributor of perf. [...] Well, this is an argument with a long history we've had from the moment we started perf - i think the main underlying reason for that is that you still see perf as competition to ftrace instead of seeing perf the child of ftrace, the next version of ftrace, the next iterative step of evolution :-/ Unfortunately there's not much that i can do about that beyond telling you that you are IMHO wrong - you as the main ftrace developer thinking that it's competition is a self-fulfilling expectation. Eventually someone will do the right thing and implement 'perf trace' (there's still the tip:tmp.perf/trace2 prototype branch) and users will flock to that workflow because it's so much more intuitive in practice. From what i've seen from the short prototype experiments i've conducted it's a no-brainer superior workflow and design. [...] I only pushed ktest into the kernel tools directory because people convinced me to do so. Having it there didn't seem to bring in many other developers. [...] It was somewhat similar with perf - contributors only arrived after it went upstream, and even then with a delay of a few releases. Also, and it pains me to have to mention it, but putting a .pl script into the kernel repo is not necessarily a reciepe for attracting a lot of developers. We went to great lengths to kill the .cc perf report file in perf, to keep the programming environment familiar to kernel developers and other low level utility folks. Also, obviously a tool has to be important, interesting and has to offer a distinct edge over other tools to attract contributors. Maybe tools/testing/ktest/ does not sound that interesting? Naming also matters: i sure would have moved it to tools/ktest/, its name already suggests that it's about testing, why repeat that twice? Sounds weird. In that sense tools/kvm/ is better than perf: it has already attracted a core group of good, productive contributors despite still being an out of tree fork. The point here was that Pekka co not just clearly enjoys working on tools/kvm/ and has no trouble attracting contributors, but also *relies* on it being in the kernel tree. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [F.A.Q.] the advantages of a shared tool/kernel Git repository, tools/perf/ and tools/kvm/
* Arnaldo Carvalho de Melo a...@redhat.com wrote: Em Wed, Nov 09, 2011 at 10:30:50AM -0200, Arnaldo Carvalho de Melo escreveu: Em Wed, Nov 09, 2011 at 01:26:34PM +0100, Gerd Hoffmann escreveu: Its fully configurable as of now, what we need is a set of .perfconfigs that show how people think its better, we try it, set it as the default, leave the others in tools/perf/Documentation/perfconfig/color.examples. Yep, a set of examples works too. The colors are not fully configurable yet though. First, when switching all five colorsets to default, default there are still things which are colored (top bar, bottom bar, keys help display). Second there is no way to set terminal attributes (i.e. top = bold or selected = reverse). Ok, adding those to my TODO list. /me goes to check if http://perf.wiki.kernel.org is back working so that we can have a _public_ TODO list, perhaps it may attract more contributors :) Oops, there is one, utterly old tho ;-\ I tried changing that and adding this entry but: https://perf.wiki.kernel.org/articles/u/s/e/Special~UserLogin_94cd.html Returns: The requested URL /articles/u/s/e/Special~UserLogin_94cd.html was not found on this server. Ingo, would that G+ page be useful for that? Not sure - i think perf.wiki.kernel.org is a good place for documentation kind of information. The G+ page is more like for news items. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html