What's kvmclock's custom sched_clock for?

2016-01-06 Thread Andy Lutomirski
AFAICT KVM reliably passes a monotonic TSC through to guests, even if
the host suspends.  That's all that sched_clock needs, I think.

So why does kvmclock have a custom sched_clock?

On a related note, KVM doesn't pass the "invariant TSC" feature
through to guests on my machine even though "invtsc" is set in QEMU
and the kernel host code appears to support it.  What gives?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] x86/vdso/pvclock: Protect STABLE check with the seqcount

2016-01-04 Thread Andy Lutomirski
If the clock becomes unstable while we're reading it, we need to
bail.  We can do this by simply moving the check into the seqcount
loop.

Reported-by: Marcelo Tosatti <mtosa...@redhat.com>
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---

Marcelo, how's this?

arch/x86/entry/vdso/vclock_gettime.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 8602f06c759f..1a50e09c945b 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -126,23 +126,23 @@ static notrace cycle_t vread_pvclock(int *mode)
 *
 * On Xen, we don't appear to have that guarantee, but Xen still
 * supplies a valid seqlock using the version field.
-
+*
 * We only do pvclock vdso timing at all if
 * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
 * mean that all vCPUs have matching pvti and that the TSC is
 * synced, so we can just look at vCPU 0's pvti.
 */
 
-   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
-   *mode = VCLOCK_NONE;
-   return 0;
-   }
-
do {
version = pvti->version;
 
smp_rmb();
 
+   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
+   *mode = VCLOCK_NONE;
+   return 0;
+   }
+
tsc = rdtsc_ordered();
pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
pvti_tsc_shift = pvti->tsc_shift;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/4] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2016-01-04 Thread Andy Lutomirski
On Mon, Jan 4, 2016 at 12:26 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> On Sun, Dec 20, 2015 at 03:05:41AM -0800, Andy Lutomirski wrote:
>> From: Andy Lutomirski <l...@amacapital.net>
>>
>> The pvclock vdso code was too abstracted to understand easily and
>> excessively paranoid.  Simplify it for a huge speedup.
>>
>> This opens the door for additional simplifications, as the vdso no
>> longer accesses the pvti for any vcpu other than vcpu 0.
>>
>> Before, vclock_gettime using kvm-clock took about 45ns on my machine.
>> With this change, it takes 29ns, which is almost as fast as the pure TSC
>> implementation.
>>
>> Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
>> Signed-off-by: Andy Lutomirski <l...@amacapital.net>
>> ---
>>  arch/x86/entry/vdso/vclock_gettime.c | 81 
>> 
>>  1 file changed, 46 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
>> b/arch/x86/entry/vdso/vclock_gettime.c
>> index ca94fa649251..c325ba1bdddf 100644
>> --- a/arch/x86/entry/vdso/vclock_gettime.c
>> +++ b/arch/x86/entry/vdso/vclock_gettime.c
>> @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info 
>> *get_pvti(int cpu)
>>
>>  static notrace cycle_t vread_pvclock(int *mode)
>>  {
>> - const struct pvclock_vsyscall_time_info *pvti;
>> + const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti;
>>   cycle_t ret;
>> - u64 last;
>> - u32 version;
>> - u8 flags;
>> - unsigned cpu, cpu1;
>> -
>> + u64 tsc, pvti_tsc;
>> + u64 last, delta, pvti_system_time;
>> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>>
>>   /*
>> -  * Note: hypervisor must guarantee that:
>> -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> -  * 2. that per-CPU pvclock time info is updated if the
>> -  *underlying CPU changes.
>> -  * 3. that version is increased whenever underlying CPU
>> -  *changes.
>> +  * Note: The kernel and hypervisor must guarantee that cpu ID
>> +  * number maps 1:1 to per-CPU pvclock time info.
>> +  *
>> +  * Because the hypervisor is entirely unaware of guest userspace
>> +  * preemption, it cannot guarantee that per-CPU pvclock time
>> +  * info is updated if the underlying CPU changes or that that
>> +  * version is increased whenever underlying CPU changes.
>>*
>> +  * On KVM, we are guaranteed that pvti updates for any vCPU are
>> +  * atomic as seen by *all* vCPUs.  This is an even stronger
>> +  * guarantee than we get with a normal seqlock.
>> +  *
>> +  * On Xen, we don't appear to have that guarantee, but Xen still
>> +  * supplies a valid seqlock using the version field.
>> +
>> +  * We only do pvclock vdso timing at all if
>> +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> +  * mean that all vCPUs have matching pvti and that the TSC is
>> +  * synced, so we can just look at vCPU 0's pvti.
>>*/
>> - do {
>> - cpu = __getcpu() & VGETCPU_CPU_MASK;
>> - /* TODO: We can put vcpu id into higher bits of pvti.version.
>> -  * This will save a couple of cycles by getting rid of
>> -  * __getcpu() calls (Gleb).
>> -  */
>> -
>> - pvti = get_pvti(cpu);
>> -
>> - version = __pvclock_read_cycles(>pvti, , );
>> -
>> - /*
>> -  * Test we're still on the cpu as well as the version.
>> -  * We could have been migrated just after the first
>> -  * vgetcpu but before fetching the version, so we
>> -  * wouldn't notice a version change.
>> -  */
>> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> - } while (unlikely(cpu != cpu1 ||
>> -   (pvti->pvti.version & 1) ||
>> -   pvti->pvti.version != version));
>> -
>> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> +
>> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>>   *mode = VCLOCK_NONE;
>> + return 0;
>> + }
>> +
>> + do {
>> + version = pvti->version;
>> +
>> + /* This is also a read barrier, so we'll read version first. */
>> +  

Re: [PATCH RFC 1/3] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2015-12-29 Thread Andy Lutomirski
On Tue, Dec 29, 2015 at 4:50 AM, Joao Martins <joao.m.mart...@oracle.com> wrote:
> On 12/28/2015 11:45 PM, Andy Lutomirski wrote:
>> On Mon, Dec 28, 2015 at 1:52 PM, Joao Martins <joao.m.mart...@oracle.com> 
>> wrote:
>>> Right now there is only a pvclock_pvti_cpu0_va() which is defined on
>>> kvmclock since:
>>>
>>> commit dac16fba6fc5
>>> ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
>>>
>>> The only user of this interface so far is kvm. This commit adds a setter
>>> function for the pvti page and moves pvclock_pvti_cpu0_va to pvclock, which
>>> is a more generic place to have it; and would allow other PV clocksources
>>> to use it, such as Xen.
>>>
>>
>>> +
>>> +void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
>>> +{
>>> +   pvti_cpu0_va = pvti;
>>> +}
>>
>> IMO this either wants to be __init or wants a
>> WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)).  The latter hasn't landed in
>> -tip yet, but I think it'll land next week unless the merge window
>> opens early.
> OK, I will add those two once it lands in -tip.
>
> I had a silly mistake in this patch as I bindly ommited the parameter name to
> keep checkpatch happy, but didn't compile check when built without PARAVIRT.
> Apologies for that and will fix that also on the next version.
>
>>
>> It may pay to actually separate out the kvm-clock clocksource and
>> rename it rather than partially duplicating it, assuming the result
>> wouldn't be messy.
>>
> Not sure if I follow but I moved out pvclock_pvti_cpu0_va from kvm-clock or do
> you mean to separate out kvm-clock in it's enterity, or something else within
> kvm-clock is that is common to both (such as kvm_setup_vsyscall_timeinfo) ?

I meant literally using the same clocksource.  I don't know whether
the Xen and KVM variants are similar enough for that to make sense.

>
>> Can you CC me on the rest of the series for new versions?
>>
> Sure! Thanks for the prompt reply.
>
>> BTW, since this seems to require hypervisor changes to be useful, it
>> might make sense to rethink the interface a bit.  Are you actually
>> planning to support per-cpu pvti for this in any useful way?  If not,
>> I think that this would work a whole lot better and be considerably
>> less code if you had a single global pvti that lived in
>> hypervisor-allocated memory instead of an array that lives in guest
>> memory.  I'd be happy to discuss next week in more detail (currently
>> on vacation).
> Initially I had this series using per-cpu pvti's based on Linux 4.4 but since
> that was removed in favor of vdso using solely cpu0 pvti, then I ended up just
> registering the cpu 0 page. I don't intend to add per-cpu pvti's since it 
> would
> only be used for this case: (unless the reviewers think it should be done)
> meaning I would register pvti's for the other CPUs without having them used.
> Having a global pvti as you suggest it would get a lot simpler for the guest,
> but I guess this would only work assuming PVCLOCK_TSC_STABLE_BIT is there?
> Looking forward to discuss it next week.

Sounds good.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/3] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2015-12-28 Thread Andy Lutomirski
On Mon, Dec 28, 2015 at 1:52 PM, Joao Martins  wrote:
> Right now there is only a pvclock_pvti_cpu0_va() which is defined on
> kvmclock since:
>
> commit dac16fba6fc5
> ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
>
> The only user of this interface so far is kvm. This commit adds a setter
> function for the pvti page and moves pvclock_pvti_cpu0_va to pvclock, which
> is a more generic place to have it; and would allow other PV clocksources
> to use it, such as Xen.
>

> +
> +void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
> +{
> +   pvti_cpu0_va = pvti;
> +}

IMO this either wants to be __init or wants a
WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)).  The latter hasn't landed in
-tip yet, but I think it'll land next week unless the merge window
opens early.

It may pay to actually separate out the kvm-clock clocksource and
rename it rather than partially duplicating it, assuming the result
wouldn't be messy.

Can you CC me on the rest of the series for new versions?

BTW, since this seems to require hypervisor changes to be useful, it
might make sense to rethink the interface a bit.  Are you actually
planning to support per-cpu pvti for this in any useful way?  If not,
I think that this would work a whole lot better and be considerably
less code if you had a single global pvti that lived in
hypervisor-allocated memory instead of an array that lives in guest
memory.  I'd be happy to discuss next week in more detail (currently
on vacation).

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-23 Thread Andy Lutomirski
On Wed, Dec 23, 2015 at 11:27 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> On Mon, Dec 21, 2015 at 02:49:25PM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
>> > (busy spin is equally problematic as IPI for realtime guests).
>>
>> I disagree.  It's never been safe to call clock_gettime from an RT
>> task and expect a guarantee of real-time performance.  We could fix
>> that, but it's not even safe on non-KVM.
>
> The problem is how long the IPI (or busy spinning in case of version
> above) interrupts the vcpu.

The busy spin should be a few hundred cycles in the very worst case (a
couple of remote cache misses timed such that the guest is spinning
the whole time).  The IPI is always thousands of cycles no matter what
the guest is doing.

>
>> Sending an IPI *always* stalls the task.  Taking a lock (which is
>> effectively what this is doing) only stalls the tasks that contend for
>> the lock, which, most of the time, means that nothing stalls.
>>
>> Also, if the host disables preemption or otherwise boosts its priority
>> while version is odd, then the actual stall will be very short, in
>> contrast to an IPI-induced stall, which will be much, much longer.
>>
>> --Andy
>
> 1) The updates are rare.
> 2) There are no user complaints about the IPI mechanism.

If KVM ever starts directly propagating corrected time
(CLOCK_MONOTONIC, for example), then the updates won't be rare.

Maybe I'll try to instrument this.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-21 Thread Andy Lutomirski
On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
>> [cc: John Stultz -- maybe you have ideas on how this should best
>> integrate with the core code]
>>
>> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosa...@redhat.com> 
>> wrote:

>> > Can you write an actual proposal (with details) that accomodates the
>> > issue described at "Assuming a stable TSC across physical CPUS, and a
>> > stable TSC" ?
>> >
>> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
>> > realtime guests.
>>
>> This shouldn't require many details, and I don't think there's an ABI
>> change.  The rules are:
>>
>> When the overall system timebase changes (e.g. when the selected
>> clocksource changes or when update_pvclock_gtod is called), the KVM
>> host would:
>>
>> optionally: preempt_disable();  /* for performance */
>>
>> for all vms {
>>
>>   for all registered pvti structures {
>> pvti->version++;  /* should be odd now */
>>   }
>
> pvti is userspace data, so you have to pin it before?

Yes.

Fortunately, most systems probably only have one page of pvti
structures, I think (unless there are a ton of vcpus), so the
performance impact should be negligible.

>
>>   /* Note: right now, any vcpu that tries to access pvti will start
>> infinite looping.  We should add cpu_relax() to the guests. */
>>
>>   for all registered pvti structures {
>> update everything except pvti->version;
>>   }
>>
>>   for all registered pvti structures {
>> pvti->version++;  /* should be even now */
>>   }
>>
>>   cond_resched();
>> }
>>
>> Is this enough detail?  This should work with all existing guests,
>> too, unless there's a buggy guest out there that actually fails to
>> double-check version.
>
> What is the advantage of this over the brute force method, given
> that guests will busy spin?
>
> (busy spin is equally problematic as IPI for realtime guests).

I disagree.  It's never been safe to call clock_gettime from an RT
task and expect a guarantee of real-time performance.  We could fix
that, but it's not even safe on non-KVM.

Sending an IPI *always* stalls the task.  Taking a lock (which is
effectively what this is doing) only stalls the tasks that contend for
the lock, which, most of the time, means that nothing stalls.

Also, if the host disables preemption or otherwise boosts its priority
while version is odd, then the actual stall will be very short, in
contrast to an IPI-induced stall, which will be much, much longer.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/4] x86/vdso: Remove pvclock fixmap machinery

2015-12-20 Thread Andy Lutomirski
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 arch/x86/entry/vdso/vclock_gettime.c |  1 -
 arch/x86/entry/vdso/vma.c|  1 +
 arch/x86/include/asm/fixmap.h|  5 -
 arch/x86/include/asm/pvclock.h   |  5 -
 arch/x86/kernel/kvmclock.c   |  6 --
 arch/x86/kernel/pvclock.c| 24 
 6 files changed, 1 insertion(+), 41 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 5dd363d54348..59a98c25bde7 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -45,7 +45,6 @@ extern u8 pvclock_page
 
 #include 
 #include 
-#include 
 #include 
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index aa828191c654..b8f69e264ac4 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index f80d70009ff8..6d7d0e52ed5a 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 #ifdef CONFIG_X86_32
 #include 
 #include 
@@ -72,10 +71,6 @@ enum fixed_addresses {
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
 #endif
-#ifdef CONFIG_PARAVIRT_CLOCK
-   PVCLOCK_FIXMAP_BEGIN,
-   PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
-#endif
 #endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 571dad355bbc..fdcc04020636 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info {
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
-#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
-int size);
-struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
 
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ec1b06dc82d2..72cef58693c7 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 {
 #ifdef CONFIG_X86_64
int cpu;
-   int ret;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
unsigned int size;
@@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
return 1;
}
 
-   if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
-   put_cpu();
-   return ret;
-   }
-
put_cpu();
 
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 2f355d229a58..99bfc025111d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
*wall_clock,
 
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
-
-#ifdef CONFIG_X86_64
-/*
- * Initialize the generic pvclock vsyscall state.  This will allocate
- * a/some page(s) for the per-vcpu pvclock information, set up a
- * fixmap mapping for the page(s)
- */
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
-int size)
-{
-   int idx;
-
-   WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
-
-   for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
-   __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
-__pa(i) + (idx*PAGE_SIZE),
-PAGE_KERNEL_VVAR);
-   }
-
-   return 0;
-}
-#endif
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/4] x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap

2015-12-20 Thread Andy Lutomirski
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 arch/x86/entry/vdso/vclock_gettime.c  | 20 
 arch/x86/entry/vdso/vdso-layout.lds.S |  3 ++-
 arch/x86/entry/vdso/vdso2c.c  |  3 +++
 arch/x86/entry/vdso/vma.c | 13 +
 arch/x86/include/asm/pvclock.h|  9 +
 arch/x86/include/asm/vdso.h   |  1 +
 arch/x86/kernel/kvmclock.c|  5 +
 7 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index c325ba1bdddf..5dd363d54348 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void)
 }
 #endif
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+extern u8 pvclock_page
+   __attribute__((visibility("hidden")));
+#endif
+
 #ifndef BUILD_VDSO32
 
 #include 
@@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, 
struct timezone *tz)
 
 #ifdef CONFIG_PARAVIRT_CLOCK
 
-static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
-   const struct pvclock_vsyscall_time_info *pvti_base;
-   int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
-   int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
-
-   BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
-
-   pvti_base = (struct pvclock_vsyscall_time_info *)
-   __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
-
-   return _base[offset];
+   return (const struct pvclock_vsyscall_time_info *)_page;
 }
 
 static notrace cycle_t vread_pvclock(int *mode)
 {
-   const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti;
+   const struct pvclock_vcpu_time_info *pvti = _pvti0()->pvti;
cycle_t ret;
u64 tsc, pvti_tsc;
u64 last, delta, pvti_system_time;
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S 
b/arch/x86/entry/vdso/vdso-layout.lds.S
index de2c921025f5..4158acc17df0 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -25,7 +25,7 @@ SECTIONS
 * segment.
 */
 
-   vvar_start = . - 2 * PAGE_SIZE;
+   vvar_start = . - 3 * PAGE_SIZE;
vvar_page = vvar_start;
 
/* Place all vvars at the offsets in asm/vvar.h. */
@@ -36,6 +36,7 @@ SECTIONS
 #undef EMIT_VVAR
 
hpet_page = vvar_start + PAGE_SIZE;
+   pvclock_page = vvar_start + 2 * PAGE_SIZE;
 
. = SIZEOF_HEADERS;
 
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 785d9922b106..491020b2826d 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -73,6 +73,7 @@ enum {
sym_vvar_start,
sym_vvar_page,
sym_hpet_page,
+   sym_pvclock_page,
sym_VDSO_FAKE_SECTION_TABLE_START,
sym_VDSO_FAKE_SECTION_TABLE_END,
 };
@@ -80,6 +81,7 @@ enum {
 const int special_pages[] = {
sym_vvar_page,
sym_hpet_page,
+   sym_pvclock_page,
 };
 
 struct vdso_sym {
@@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = {
[sym_vvar_start] = {"vvar_start", true},
[sym_vvar_page] = {"vvar_page", true},
[sym_hpet_page] = {"hpet_page", true},
+   [sym_pvclock_page] = {"pvclock_page", true},
[sym_VDSO_FAKE_SECTION_TABLE_START] = {
"VDSO_FAKE_SECTION_TABLE_START", false
},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 64df47148160..aa828191c654 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool 
calculate_addr)
.name = "[vvar]",
.pages = no_pages,
};
+   struct pvclock_vsyscall_time_info *pvti;
 
if (calculate_addr) {
addr = vdso_addr(current->mm->start_stack,
@@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool 
calculate_addr)
}
 #endif
 
+   pvti = pvclock_pvti_cpu0_va();
+   if (pvti && image->sym_pvclock_page) {
+   ret = remap_pfn_range(vma,
+ text_start + image->sym_pvclock_page,
+ __pa(pvti) >> PAGE_SHIFT,
+ PAGE_SIZE,
+ PAGE_READONLY);
+
+   if (ret)
+   goto up_fail;
+   }
+
 up_fail:
if (ret)
current->mm->context.vdso = NULL;
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 7a6bed5c08bc..571dad355bbc 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,6 +4,15 @@
 #include 
 #include 
 
+

[PATCH v2 0/4] x86: KVM vdso and clock improvements

2015-12-20 Thread Andy Lutomirski
x86: KVM vdso and clock improvements

NB: patch 1 doesn't really belong here, but it makes this a lot
easier for me to test.  Patch 1, if it's okay at all, should go
though the kvm tree.  The rest should probably go through
tip:x86/vdso once they're reviewed.

I'll do a followup to enable vdso pvclock on 32-bit guests.
I'm not currently set up to test it.  (The KVM people could also
do it very easily on top of these patches.)

Changes from v1:
 - Dropped patch 1
 - Added Paolo's review and acks
 - Fixed a build issue on some configs

Andy Lutomirski (4):
  x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
  x86/vdso: Remove pvclock fixmap machinery
  x86/vdso: Enable vdso pvclock access on all vdso variants

 arch/x86/entry/vdso/vclock_gettime.c  | 151 --
 arch/x86/entry/vdso/vdso-layout.lds.S |   3 +-
 arch/x86/entry/vdso/vdso2c.c  |   3 +
 arch/x86/entry/vdso/vma.c |  14 
 arch/x86/include/asm/fixmap.h |   5 --
 arch/x86/include/asm/pvclock.h|  14 ++--
 arch/x86/include/asm/vdso.h   |   1 +
 arch/x86/kernel/kvmclock.c|  11 ++-
 arch/x86/kernel/pvclock.c |  24 --
 9 files changed, 107 insertions(+), 119 deletions(-)

-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/4] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-12-20 Thread Andy Lutomirski
From: Andy Lutomirski <l...@amacapital.net>

The pvclock vdso code was too abstracted to understand easily and
excessively paranoid.  Simplify it for a huge speedup.

This opens the door for additional simplifications, as the vdso no
longer accesses the pvti for any vcpu other than vcpu 0.

Before, vclock_gettime using kvm-clock took about 45ns on my machine.
With this change, it takes 29ns, which is almost as fast as the pure TSC
implementation.

Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Andy Lutomirski <l...@amacapital.net>
---
 arch/x86/entry/vdso/vclock_gettime.c | 81 
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index ca94fa649251..c325ba1bdddf 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info 
*get_pvti(int cpu)
 
 static notrace cycle_t vread_pvclock(int *mode)
 {
-   const struct pvclock_vsyscall_time_info *pvti;
+   const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti;
cycle_t ret;
-   u64 last;
-   u32 version;
-   u8 flags;
-   unsigned cpu, cpu1;
-
+   u64 tsc, pvti_tsc;
+   u64 last, delta, pvti_system_time;
+   u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
 
/*
-* Note: hypervisor must guarantee that:
-* 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
-* 2. that per-CPU pvclock time info is updated if the
-*underlying CPU changes.
-* 3. that version is increased whenever underlying CPU
-*changes.
+* Note: The kernel and hypervisor must guarantee that cpu ID
+* number maps 1:1 to per-CPU pvclock time info.
+*
+* Because the hypervisor is entirely unaware of guest userspace
+* preemption, it cannot guarantee that per-CPU pvclock time
+* info is updated if the underlying CPU changes or that that
+* version is increased whenever underlying CPU changes.
 *
+* On KVM, we are guaranteed that pvti updates for any vCPU are
+* atomic as seen by *all* vCPUs.  This is an even stronger
+* guarantee than we get with a normal seqlock.
+*
+* On Xen, we don't appear to have that guarantee, but Xen still
+* supplies a valid seqlock using the version field.
+
+* We only do pvclock vdso timing at all if
+* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
+* mean that all vCPUs have matching pvti and that the TSC is
+* synced, so we can just look at vCPU 0's pvti.
 */
-   do {
-   cpu = __getcpu() & VGETCPU_CPU_MASK;
-   /* TODO: We can put vcpu id into higher bits of pvti.version.
-* This will save a couple of cycles by getting rid of
-* __getcpu() calls (Gleb).
-*/
-
-   pvti = get_pvti(cpu);
-
-   version = __pvclock_read_cycles(>pvti, , );
-
-   /*
-* Test we're still on the cpu as well as the version.
-* We could have been migrated just after the first
-* vgetcpu but before fetching the version, so we
-* wouldn't notice a version change.
-*/
-   cpu1 = __getcpu() & VGETCPU_CPU_MASK;
-   } while (unlikely(cpu != cpu1 ||
- (pvti->pvti.version & 1) ||
- pvti->pvti.version != version));
-
-   if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+
+   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
+   return 0;
+   }
+
+   do {
+   version = pvti->version;
+
+   /* This is also a read barrier, so we'll read version first. */
+   tsc = rdtsc_ordered();
+
+   pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
+   pvti_tsc_shift = pvti->tsc_shift;
+   pvti_system_time = pvti->system_time;
+   pvti_tsc = pvti->tsc_timestamp;
+
+   /* Make sure that the version double-check is last. */
+   smp_rmb();
+   } while (unlikely((version & 1) || version != pvti->version));
+
+   delta = tsc - pvti_tsc;
+   ret = pvti_system_time +
+   pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
+   pvti_tsc_shift);
 
/* refer to tsc.c read_tsc() comment for rationale */
last = gtod->cycle_last;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 4/4] x86/vdso: Enable vdso pvclock access on all vdso variants

2015-12-20 Thread Andy Lutomirski
Now that pvclock doesn't require access to the fixmap, all vdso
variants can use it.

The kernel side isn't wired up for 32-bit kernels yet, but this
covers 32-bit and x32 userspace on 64-bit kernels.

Acked-by: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 arch/x86/entry/vdso/vclock_gettime.c | 91 
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 59a98c25bde7..8602f06c759f 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -17,8 +17,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 
 #define gtod ((vsyscall_gtod_data))
 
@@ -43,10 +45,6 @@ extern u8 pvclock_page
 
 #ifndef BUILD_VDSO32
 
-#include 
-#include 
-#include 
-
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
long ret;
@@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, 
struct timezone *tz)
return ret;
 }
 
-#ifdef CONFIG_PARAVIRT_CLOCK
 
+#else
+
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+{
+   long ret;
+
+   asm(
+   "mov %%ebx, %%edx \n"
+   "mov %2, %%ebx \n"
+   "call __kernel_vsyscall \n"
+   "mov %%edx, %%ebx \n"
+   : "=a" (ret)
+   : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
+   : "memory", "edx");
+   return ret;
+}
+
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+   long ret;
+
+   asm(
+   "mov %%ebx, %%edx \n"
+   "mov %2, %%ebx \n"
+   "call __kernel_vsyscall \n"
+   "mov %%edx, %%ebx \n"
+   : "=a" (ret)
+   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
+   : "memory", "edx");
+   return ret;
+}
+
+#endif
+
+#ifdef CONFIG_PARAVIRT_CLOCK
 static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
return (const struct pvclock_vsyscall_time_info *)_page;
@@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode)
do {
version = pvti->version;
 
-   /* This is also a read barrier, so we'll read version first. */
-   tsc = rdtsc_ordered();
+   smp_rmb();
 
+   tsc = rdtsc_ordered();
pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
pvti_tsc_shift = pvti->tsc_shift;
pvti_system_time = pvti->system_time;
@@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode)
pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
pvti_tsc_shift);
 
-   /* refer to tsc.c read_tsc() comment for rationale */
+   /* refer to vread_tsc() comment for rationale */
last = gtod->cycle_last;
 
if (likely(ret >= last))
@@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode)
 }
 #endif
 
-#else
-
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
-   : "memory", "edx");
-   return ret;
-}
-
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-   : "memory", "edx");
-   return ret;
-}
-
-#ifdef CONFIG_PARAVIRT_CLOCK
-
-static notrace cycle_t vread_pvclock(int *mode)
-{
-   *mode = VCLOCK_NONE;
-   return 0;
-}
-#endif
-
-#endif
-
 notrace static cycle_t vread_tsc(void)
 {
cycle_t ret = (cycle_t)rdtsc_ordered();
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-18 Thread Andy Lutomirski
[cc: John Stultz -- maybe you have ideas on how this should best
integrate with the core code]

On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
>> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
>> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosa...@redhat.com> 
>> >> wrote:
>> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosa...@redhat.com> 
>> >> >> wrote:
>> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski 
>> >> >> >> <l...@amacapital.net> wrote:
>> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini 
>> >> >> >> > <pbonz...@redhat.com> wrote:
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >> >> >> >>> > RAW TSC NTP corrected TSC
>> >> >> >> >>> > t0  10  10
>> >> >> >> >>> > t1  20  19.99
>> >> >> >> >>> > t2  30  29.98
>> >> >> >> >>> > t3  40  39.97
>> >> >> >> >>> > t4  50  49.96
>> >> >
>> >> > (1)
>> >> >
>> >> >> >> >>> >
>> >> >> >> >>> > ...
>> >> >> >> >>> >
>> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >> >> >> >>> > you can see what will happen.
>> >> >> >> >>>
>> >> >> >> >>> Sure, but why would you ever switch from one to the other?
>> >> >> >> >>
>> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  
>> >> >> >> >> After
>> >> >> >> >> resume, the TSC certainly increases at the same rate as before, 
>> >> >> >> >> but the
>> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased 
>> >> >> >> >> slower
>> >> >> >> >> than the guest kvmclock.
>> >> >> >> >
>> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >> >> >> >
>> >> >> >> > If it's the host's, then wouldn't systemtime be reset after 
>> >> >> >> > resume to
>> >> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>> >> >> >> > backwards.
>> >> >> >> >
>> >> >> >> > If it's the guest's, then the guest's NTP correction is applied 
>> >> >> >> > on top
>> >> >> >> > of kvmclock, and this shouldn't matter.
>> >> >> >> >
>> >> >> >> > I still feel like I'm missing something very basic here.
>> >> >> >> >
>> >> >> >>
>> >> >> >> OK, I think I get it.
>> >> >> >>
>> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the 
>> >> >> >> host's
>> >> >> >> correction to the guest.  If it did, indeed, propagate the 
>> >> >> >> correction
>> >> >> >> then, after resume, the host's new system_time would match the 
>> >> >> >> guest's
>> >> >> >> idea of it (after accounting for the guest's long nap), and I don't
>> >> >> >> think there would be a problem.
>> >> >> >> That being said, I can't find the code in the masterclock stuff that
>> >> >> >> would actually do this.
>> >> >> &

Re: kvmclock doesn't work, help?

2015-12-18 Thread Andy Lutomirski
On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
>> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosa...@redhat.com> 
>> wrote:
>> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosa...@redhat.com> 
>> >> wrote:
>> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <l...@amacapital.net> 
>> >> >> wrote:
>> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini 
>> >> >> > <pbonz...@redhat.com> wrote:
>> >> >> >>
>> >> >> >>
>> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >> >> >>> > RAW TSC NTP corrected TSC
>> >> >> >>> > t0  10  10
>> >> >> >>> > t1  20  19.99
>> >> >> >>> > t2  30  29.98
>> >> >> >>> > t3  40  39.97
>> >> >> >>> > t4  50  49.96
>> >
>> > (1)
>> >
>> >> >> >>> >
>> >> >> >>> > ...
>> >> >> >>> >
>> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >> >> >>> > you can see what will happen.
>> >> >> >>>
>> >> >> >>> Sure, but why would you ever switch from one to the other?
>> >> >> >>
>> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> >> >> resume, the TSC certainly increases at the same rate as before, but 
>> >> >> >> the
>> >> >> >> raw TSC restarted counting from 0 and systemtime has increased 
>> >> >> >> slower
>> >> >> >> than the guest kvmclock.
>> >> >> >
>> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >> >> >
>> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
>> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>> >> >> > backwards.
>> >> >> >
>> >> >> > If it's the guest's, then the guest's NTP correction is applied on 
>> >> >> > top
>> >> >> > of kvmclock, and this shouldn't matter.
>> >> >> >
>> >> >> > I still feel like I'm missing something very basic here.
>> >> >> >
>> >> >>
>> >> >> OK, I think I get it.
>> >> >>
>> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> >> >> correction to the guest.  If it did, indeed, propagate the correction
>> >> >> then, after resume, the host's new system_time would match the guest's
>> >> >> idea of it (after accounting for the guest's long nap), and I don't
>> >> >> think there would be a problem.
>> >> >> That being said, I can't find the code in the masterclock stuff that
>> >> >> would actually do this.
>> >> >
>> >> > Guest clock is maintained by guest timekeeping code, which does:
>> >> >
>> >> > timer_interrupt()
>> >> > offset = read clocksource since last timer interrupt
>> >> > accumulate_to_systemclock(offset)
>> >> >
>> >> > The frequency correction of NTP in the host can be applied to
>> >> > kvmclock, which will be visible to the guest
>> >> > at "read clocksource since last timer interrupt"
>> >> > (kvmclock_clocksource_read function).
>> >>
>> >> pvclock_clocksource_read?  That seems to do the same thing as all the
>> >> other clocksource access functions.
>> >>
>> >> >
>> >> > This does not mean that the NTP correction in the host is propagated
>> >> > to the guests system clock directly.
>> >> &

Re: kvmclock doesn't work, help?

2015-12-17 Thread Andy Lutomirski
On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <l...@amacapital.net> wrote:
>> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonz...@redhat.com> 
>> > wrote:
>> >>
>> >>
>> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >>> > RAW TSC NTP corrected TSC
>> >>> > t0  10  10
>> >>> > t1  20  19.99
>> >>> > t2  30  29.98
>> >>> > t3  40  39.97
>> >>> > t4  50  49.96
>> >>> >
>> >>> > ...
>> >>> >
>> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >>> > you can see what will happen.
>> >>>
>> >>> Sure, but why would you ever switch from one to the other?
>> >>
>> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> resume, the TSC certainly increases at the same rate as before, but the
>> >> raw TSC restarted counting from 0 and systemtime has increased slower
>> >> than the guest kvmclock.
>> >
>> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >
>> > If it's the host's, then wouldn't systemtime be reset after resume to
>> > the NTP corrected value?  If so, the guest wouldn't see time go
>> > backwards.
>> >
>> > If it's the guest's, then the guest's NTP correction is applied on top
>> > of kvmclock, and this shouldn't matter.
>> >
>> > I still feel like I'm missing something very basic here.
>> >
>>
>> OK, I think I get it.
>>
>> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> correction to the guest.  If it did, indeed, propagate the correction
>> then, after resume, the host's new system_time would match the guest's
>> idea of it (after accounting for the guest's long nap), and I don't
>> think there would be a problem.
>> That being said, I can't find the code in the masterclock stuff that
>> would actually do this.
>
> Guest clock is maintained by guest timekeeping code, which does:
>
> timer_interrupt()
> offset = read clocksource since last timer interrupt
> accumulate_to_systemclock(offset)
>
> The frequency correction of NTP in the host can be applied to
> kvmclock, which will be visible to the guest
> at "read clocksource since last timer interrupt"
> (kvmclock_clocksource_read function).

pvclock_clocksource_read?  That seems to do the same thing as all the
other clocksource access functions.

>
> This does not mean that the NTP correction in the host is propagated
> to the guests system clock directly.
>
> (For example, the guest can run NTP which is free to do further
> adjustments at "accumulate_to_systemclock(offset)" time).

Of course.  But I expected that, in the absence of NTP on the guest,
that the guest would track the host's *corrected* time.

>
>> If, on the other hand, the host's NTP correction is not supposed to
>> propagate to the guest,
>
> This is optional. There is a module option to control this, in fact.
>
> Its nice to have, because then you can execute a guest without NTP
> (say without network connection), and have a kvmclock (kvmclock is a
> clocksource, not a guest system clock) which is NTP corrected.

Can you point to how this works?  I found kvm_guest_time_update, whch
is called under circumstances that I haven't untangled.  I can't
really tell what it's trying to do.

In any case, this still seems much more convoluted than it has to be.
In the case in which the host has a stable TSC (tsc is selected in the
core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
the time on the last few generations of CPUs, then the core
timekeeping code is already exposing a linear function that's supposed
to be used for monotonic, cpu-local access to a corrected nanosecond
counter.  It's even in pretty much exactly the right form to pass
through to the guest via pvclock in the gtod data.  Why doesn't KVM
pass it through verbatim, updated in real time?  Is there some legacy
reason that KVM must apply its own corrections and has to jump through
hoops to pause vcpus when updating those vcpu's copies of the pvclock
data?

>
>> then shouldn't KVM just update system_time on
>> resume to whatever the guest would think it had (which

Re: kvmclock doesn't work, help?

2015-12-17 Thread Andy Lutomirski
On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
>> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <l...@amacapital.net> 
>> >> wrote:
>> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonz...@redhat.com> 
>> >> > wrote:
>> >> >>
>> >> >>
>> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >> >>> > RAW TSC NTP corrected TSC
>> >> >>> > t0  10  10
>> >> >>> > t1  20  19.99
>> >> >>> > t2  30  29.98
>> >> >>> > t3  40  39.97
>> >> >>> > t4  50  49.96
>
> (1)
>
>> >> >>> >
>> >> >>> > ...
>> >> >>> >
>> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >> >>> > you can see what will happen.
>> >> >>>
>> >> >>> Sure, but why would you ever switch from one to the other?
>> >> >>
>> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> >> resume, the TSC certainly increases at the same rate as before, but the
>> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
>> >> >> than the guest kvmclock.
>> >> >
>> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >> >
>> >> > If it's the host's, then wouldn't systemtime be reset after resume to
>> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>> >> > backwards.
>> >> >
>> >> > If it's the guest's, then the guest's NTP correction is applied on top
>> >> > of kvmclock, and this shouldn't matter.
>> >> >
>> >> > I still feel like I'm missing something very basic here.
>> >> >
>> >>
>> >> OK, I think I get it.
>> >>
>> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> >> correction to the guest.  If it did, indeed, propagate the correction
>> >> then, after resume, the host's new system_time would match the guest's
>> >> idea of it (after accounting for the guest's long nap), and I don't
>> >> think there would be a problem.
>> >> That being said, I can't find the code in the masterclock stuff that
>> >> would actually do this.
>> >
>> > Guest clock is maintained by guest timekeeping code, which does:
>> >
>> > timer_interrupt()
>> > offset = read clocksource since last timer interrupt
>> > accumulate_to_systemclock(offset)
>> >
>> > The frequency correction of NTP in the host can be applied to
>> > kvmclock, which will be visible to the guest
>> > at "read clocksource since last timer interrupt"
>> > (kvmclock_clocksource_read function).
>>
>> pvclock_clocksource_read?  That seems to do the same thing as all the
>> other clocksource access functions.
>>
>> >
>> > This does not mean that the NTP correction in the host is propagated
>> > to the guests system clock directly.
>> >
>> > (For example, the guest can run NTP which is free to do further
>> > adjustments at "accumulate_to_systemclock(offset)" time).
>>
>> Of course.  But I expected that, in the absence of NTP on the guest,
>> that the guest would track the host's *corrected* time.
>>
>> >
>> >> If, on the other hand, the host's NTP correction is not supposed to
>> >> propagate to the guest,
>> >
>> > This is optional. There is a module option to control this, in fact.
>> >
>> > Its nice to have, because then you can execute a guest without NTP
>> > (say without network connection), and have a kvmclock (kvmclock is a
>> > clocksource, not a guest system clock) which is NTP corrected.
>>
>> Can you point to how this works?  I found kvm_guest_time_update, whch
>> is called under circumstances that I haven

Re: kvmclock doesn't work, help?

2015-12-16 Thread Andy Lutomirski
On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> > RAW TSC NTP corrected TSC
>> > t0  10  10
>> > t1  20  19.99
>> > t2  30  29.98
>> > t3  40  39.97
>> > t4  50  49.96
>> >
>> > ...
>> >
>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> > you can see what will happen.
>>
>> Sure, but why would you ever switch from one to the other?
>
> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> resume, the TSC certainly increases at the same rate as before, but the
> raw TSC restarted counting from 0 and systemtime has increased slower
> than the guest kvmclock.

Wait, are we talking about the host's NTP or the guest's NTP?

If it's the host's, then wouldn't systemtime be reset after resume to
the NTP corrected value?  If so, the guest wouldn't see time go
backwards.

If it's the guest's, then the guest's NTP correction is applied on top
of kvmclock, and this shouldn't matter.

I still feel like I'm missing something very basic here.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-16 Thread Andy Lutomirski
On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>
>>
>> On 14/12/2015 23:31, Andy Lutomirski wrote:
>>> > RAW TSC NTP corrected TSC
>>> > t0  10  10
>>> > t1  20  19.99
>>> > t2  30  29.98
>>> > t3  40  39.97
>>> > t4  50  49.96
>>> >
>>> > ...
>>> >
>>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>>> > you can see what will happen.
>>>
>>> Sure, but why would you ever switch from one to the other?
>>
>> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> resume, the TSC certainly increases at the same rate as before, but the
>> raw TSC restarted counting from 0 and systemtime has increased slower
>> than the guest kvmclock.
>
> Wait, are we talking about the host's NTP or the guest's NTP?
>
> If it's the host's, then wouldn't systemtime be reset after resume to
> the NTP corrected value?  If so, the guest wouldn't see time go
> backwards.
>
> If it's the guest's, then the guest's NTP correction is applied on top
> of kvmclock, and this shouldn't matter.
>
> I still feel like I'm missing something very basic here.
>

OK, I think I get it.

Marcelo, I thought that kvmclock was supposed to propagate the host's
correction to the guest.  If it did, indeed, propagate the correction
then, after resume, the host's new system_time would match the guest's
idea of it (after accounting for the guest's long nap), and I don't
think there would be a problem.

That being said, I can't find the code in the masterclock stuff that
would actually do this.

If, on the other hand, the host's NTP correction is not supposed to
propagate to the guest, then shouldn't KVM just update system_time on
resume to whatever the guest would think it had (which I think would
be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
shifted by some per-guest constant offset).

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-14 Thread Andy Lutomirski
On Fri, Dec 11, 2015 at 3:48 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote:
>> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
>> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
>> >> I'm trying to clean up kvmclock and I can't get it to work at all.  My
>> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
>> >>
>> >> If I boot an SMP (2 vcpus) guest, tracing says:
>> >>
>> >>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 0
>> >>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
>> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
>> >>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
>> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
>> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
>> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
>> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
>> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>
>> >>
>> >> If I boot a UP guest, tracing says:
>> >>
>> >>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>
>> >> I suspect, but I haven't verified, that this is fallout from:
>> >>
>> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
>> >> Author: Marcelo Tosatti <mtosa...@redhat.com>
>> >> Date:   Wed May 14 12:43:24 2014 -0300
>> >>
>> >> KVM: x86: disable master clock if TSC is reset during suspend
>> >>
>> >> Updating system_time from the kernel clock once master clock
>> >> has been enabled can result in time backwards event, in case
>> >> kernel clock frequency is lower than TSC frequency.
>> >>
>> >> Disable master clock in case it is necessary to update it
>> >> from the resume path.
>> >>
>> >> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
>> >> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>> >>
>> >>
>> >> Can we please stop making kvmclock more complex?  It's a beast right
>> >> now, and not in a good way.  It's far too tangled with the vclock
>> >> machinery on both the host and guest sides, the pvclock stuff is not
>> >> well thought out (even in principle in an ABI sense), and it's never
>> >> been clear to my what problem exactly the kvmclock stuff is supposed
>> >> to solve.
>> >>
>> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>> >> start over.  A correctly functioning KVM guest using TSC (i.e.
>> >> ignoring kvmclock entirely)
>> >> seems to work rather more reliably and
>> >> considerably faster than a kvmclock guest.
>> >>
>> >> --Andy
>> >>
>> >> --
>> >> Andy Lutomirski
>> >> AMA Capital Management, LLC
>> >
>> > Andy,
>> >
>> > I am all for solving practical problems rather than pleasing aesthetic
>> > pleasure.
>> >
>> >> Updating system_time from the kernel clock once master clock
>> >> has been enabled can result in time backwards event, in case
>> >> kernel clock frequency is lower than TSC frequency.
>> >&

Re: kvmclock doesn't work, help?

2015-12-14 Thread Andy Lutomirski
On Mon, Dec 14, 2015 at 2:00 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote:
>>
>>
>> On 11/12/2015 22:57, Andy Lutomirski wrote:
>> > I'm still not seeing the issue.
>> >
>> > The formula is:
>> >
>> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
>> > pvti->tsc_shift) + pvti->system_time
>> >
>> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value
>> > after suspend/resume, you would also need to update system_time.
>> >
>> > I don't see what this has to do with suspend/resume or with whether
>> > the effective scale factor is greater than or less than one.  The only
>> > suspend/resume interaction I can see is that, if the host allows the
>> > guest-observed TSC value to jump (which is arguably a bug, what that's
>> > not important here), it needs to update pvti before resuming the
>> > guest.
>>
>> Which is not an issue, since freezing obviously gets all CPUs out of
>> guest mode.
>>
>> Marcelo, can you provide an example with made-up values for tsc and pvti?
>
> I meant "systemtime" at ^.
>
> guest visible clock = systemtime (updated at time 0, guest initialization) + 
> scaled tsc reads=LARGE VALUE.
>   ^^
> guest reads clock to memory at location A = scaled tsc read.
> -> suspend resume event
> guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc 
> reads=0.
> guest reads clock to memory at location B.
>
> So before the suspend/resume event, the clock is the RAW TSC values
> (scaled by kvmclock, but the frequency of the RAW TSC).
>
> After suspend/resume event, the clock is updated from the host
> via get_kernel_ns(), which reads the corrected NTP frequency TSC.
>
> So you switch the timebase, from a clock running at a given frequency,
> to a clock running at another frequency (effective frequency).
>
> Example:
>
> RAW TSC NTP corrected TSC
> t0  10  10
> t1  20  19.99
> t2  30  29.98
> t3  40  39.97
> t4  50  49.96
>
> ...
>
> if you suddenly switch from RAW TSC to NTP corrected TSC,
> you can see what will happen.

Sure, but why would you ever switch from one to the other?  I'm still
not seeing the scenario under which this discontinuity is visible to
anything other than the kvmclock code itself.

The only things that need to be monotonic are the output from
vread_pvclock and the in-kernel equivalent, I think.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-11 Thread Andy Lutomirski
On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
>> I'm trying to clean up kvmclock and I can't get it to work at all.  My
>> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
>>
>> If I boot an SMP (2 vcpus) guest, tracing says:
>>
>>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 0
>>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
>> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
>>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
>> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
>> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
>> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
>> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
>> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 1
>>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 1
>>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 1
>>
>>
>> If I boot a UP guest, tracing says:
>>
>>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 1
>>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 1
>>
>> I suspect, but I haven't verified, that this is fallout from:
>>
>> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
>> Author: Marcelo Tosatti <mtosa...@redhat.com>
>> Date:   Wed May 14 12:43:24 2014 -0300
>>
>> KVM: x86: disable master clock if TSC is reset during suspend
>>
>> Updating system_time from the kernel clock once master clock
>> has been enabled can result in time backwards event, in case
>> kernel clock frequency is lower than TSC frequency.
>>
>> Disable master clock in case it is necessary to update it
>> from the resume path.
>>
>> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>>
>>
>> Can we please stop making kvmclock more complex?  It's a beast right
>> now, and not in a good way.  It's far too tangled with the vclock
>> machinery on both the host and guest sides, the pvclock stuff is not
>> well thought out (even in principle in an ABI sense), and it's never
>> been clear to my what problem exactly the kvmclock stuff is supposed
>> to solve.
>>
>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>> start over.  A correctly functioning KVM guest using TSC (i.e.
>> ignoring kvmclock entirely)
>> seems to work rather more reliably and
>> considerably faster than a kvmclock guest.
>>
>> --Andy
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
>
> Andy,
>
> I am all for solving practical problems rather than pleasing aesthetic
> pleasure.
>
>> Updating system_time from the kernel clock once master clock
>> has been enabled can result in time backwards event, in case
>> kernel clock frequency is lower than TSC frequency.
>>
>> Disable master clock in case it is necessary to update it
>> from the resume path.
>
>> once master clock
>> has been enabled can result in time backwards event, in case
>> kernel clock frequency is lower than TSC frequency.
>
> guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.
>
> If the effective frequency of the kernel clock is lower (for example
> due to NTP correcting the TSC frequency of the system), and you resume
> and update the system, the following happens:
>
> guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc 
> reads=LARGE VALUE.
> suspend/resume event.
> guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.
>

I'm still not seeing the issue.

The formula is:

(((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) 

Re: [PATCH] x86/platform/uv: Include clocksource.h for clocksource_touch_watchdog()

2015-12-11 Thread Andy Lutomirski
On Fri, Dec 11, 2015 at 12:06 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> * Andy Lutomirski <l...@kernel.org> wrote:
>
>> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
>> index f80d70009ff8..6d7d0e52ed5a 100644
>> --- a/arch/x86/include/asm/fixmap.h
>> +++ b/arch/x86/include/asm/fixmap.h
>> @@ -19,7 +19,6 @@
>>  #include 
>>  #include 
>>  #include 
>> -#include 
>>  #ifdef CONFIG_X86_32
>>  #include 
>>  #include 
>
> So this change triggered a build failure on 64-bit allmodconfig - fixed via 
> the
> patch below. Your change unearthed a latent bug, a missing header inclusion.
>
> Thanks,
>
> Ingo
>
> >
> From d51953b0873358d13b189996e6976dfa12a9b59d Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <mi...@kernel.org>
> Date: Fri, 11 Dec 2015 09:01:30 +0100
> Subject: [PATCH] x86/platform/uv: Include clocksource.h for 
> clocksource_touch_watchdog()
>
> This build failure triggers on 64-bit allmodconfig:
>
>   arch/x86/platform/uv/uv_nmi.c:493:2: error: implicit declaration of 
> function ‘clocksource_touch_watchdog’ [-Werror=implicit-function-declaration]
>
> which is caused by recent changes exposing a missing clocksource.h include
> in uv_nmi.c:
>
>   cc1e24fdb064 x86/vdso: Remove pvclock fixmap machinery
>
> this file got clocksource.h indirectly via fixmap.h - that stealth route
> of header inclusion is now gone.
>
> Cc: Borislav Petkov <b...@alien8.de>
> Cc: H. Peter Anvin <h...@zytor.com>
> Cc: Linus Torvalds <torva...@linux-foundation.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Signed-off-by: Ingo Molnar <mi...@kernel.org>

LGTM.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-12-11 Thread Andy Lutomirski
On Fri, Dec 11, 2015 at 12:42 AM, Paolo Bonzini  wrote:
>
>
> On 11/12/2015 08:52, Ingo Molnar wrote:
>>
>> * Paolo Bonzini  wrote:

>>>
>>> Reviewed-by: Paolo Bonzini 
>>
>> Thanks. I've added your Reviewed-by to the 1/5 patch as well - to be able to 
>> put
>> the whole series into the tip:x86/entry tree. Let me know if you'd like it 
>> to be
>> done differently.
>
> The 1/5 patch is entirely in KVM and is not necessary for the rest of
> the series to work.  I would like it to be separate, because Marcelo has
> not yet chimed in to say why it was necessary.
>
> Can you just apply patches 2-5?

Yes, please.  I don't grok the clock update mechanism in the KVM host
well enough to be sure that patch 1 is actually correct.  All I know
is that it works better on my laptop with the patch than without the
patch and that it seems at least conceptually correct.

In any event, patch 1 is a host patch and 2-5 are guest patches, and
they only interact to the extent that it's hard for me to test 2-5 on
the guest without patch 1 on the host because without patch 1 my
laptop's host kernel tends to disable stable kvmclock, thus disabling
the entire mechanism in the guest.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


kvmclock doesn't work, help?

2015-12-09 Thread Andy Lutomirski
I'm trying to clean up kvmclock and I can't get it to work at all.  My
host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.

If I boot an SMP (2 vcpus) guest, tracing says:

 qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 0
 qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
 qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1


If I boot a UP guest, tracing says:

 qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1

I suspect, but I haven't verified, that this is fallout from:

commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
Author: Marcelo Tosatti <mtosa...@redhat.com>
Date:   Wed May 14 12:43:24 2014 -0300

KVM: x86: disable master clock if TSC is reset during suspend

Updating system_time from the kernel clock once master clock
has been enabled can result in time backwards event, in case
kernel clock frequency is lower than TSC frequency.

Disable master clock in case it is necessary to update it
from the resume path.

Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>


Can we please stop making kvmclock more complex?  It's a beast right
now, and not in a good way.  It's far too tangled with the vclock
machinery on both the host and guest sides, the pvclock stuff is not
well thought out (even in principle in an ABI sense), and it's never
been clear to my what problem exactly the kvmclock stuff is supposed
to solve.


I'm somewhat tempted to suggest that we delete kvmclock entirely and
start over.  A correctly functioning KVM guest using TSC (i.e.
ignoring kvmclock entirely) seems to work rather more reliably and
considerably faster than a kvmclock guest.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-09 Thread Andy Lutomirski
On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 09/12/2015 22:10, Andy Lutomirski wrote:
>> Can we please stop making kvmclock more complex?  It's a beast right
>> now, and not in a good way.  It's far too tangled with the vclock
>> machinery on both the host and guest sides, the pvclock stuff is not
>> well thought out (even in principle in an ABI sense), and it's never
>> been clear to my what problem exactly the kvmclock stuff is supposed
>> to solve.
>
> It's supposed to solve the problem that:
>
> - not all hosts have a working TSC

Fine, but we don't need any vdso integration for that.

>
> - even if they all do, virtual machines can be migrated (or
> saved/restored) to a host with a different TSC frequency

OK, I buy that.  So we want to export a linear function that the guest
applies to the TSC so the guest can apply it.

I suppose we also want ntp frequency corrections on the host to
propagate to the guest.

>
> - any MMIO- or PIO-based mechanism to access the current time is orders
> of magnitude slower than the TSC and less precise too.

Yup.  But TSC by itself gets that benefit, too.

>
>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>> start over.  A correctly functioning KVM guest using TSC (i.e.
>> ignoring kvmclock entirely) seems to work rather more reliably and
>> considerably faster than a kvmclock guest.
>
> If all your hosts have a working TSC and you don't do migration or
> save/restore, that's a valid configuration.  It's not a good default,
> however.

Er?

kvmclock is still really quite slow and buggy.  And the patch I
identified is definitely a problem here:

[  136.131241] KVM: disabling fast timing permanently due to inability
to recover from suspend

I got that on the host with this whitespace-damaged patch:

if (backwards_tsc) {
u64 delta_cyc = max_tsc - local_tsc;
+   if (!backwards_tsc_observed)
+   pr_warn("KVM: disabling fast timing
permanently due to inability to recover from suspend\n");

when I suspended and resumed.

Can anyone explain what problem
16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
brief inspection, it just seems to be incorrect.  Shouldn't KVM's
normal TSC logic handle that case right?  After all, all vcpus should
be paused when we resume from suspend.  At worst, we should just need
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
shouldn't we do that regardless of which way the TSC jumped on
suspend/resume?  After all, the jTSC-to-wall-clock offset is quite
likely to change except on the very small handful of CPUs (if any)
that keep the TSC running in S3 and hibernate.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-09 Thread Andy Lutomirski
On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 09/12/2015 22:49, Andy Lutomirski wrote:
>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>>
>>>
>>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>>> Can we please stop making kvmclock more complex?  It's a beast right
>>>> now, and not in a good way.  It's far too tangled with the vclock
>>>> machinery on both the host and guest sides, the pvclock stuff is not
>>>> well thought out (even in principle in an ABI sense), and it's never
>>>> been clear to my what problem exactly the kvmclock stuff is supposed
>>>> to solve.
>>>
>>> It's supposed to solve the problem that:
>>>
>>> - not all hosts have a working TSC
>>
>> Fine, but we don't need any vdso integration for that.
>
> Well, you still want a fast time source.  That was a given. :)

If the host can't figure out how to give *itself* a fast time source,
I'd be surprised if KVM can manage to give the guest a fast, reliable
time source.

>
>>> - even if they all do, virtual machines can be migrated (or
>>> saved/restored) to a host with a different TSC frequency
>>>
>>> - any MMIO- or PIO-based mechanism to access the current time is orders
>>> of magnitude slower than the TSC and less precise too.
>>
>> Yup.  But TSC by itself gets that benefit, too.
>
> Yes, the problem is if you want to solve all three of them.  The first
> two are solved by the ACPI PM timer with a decent resolution (70
> ns---much faster anyway than an I/O port access).  The third is solved
> by TSC.  To solve all three, you need kvmclock.

Still confused.  Is kvmclock really used in cases where even the host
can't pull of working TSC?

>
>>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>>>> start over.  A correctly functioning KVM guest using TSC (i.e.
>>>> ignoring kvmclock entirely) seems to work rather more reliably and
>>>> considerably faster than a kvmclock guest.
>>>
>>> If all your hosts have a working TSC and you don't do migration or
>>> save/restore, that's a valid configuration.  It's not a good default,
>>> however.
>>
>> Er?
>>
>> kvmclock is still really quite slow and buggy.
>
> Unless it takes 3-4000 clock cycles for a gettimeofday, which it
> shouldn't even with vdso disabled, it's definitely not slower than PIO.
>
>> And the patch I identified is definitely a problem here:
>>
>> [  136.131241] KVM: disabling fast timing permanently due to inability
>> to recover from suspend
>>
>> I got that on the host with this whitespace-damaged patch:
>>
>> if (backwards_tsc) {
>> u64 delta_cyc = max_tsc - local_tsc;
>> +   if (!backwards_tsc_observed)
>> +   pr_warn("KVM: disabling fast timing
>> permanently due to inability to recover from suspend\n");
>>
>> when I suspended and resumed.
>>
>> Can anyone explain what problem
>> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
>> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
>> normal TSC logic handle that case right?  After all, all vcpus should
>> be paused when we resume from suspend.  At worst, we should just need
>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
>> shouldn't we do that regardless of which way the TSC jumped on
>> suspend/resume?  After all, the jTSC-to-wall-clock offset is quite
>> likely to change except on the very small handful of CPUs (if any)
>> that keep the TSC running in S3 and hibernate.
>
> I don't recall the details of that patch, so Marcelo will have to answer
> this, or Alex too since he chimed in the original thread.  At least it
> should be made conditional on the existence of a VM at suspend time (and
> the master clock stuff should be made per VM, as I suggested at
> https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
>
> It would indeed be great if the master clock could be dropped.  But I'm
> definitely missing some of the subtle details. :(

Me, too.

Anyway, see the attached untested patch.  Marcelo?

--Andy
From e4a5e834d3fb6fc2499966e1af42cb5bd59f4410 Mon Sep 17 00:00:00 2001
Message-Id: <e4a5e834d3fb6fc2499966e1af42cb5bd59f4410.1449700024.git.l...@kernel.org>
From: Andy Lutomirski <l...@kernel.org>
Date: Wed, 9 Dec 2015 14:21:05 -0800
Subject: [PATCH] x86/kvm: On KVM re-enable (e.g. after suspect), update clocks

This gets rid of the "did TSC 

Re: kvmclock doesn't work, help?

2015-12-09 Thread Andy Lutomirski
On Wed, Dec 9, 2015 at 2:27 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>
>>
>> On 09/12/2015 22:49, Andy Lutomirski wrote:
>>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>>>
>>>>
>>>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>>>> Can we please stop making kvmclock more complex?  It's a beast right
>>>>> now, and not in a good way.  It's far too tangled with the vclock
>>>>> machinery on both the host and guest sides, the pvclock stuff is not
>>>>> well thought out (even in principle in an ABI sense), and it's never
>>>>> been clear to my what problem exactly the kvmclock stuff is supposed
>>>>> to solve.
>>>>
>>>> It's supposed to solve the problem that:
>>>>
>>>> - not all hosts have a working TSC
>>>
>>> Fine, but we don't need any vdso integration for that.
>>
>> Well, you still want a fast time source.  That was a given. :)
>
> If the host can't figure out how to give *itself* a fast time source,
> I'd be surprised if KVM can manage to give the guest a fast, reliable
> time source.
>
>>
>>>> - even if they all do, virtual machines can be migrated (or
>>>> saved/restored) to a host with a different TSC frequency
>>>>
>>>> - any MMIO- or PIO-based mechanism to access the current time is orders
>>>> of magnitude slower than the TSC and less precise too.
>>>
>>> Yup.  But TSC by itself gets that benefit, too.
>>
>> Yes, the problem is if you want to solve all three of them.  The first
>> two are solved by the ACPI PM timer with a decent resolution (70
>> ns---much faster anyway than an I/O port access).  The third is solved
>> by TSC.  To solve all three, you need kvmclock.
>
> Still confused.  Is kvmclock really used in cases where even the host
> can't pull of working TSC?
>
>>
>>>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>>>>> start over.  A correctly functioning KVM guest using TSC (i.e.
>>>>> ignoring kvmclock entirely) seems to work rather more reliably and
>>>>> considerably faster than a kvmclock guest.
>>>>
>>>> If all your hosts have a working TSC and you don't do migration or
>>>> save/restore, that's a valid configuration.  It's not a good default,
>>>> however.
>>>
>>> Er?
>>>
>>> kvmclock is still really quite slow and buggy.
>>
>> Unless it takes 3-4000 clock cycles for a gettimeofday, which it
>> shouldn't even with vdso disabled, it's definitely not slower than PIO.
>>
>>> And the patch I identified is definitely a problem here:
>>>
>>> [  136.131241] KVM: disabling fast timing permanently due to inability
>>> to recover from suspend
>>>
>>> I got that on the host with this whitespace-damaged patch:
>>>
>>> if (backwards_tsc) {
>>> u64 delta_cyc = max_tsc - local_tsc;
>>> +   if (!backwards_tsc_observed)
>>> +   pr_warn("KVM: disabling fast timing
>>> permanently due to inability to recover from suspend\n");
>>>
>>> when I suspended and resumed.
>>>
>>> Can anyone explain what problem
>>> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
>>> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
>>> normal TSC logic handle that case right?  After all, all vcpus should
>>> be paused when we resume from suspend.  At worst, we should just need
>>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
>>> shouldn't we do that regardless of which way the TSC jumped on
>>> suspend/resume?  After all, the jTSC-to-wall-clock offset is quite
>>> likely to change except on the very small handful of CPUs (if any)
>>> that keep the TSC running in S3 and hibernate.
>>
>> I don't recall the details of that patch, so Marcelo will have to answer
>> this, or Alex too since he chimed in the original thread.  At least it
>> should be made conditional on the existence of a VM at suspend time (and
>> the master clock stuff should be made per VM, as I suggested at
>> https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
>>
>> It would indeed be great if the master clock could be dropped.  But I'm
>> defi

[PATCH 1/5] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks

2015-12-09 Thread Andy Lutomirski
This gets rid of the "did TSC go backwards" logic and just updates
all clocks.  It should work better (no more disabling of fast
timing) and more reliably (all of the clocks are actually updated).

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 arch/x86/kvm/x86.c | 75 +++---
 1 file changed, 3 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eed32283d22c..c88f91f4b1a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 unsigned int __read_mostly lapic_timer_advance_ns = 0;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
 
-static bool __read_mostly backwards_tsc_observed = false;
-
 #define KVM_NR_SHARED_MSRS 16
 
 struct kvm_shared_msrs_global {
@@ -1671,7 +1669,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>master_cycle_now);
 
ka->use_master_clock = host_tsc_clocksource && vcpus_matched
-   && !backwards_tsc_observed
&& !ka->boot_vcpu_runs_old_kvmclock;
 
if (ka->use_master_clock)
@@ -7369,88 +7366,22 @@ int kvm_arch_hardware_enable(void)
struct kvm_vcpu *vcpu;
int i;
int ret;
-   u64 local_tsc;
-   u64 max_tsc = 0;
-   bool stable, backwards_tsc = false;
 
kvm_shared_msr_cpu_online();
ret = kvm_x86_ops->hardware_enable();
if (ret != 0)
return ret;
 
-   local_tsc = rdtsc();
-   stable = !check_tsc_unstable();
list_for_each_entry(kvm, _list, vm_list) {
kvm_for_each_vcpu(i, vcpu, kvm) {
-   if (!stable && vcpu->cpu == smp_processor_id())
+   if (vcpu->cpu == smp_processor_id()) {
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-   if (stable && vcpu->arch.last_host_tsc > local_tsc) {
-   backwards_tsc = true;
-   if (vcpu->arch.last_host_tsc > max_tsc)
-   max_tsc = vcpu->arch.last_host_tsc;
+   kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
+vcpu);
}
}
}
 
-   /*
-* Sometimes, even reliable TSCs go backwards.  This happens on
-* platforms that reset TSC during suspend or hibernate actions, but
-* maintain synchronization.  We must compensate.  Fortunately, we can
-* detect that condition here, which happens early in CPU bringup,
-* before any KVM threads can be running.  Unfortunately, we can't
-* bring the TSCs fully up to date with real time, as we aren't yet far
-* enough into CPU bringup that we know how much real time has actually
-* elapsed; our helper function, get_kernel_ns() will be using boot
-* variables that haven't been updated yet.
-*
-* So we simply find the maximum observed TSC above, then record the
-* adjustment to TSC in each VCPU.  When the VCPU later gets loaded,
-* the adjustment will be applied.  Note that we accumulate
-* adjustments, in case multiple suspend cycles happen before some VCPU
-* gets a chance to run again.  In the event that no KVM threads get a
-* chance to run, we will miss the entire elapsed period, as we'll have
-* reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
-* loose cycle time.  This isn't too big a deal, since the loss will be
-* uniform across all VCPUs (not to mention the scenario is extremely
-* unlikely). It is possible that a second hibernate recovery happens
-* much faster than a first, causing the observed TSC here to be
-* smaller; this would require additional padding adjustment, which is
-* why we set last_host_tsc to the local tsc observed here.
-*
-* N.B. - this code below runs only on platforms with reliable TSC,
-* as that is the only way backwards_tsc is set above.  Also note
-* that this runs for ALL vcpus, which is not a bug; all VCPUs should
-* have the same delta_cyc adjustment applied if backwards_tsc
-* is detected.  Note further, this adjustment is only done once,
-* as we reset last_host_tsc on all VCPUs to stop this from being
-* called multiple times (one for each physical CPU bringup).
-*
-* Platforms with unreliable TSCs don't have to deal with this, they
-* will be compensated by the logic in vcpu_load, which sets the TSC to
-* catchup mode.  This will catchup all VCPUs to real time, but cannot
-

[PATCH 2/5] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-12-09 Thread Andy Lutomirski
From: Andy Lutomirski <l...@amacapital.net>

The pvclock vdso code was too abstracted to understand easily and
excessively paranoid.  Simplify it for a huge speedup.

This opens the door for additional simplifications, as the vdso no
longer accesses the pvti for any vcpu other than vcpu 0.

Before, vclock_gettime using kvm-clock took about 45ns on my machine.
With this change, it takes 29ns, which is almost as fast as the pure TSC
implementation.

Signed-off-by: Andy Lutomirski <l...@amacapital.net>
---
 arch/x86/entry/vdso/vclock_gettime.c | 81 
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index ca94fa649251..c325ba1bdddf 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info 
*get_pvti(int cpu)
 
 static notrace cycle_t vread_pvclock(int *mode)
 {
-   const struct pvclock_vsyscall_time_info *pvti;
+   const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti;
cycle_t ret;
-   u64 last;
-   u32 version;
-   u8 flags;
-   unsigned cpu, cpu1;
-
+   u64 tsc, pvti_tsc;
+   u64 last, delta, pvti_system_time;
+   u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
 
/*
-* Note: hypervisor must guarantee that:
-* 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
-* 2. that per-CPU pvclock time info is updated if the
-*underlying CPU changes.
-* 3. that version is increased whenever underlying CPU
-*changes.
+* Note: The kernel and hypervisor must guarantee that cpu ID
+* number maps 1:1 to per-CPU pvclock time info.
+*
+* Because the hypervisor is entirely unaware of guest userspace
+* preemption, it cannot guarantee that per-CPU pvclock time
+* info is updated if the underlying CPU changes or that that
+* version is increased whenever underlying CPU changes.
 *
+* On KVM, we are guaranteed that pvti updates for any vCPU are
+* atomic as seen by *all* vCPUs.  This is an even stronger
+* guarantee than we get with a normal seqlock.
+*
+* On Xen, we don't appear to have that guarantee, but Xen still
+* supplies a valid seqlock using the version field.
+
+* We only do pvclock vdso timing at all if
+* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
+* mean that all vCPUs have matching pvti and that the TSC is
+* synced, so we can just look at vCPU 0's pvti.
 */
-   do {
-   cpu = __getcpu() & VGETCPU_CPU_MASK;
-   /* TODO: We can put vcpu id into higher bits of pvti.version.
-* This will save a couple of cycles by getting rid of
-* __getcpu() calls (Gleb).
-*/
-
-   pvti = get_pvti(cpu);
-
-   version = __pvclock_read_cycles(>pvti, , );
-
-   /*
-* Test we're still on the cpu as well as the version.
-* We could have been migrated just after the first
-* vgetcpu but before fetching the version, so we
-* wouldn't notice a version change.
-*/
-   cpu1 = __getcpu() & VGETCPU_CPU_MASK;
-   } while (unlikely(cpu != cpu1 ||
- (pvti->pvti.version & 1) ||
- pvti->pvti.version != version));
-
-   if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+
+   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
+   return 0;
+   }
+
+   do {
+   version = pvti->version;
+
+   /* This is also a read barrier, so we'll read version first. */
+   tsc = rdtsc_ordered();
+
+   pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
+   pvti_tsc_shift = pvti->tsc_shift;
+   pvti_system_time = pvti->system_time;
+   pvti_tsc = pvti->tsc_timestamp;
+
+   /* Make sure that the version double-check is last. */
+   smp_rmb();
+   } while (unlikely((version & 1) || version != pvti->version));
+
+   delta = tsc - pvti_tsc;
+   ret = pvti_system_time +
+   pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
+   pvti_tsc_shift);
 
/* refer to tsc.c read_tsc() comment for rationale */
last = gtod->cycle_last;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] x86: KVM vdso and clock improvements

2015-12-09 Thread Andy Lutomirski
NB: patch 1 doesn't really belong here, but it makes this a lot
easier for me to test.  Patch 1, if it's okay at all, should go
though the kvm tree.  The rest should probably go through
tip:x86/vdso once they're reviewed.

I'll do a followup to enable vdso pvclock on 32-bit guests.
I'm not currently set up to test it.  (The KVM people could also
do it very easily on top of these patches.)

Andy Lutomirski (5):
  x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
  x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
  x86/vdso: Remove pvclock fixmap machinery
  x86/vdso: Enable vdso pvclock access on all vdso variants

 arch/x86/entry/vdso/vclock_gettime.c  | 151 --
 arch/x86/entry/vdso/vdso-layout.lds.S |   3 +-
 arch/x86/entry/vdso/vdso2c.c  |   3 +
 arch/x86/entry/vdso/vma.c |  14 
 arch/x86/include/asm/fixmap.h |   5 --
 arch/x86/include/asm/pvclock.h|  14 ++--
 arch/x86/include/asm/vdso.h   |   1 +
 arch/x86/kernel/kvmclock.c|  11 ++-
 arch/x86/kernel/pvclock.c |  24 --
 arch/x86/kvm/x86.c|  75 +
 10 files changed, 110 insertions(+), 191 deletions(-)

-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap

2015-12-09 Thread Andy Lutomirski
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 arch/x86/entry/vdso/vclock_gettime.c  | 20 
 arch/x86/entry/vdso/vdso-layout.lds.S |  3 ++-
 arch/x86/entry/vdso/vdso2c.c  |  3 +++
 arch/x86/entry/vdso/vma.c | 13 +
 arch/x86/include/asm/pvclock.h|  9 +
 arch/x86/include/asm/vdso.h   |  1 +
 arch/x86/kernel/kvmclock.c|  5 +
 7 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index c325ba1bdddf..5dd363d54348 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void)
 }
 #endif
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+extern u8 pvclock_page
+   __attribute__((visibility("hidden")));
+#endif
+
 #ifndef BUILD_VDSO32
 
 #include 
@@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, 
struct timezone *tz)
 
 #ifdef CONFIG_PARAVIRT_CLOCK
 
-static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
-   const struct pvclock_vsyscall_time_info *pvti_base;
-   int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
-   int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
-
-   BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
-
-   pvti_base = (struct pvclock_vsyscall_time_info *)
-   __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
-
-   return _base[offset];
+   return (const struct pvclock_vsyscall_time_info *)_page;
 }
 
 static notrace cycle_t vread_pvclock(int *mode)
 {
-   const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti;
+   const struct pvclock_vcpu_time_info *pvti = _pvti0()->pvti;
cycle_t ret;
u64 tsc, pvti_tsc;
u64 last, delta, pvti_system_time;
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S 
b/arch/x86/entry/vdso/vdso-layout.lds.S
index de2c921025f5..4158acc17df0 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -25,7 +25,7 @@ SECTIONS
 * segment.
 */
 
-   vvar_start = . - 2 * PAGE_SIZE;
+   vvar_start = . - 3 * PAGE_SIZE;
vvar_page = vvar_start;
 
/* Place all vvars at the offsets in asm/vvar.h. */
@@ -36,6 +36,7 @@ SECTIONS
 #undef EMIT_VVAR
 
hpet_page = vvar_start + PAGE_SIZE;
+   pvclock_page = vvar_start + 2 * PAGE_SIZE;
 
. = SIZEOF_HEADERS;
 
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 785d9922b106..491020b2826d 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -73,6 +73,7 @@ enum {
sym_vvar_start,
sym_vvar_page,
sym_hpet_page,
+   sym_pvclock_page,
sym_VDSO_FAKE_SECTION_TABLE_START,
sym_VDSO_FAKE_SECTION_TABLE_END,
 };
@@ -80,6 +81,7 @@ enum {
 const int special_pages[] = {
sym_vvar_page,
sym_hpet_page,
+   sym_pvclock_page,
 };
 
 struct vdso_sym {
@@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = {
[sym_vvar_start] = {"vvar_start", true},
[sym_vvar_page] = {"vvar_page", true},
[sym_hpet_page] = {"hpet_page", true},
+   [sym_pvclock_page] = {"pvclock_page", true},
[sym_VDSO_FAKE_SECTION_TABLE_START] = {
"VDSO_FAKE_SECTION_TABLE_START", false
},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 64df47148160..aa828191c654 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool 
calculate_addr)
.name = "[vvar]",
.pages = no_pages,
};
+   struct pvclock_vsyscall_time_info *pvti;
 
if (calculate_addr) {
addr = vdso_addr(current->mm->start_stack,
@@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool 
calculate_addr)
}
 #endif
 
+   pvti = pvclock_pvti_cpu0_va();
+   if (pvti && image->sym_pvclock_page) {
+   ret = remap_pfn_range(vma,
+ text_start + image->sym_pvclock_page,
+ __pa(pvti) >> PAGE_SHIFT,
+ PAGE_SIZE,
+ PAGE_READONLY);
+
+   if (ret)
+   goto up_fail;
+   }
+
 up_fail:
if (ret)
current->mm->context.vdso = NULL;
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 7a6bed5c08bc..3864398c7cb2 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,6 +4,15 @@
 #include 
 #include 
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+extern struct pvclock_vsyscall

[PATCH 4/5] x86/vdso: Remove pvclock fixmap machinery

2015-12-09 Thread Andy Lutomirski
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 arch/x86/entry/vdso/vclock_gettime.c |  1 -
 arch/x86/entry/vdso/vma.c|  1 +
 arch/x86/include/asm/fixmap.h|  5 -
 arch/x86/include/asm/pvclock.h   |  5 -
 arch/x86/kernel/kvmclock.c   |  6 --
 arch/x86/kernel/pvclock.c| 24 
 6 files changed, 1 insertion(+), 41 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 5dd363d54348..59a98c25bde7 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -45,7 +45,6 @@ extern u8 pvclock_page
 
 #include 
 #include 
-#include 
 #include 
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index aa828191c654..b8f69e264ac4 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index f80d70009ff8..6d7d0e52ed5a 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 #ifdef CONFIG_X86_32
 #include 
 #include 
@@ -72,10 +71,6 @@ enum fixed_addresses {
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
 #endif
-#ifdef CONFIG_PARAVIRT_CLOCK
-   PVCLOCK_FIXMAP_BEGIN,
-   PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
-#endif
 #endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 3864398c7cb2..66df22b2e0c9 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info {
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
-#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
-int size);
-struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
 
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ec1b06dc82d2..72cef58693c7 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 {
 #ifdef CONFIG_X86_64
int cpu;
-   int ret;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
unsigned int size;
@@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
return 1;
}
 
-   if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
-   put_cpu();
-   return ret;
-   }
-
put_cpu();
 
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 2f355d229a58..99bfc025111d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
*wall_clock,
 
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
-
-#ifdef CONFIG_X86_64
-/*
- * Initialize the generic pvclock vsyscall state.  This will allocate
- * a/some page(s) for the per-vcpu pvclock information, set up a
- * fixmap mapping for the page(s)
- */
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
-int size)
-{
-   int idx;
-
-   WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
-
-   for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
-   __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
-__pa(i) + (idx*PAGE_SIZE),
-PAGE_KERNEL_VVAR);
-   }
-
-   return 0;
-}
-#endif
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] x86/vdso: Enable vdso pvclock access on all vdso variants

2015-12-09 Thread Andy Lutomirski
Now that pvclock doesn't require access to the fixmap, all vdso
variants can use it.

The kernel side isn't wired up for 32-bit kernels yet, but this
covers 32-bit and x32 userspace on 64-bit kernels.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 arch/x86/entry/vdso/vclock_gettime.c | 91 
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 59a98c25bde7..8602f06c759f 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -17,8 +17,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 
 #define gtod ((vsyscall_gtod_data))
 
@@ -43,10 +45,6 @@ extern u8 pvclock_page
 
 #ifndef BUILD_VDSO32
 
-#include 
-#include 
-#include 
-
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
long ret;
@@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, 
struct timezone *tz)
return ret;
 }
 
-#ifdef CONFIG_PARAVIRT_CLOCK
 
+#else
+
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+{
+   long ret;
+
+   asm(
+   "mov %%ebx, %%edx \n"
+   "mov %2, %%ebx \n"
+   "call __kernel_vsyscall \n"
+   "mov %%edx, %%ebx \n"
+   : "=a" (ret)
+   : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
+   : "memory", "edx");
+   return ret;
+}
+
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+   long ret;
+
+   asm(
+   "mov %%ebx, %%edx \n"
+   "mov %2, %%ebx \n"
+   "call __kernel_vsyscall \n"
+   "mov %%edx, %%ebx \n"
+   : "=a" (ret)
+   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
+   : "memory", "edx");
+   return ret;
+}
+
+#endif
+
+#ifdef CONFIG_PARAVIRT_CLOCK
 static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
return (const struct pvclock_vsyscall_time_info *)_page;
@@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode)
do {
version = pvti->version;
 
-   /* This is also a read barrier, so we'll read version first. */
-   tsc = rdtsc_ordered();
+   smp_rmb();
 
+   tsc = rdtsc_ordered();
pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
pvti_tsc_shift = pvti->tsc_shift;
pvti_system_time = pvti->system_time;
@@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode)
pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
pvti_tsc_shift);
 
-   /* refer to tsc.c read_tsc() comment for rationale */
+   /* refer to vread_tsc() comment for rationale */
last = gtod->cycle_last;
 
if (likely(ret >= last))
@@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode)
 }
 #endif
 
-#else
-
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
-   : "memory", "edx");
-   return ret;
-}
-
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-   : "memory", "edx");
-   return ret;
-}
-
-#ifdef CONFIG_PARAVIRT_CLOCK
-
-static notrace cycle_t vread_pvclock(int *mode)
-{
-   *mode = VCLOCK_NONE;
-   return 0;
-}
-#endif
-
-#endif
-
 notrace static cycle_t vread_tsc(void)
 {
cycle_t ret = (cycle_t)rdtsc_ordered();
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-19 Thread Andy Lutomirski
On Nov 19, 2015 5:45 AM, "Michael S. Tsirkin" <m...@redhat.com> wrote:
>
> On Tue, Oct 27, 2015 at 11:38:57PM -0700, Andy Lutomirski wrote:
> > This switches virtio to use the DMA API unconditionally.  I'm sure
> > it breaks things, but it seems to work on x86 using virtio-pci, with
> > and without Xen, and using both the modern 1.0 variant and the
> > legacy variant.
>
> So thinking hard about it, I don't see any real drawbacks to making this
> conditional on a new feature bit, that Xen can then set..

Can you elaborate?  If I run QEMU, hosting Xen, hosting Linux, and the
virtio device is provided by QEMU, then how does Xen set the bit?
Similarly, how would Xen set the bit for a real physical device?


--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm/vmx: EPTP switching test

2015-11-16 Thread Andy Lutomirski
On Mon, Nov 16, 2015 at 10:18 AM,
=?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=  wrote:
> 2015-11-16 19:59+0200, Michael S. Tsirkin:
>> On Mon, Nov 16, 2015 at 06:51:06PM +0100, 
>> =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?= wrote:
>>> 2015-11-15 18:00+0200, Michael S. Tsirkin:
>>> (And I think that eptp switching is expected to be used in conjuction
>>>  with #VE, so it'd then make sense to implement a nop for it as well.)
>>
>> No idea how would I even test it, so I'm not interested in #VE at this
>> point.  If you are - go ahead and post a patch for that on top though,
>> why not.
>
> I thought that it's going to be simpler to provide functionality (that
> utilizes eptp switching) to the guest through #VE, which probably isn't
> true as I think more about it.  (Not interested in implementing it :])

I'm willing to review a #VE stub.  I'm not likely to implement it.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-11 Thread Andy Lutomirski
On Wed, Nov 11, 2015 at 2:05 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Nov 10, 2015 at 10:54:21AM -0800, Andy Lutomirski wrote:
>> On Nov 10, 2015 7:02 AM, "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> >
>> > On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
>> > > On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
>> > > > I have no problem with that. For example, can we teach
>> > > > the DMA API on intel x86 to use PT for virtio by default?
>> > > > That would allow merging Andy's patches with
>> > > > full compatibility with old guests and hosts.
>> > >
>> > > Well, the only incompatibility comes from an experimental qemu feature,
>> > > more explicitly from a bug in that features implementation. So why
>> > > should we work around that in the kernel? I think it is not too hard to
>> > > fix qemu to generate a correct DMAR table which excludes the virtio
>> > > devices from iommu translation.
>> > >
>> > >
>> > >   Joerg
>> >
>> > It's not that easy - you'd have to dedicate some buses
>> > for iommu bypass, and teach management tools to only put
>> > virtio there - but it's possible.
>> >
>> > This will absolutely address guests that don't need to set up IOMMU for
>> > virtio devices, and virtio that bypasses the IOMMU.
>> >
>> > But the problem is that we do want to *allow* guests
>> > to set up IOMMU for virtio devices.
>> > In that case, these are two other usecases:
>> >
>> > A- monolitic virtio within QEMU:
>> > iommu only needed for VFIO ->
>> > guest should always use iommu=pt
>> > iommu=on works but is just useless overhead.
>> >
>> > B- modular out of process virtio outside QEMU:
>> > iommu needed for VFIO or kernel driver ->
>> > guest should use iommu=pt or iommu=on
>> > depending on security/performance requirements
>> >
>> > Note that there could easily be a mix of these in the same system.
>> >
>> > So for these cases we do need QEMU to specify to guest that IOMMU covers
>> > the virtio devices.  Also, once one does this, the default on linux is
>> > iommu=on and not pt, which works but ATM is very slow.
>> >
>> > This poses three problems:
>> >
>> > 1. How do we address the different needs of A and B?
>> >One way would be for virtio to pass the information to guest
>> >using some virtio specific way, and have drivers
>> >specify what kind of DMA access they want.
>> >
>> > 2. (Kind of a subset of 1) once we do allow IOMMU, how do we make sure 
>> > most guests
>> >use the more sensible iommu=pt.
>> >
>> > 3. Once we do allow IOMMU, how can we keep existing guests work in this 
>> > configuration?
>> >Creating different hypervisor configurations depending on guest is very 
>> > nasty.
>> >Again, one way would be some virtio specific interface.
>> >
>> > I'd rather we figured the answers to this before merging Andy's patches
>> > because I'm concerned that instead of 1 broken configuration
>> > (virtio always bypasses IOMMU) we'll get two bad configurations
>> > (in the second one, virtio uses the slow default with no
>> > gain in security).
>> >
>> > Suggestions wellcome.
>>
>> I think there's still no downside of using my patches, even on x86.
>>
>> Old kernels on new QEMU work unless IOMMU is enabled on the host.  I
>> think that's the best we can possibly do.
>> New kernels work at full speed on old QEMU.
>
> Only if IOMMU is disabled, right?
>
>> New kernels with new QEMU and iommu enabled work slower.  Even newer
>> kernels with default passthrough work at full speed, and there's no
>> obvious downside to the existence of kernels with just my patches.
>>
>> --Andy
>>
>
> I tried to explain the possible downside. Let me try again.  Imagine
> that guest kernel notifies hypervisor that it wants IOMMU to actually
> work.  This will make old kernel on new QEMU work even with IOMMU
> enabled on host - better than "the best we can do" that you described
> above.  Specifically, QEMU will assume that if it didn't get
> notification, it's an old kernel so it should ignore the IOMMU.

Can you flesh out this trick?

On x86 IIUC the IOMMU more-or-less defaults to passthrou

Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-10 Thread Andy Lutomirski
On Nov 10, 2015 7:02 AM, "Michael S. Tsirkin"  wrote:
>
> On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
> > On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
> > > I have no problem with that. For example, can we teach
> > > the DMA API on intel x86 to use PT for virtio by default?
> > > That would allow merging Andy's patches with
> > > full compatibility with old guests and hosts.
> >
> > Well, the only incompatibility comes from an experimental qemu feature,
> > more explicitly from a bug in that features implementation. So why
> > should we work around that in the kernel? I think it is not too hard to
> > fix qemu to generate a correct DMAR table which excludes the virtio
> > devices from iommu translation.
> >
> >
> >   Joerg
>
> It's not that easy - you'd have to dedicate some buses
> for iommu bypass, and teach management tools to only put
> virtio there - but it's possible.
>
> This will absolutely address guests that don't need to set up IOMMU for
> virtio devices, and virtio that bypasses the IOMMU.
>
> But the problem is that we do want to *allow* guests
> to set up IOMMU for virtio devices.
> In that case, these are two other usecases:
>
> A- monolitic virtio within QEMU:
> iommu only needed for VFIO ->
> guest should always use iommu=pt
> iommu=on works but is just useless overhead.
>
> B- modular out of process virtio outside QEMU:
> iommu needed for VFIO or kernel driver ->
> guest should use iommu=pt or iommu=on
> depending on security/performance requirements
>
> Note that there could easily be a mix of these in the same system.
>
> So for these cases we do need QEMU to specify to guest that IOMMU covers
> the virtio devices.  Also, once one does this, the default on linux is
> iommu=on and not pt, which works but ATM is very slow.
>
> This poses three problems:
>
> 1. How do we address the different needs of A and B?
>One way would be for virtio to pass the information to guest
>using some virtio specific way, and have drivers
>specify what kind of DMA access they want.
>
> 2. (Kind of a subset of 1) once we do allow IOMMU, how do we make sure most 
> guests
>use the more sensible iommu=pt.
>
> 3. Once we do allow IOMMU, how can we keep existing guests work in this 
> configuration?
>Creating different hypervisor configurations depending on guest is very 
> nasty.
>Again, one way would be some virtio specific interface.
>
> I'd rather we figured the answers to this before merging Andy's patches
> because I'm concerned that instead of 1 broken configuration
> (virtio always bypasses IOMMU) we'll get two bad configurations
> (in the second one, virtio uses the slow default with no
> gain in security).
>
> Suggestions wellcome.

I think there's still no downside of using my patches, even on x86.

Old kernels on new QEMU work unless IOMMU is enabled on the host.  I
think that's the best we can possibly do.

New kernels work at full speed on old QEMU.

New kernels with new QEMU and iommu enabled work slower.  Even newer
kernels with default passthrough work at full speed, and there's no
obvious downside to the existence of kernels with just my patches.

--Andy

>
> --
> MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Andy Lutomirski
On Nov 10, 2015 2:38 AM, "Benjamin Herrenschmidt"
<b...@kernel.crashing.org> wrote:
>
> On Mon, 2015-11-09 at 21:35 -0800, Andy Lutomirski wrote:
> >
> > We could do it the other way around: on powerpc, if a PCI device is in
> > that range and doesn't have the "bypass" property at all, then it's
> > assumed to bypass the IOMMU.  This means that everything that
> > currently works continues working.  If someone builds a physical
> > virtio device or uses another system in PCIe target mode speaking
> > virtio, then it won't work until they upgrade their firmware to set
> > bypass=0.  Meanwhile everyone using hypothetical new QEMU also gets
> > bypass=0 and no ambiguity.
> >
> > vfio will presumably notice the bypass and correctly refuse to map any
> > current virtio devices.
> >
> > Would that work?
>
> That would be extremely strange from a platform perspective. Any device
> in that vendor/device range would bypass the iommu unless some new
> property "actually-works-like-a-real-pci-device" happens to exist in
> the device-tree, which we would then need to define somewhere and
> handle accross at least 3 different platforms who get their device-tree
> from widly different places.
>
> Also if tomorrow I create a PCI device that implements virtio-net and
> put it in a machine running IBM proprietary firmware (or Apple's or
> Sun's), it won't have that property...
>
> This is not hypothetical. People are using virtio to do point-to-point
> communication between machines via PCIe today.

Does that work on powerpc on existing kernels?

Anyway, here's another crazy idea: make the quirk assume that the
IOMMU is bypasses if and only if the weak barriers bit is set on
systems that are missing the new DT binding.

--Andy

>
> Cheers,
> Ben.
>
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Andy Lutomirski
On Tue, Nov 10, 2015 at 2:27 PM, Benjamin Herrenschmidt
<b...@kernel.crashing.org> wrote:
> On Tue, 2015-11-10 at 10:54 -0800, Andy Lutomirski wrote:
>>
>> Does that work on powerpc on existing kernels?
>>
>> Anyway, here's another crazy idea: make the quirk assume that the
>> IOMMU is bypasses if and only if the weak barriers bit is set on
>> systems that are missing the new DT binding.
>
> "New DT bindings" doesn't mean much ... how do we change DT bindings on
> existing machines with a FW in flash ?
>
> What about partition <-> partition virtio such as what we could do on
> PAPR systems. That would have the weak barrier bit.
>

Is it partition <-> partition, bypassing IOMMU?

I think I'd settle for just something that doesn't regress
non-experimental setups that actually work today and that allow new
setups (x86 with fixed QEMU and maybe something more complicated on
powerpc and/or sparc) to work in all cases.

We could certainly just make powerpc and sparc continue bypassing the
IOMMU until someone comes up with a way to fix it.  I'll send out some
patches that do that, and maybe that'll help this make progress.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Andy Lutomirski
On Nov 10, 2015 4:44 PM, "Benjamin Herrenschmidt"
<b...@kernel.crashing.org> wrote:
>
> On Tue, 2015-11-10 at 15:44 -0800, Andy Lutomirski wrote:
> >
> > > What about partition <-> partition virtio such as what we could do on
> > > PAPR systems. That would have the weak barrier bit.
> > >
> >
> > Is it partition <-> partition, bypassing IOMMU?
>
> No.
>
> > I think I'd settle for just something that doesn't regress
> > non-experimental setups that actually work today and that allow new
> > setups (x86 with fixed QEMU and maybe something more complicated on
> > powerpc and/or sparc) to work in all cases.
> >
> > We could certainly just make powerpc and sparc continue bypassing the
> > IOMMU until someone comes up with a way to fix it.  I'll send out some
> > patches that do that, and maybe that'll help this make progress.
>
> But we haven't found a solution that works. All we have come up with is
> a quirk that will force bypass on virtio always and will not allow us
> to operate non-bypassing devices on either of those architectures in
> the future.
>
> I'm not too happy about this.

Me neither.  At least it wouldn't be a regression, but it's still crappy.

I think that arm is fine, at least.  I was unable to find an arm QEMU
config that has any problems with my patches.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Andy Lutomirski
On Mon, Nov 9, 2015 at 9:28 PM, Benjamin Herrenschmidt
<b...@kernel.crashing.org> wrote:
> On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
>>
>> /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF.
>> */
>> static const struct pci_device_id virtio_pci_id_table[] = {
>> { PCI_DEVICE(0x1af4, PCI_ANY_ID) },
>> { 0 }
>> };
>>
>> Can we match on that range?
>
> We can, but the problem remains, how do we differenciate an existing
> device that does bypass vs. a newer one that needs the IOMMU and thus
> doesn't have the new "bypass" property in the device-tree.
>

We could do it the other way around: on powerpc, if a PCI device is in
that range and doesn't have the "bypass" property at all, then it's
assumed to bypass the IOMMU.  This means that everything that
currently works continues working.  If someone builds a physical
virtio device or uses another system in PCIe target mode speaking
virtio, then it won't work until they upgrade their firmware to set
bypass=0.  Meanwhile everyone using hypothetical new QEMU also gets
bypass=0 and no ambiguity.

vfio will presumably notice the bypass and correctly refuse to map any
current virtio devices.

Would that work?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Andy Lutomirski
On Mon, Nov 9, 2015 at 9:26 PM, Benjamin Herrenschmidt
<b...@kernel.crashing.org> wrote:
> On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
>>
>> Which leaves the special case of Xen, where even preexisting devices
>> don't bypass the IOMMU.  Can we keep this specific to powerpc and
>> sparc?  On x86, this problem is basically nonexistent, since the IOMMU
>> is properly self-describing.
>>
>> IOW, I think that on x86 we should assume that all virtio devices
>> honor the IOMMU.
>
> You don't like performances ? :-)

This should have basically no effect.  Every non-experimental x86
virtio setup in existence either doesn't work at all (Xen) or has DMA
ops that are no-ops.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Andy Lutomirski
On Mon, Nov 9, 2015 at 6:04 PM, Benjamin Herrenschmidt
<b...@kernel.crashing.org> wrote:
> On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote:
>> The problem here is that in some of the problematic cases the virtio
>> driver may not even be loaded.  If someone runs an L1 guest with an
>> IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
>> *boom* L1 crashes.  (Same if, say, DPDK gets used, I think.)
>>
>> >
>> > The only way out of this while keeping the "platform" stuff would be to
>> > also bump some kind of version in the virtio config (or PCI header). I
>> > have no other way to differenciate between "this is an old qemu that
>> > doesn't do the 'bypass property' yet" from "this is a virtio device
>> > that doesn't bypass".
>> >
>> > Any better idea ?
>>
>> I'd suggest that, in the absence of the new DT binding, we assume that
>> any PCI device with the virtio vendor ID is passthrough on powerpc.  I
>> can do this in the virtio driver, but if it's in the platform code
>> then vfio gets it right too (i.e. fails to load).
>
> The problem is there isn't *a* virtio vendor ID. It's the RedHat vendor
> ID which will be used by more than just virtio, so we need to
> specifically list the devices.

Really?

/* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
static const struct pci_device_id virtio_pci_id_table[] = {
{ PCI_DEVICE(0x1af4, PCI_ANY_ID) },
{ 0 }
};

Can we match on that range?

>
> Additionally, that still means that once we have a virtio device that
> actually uses the iommu, powerpc will not work since the "workaround"
> above will kick in.

I don't know how to solve that problem, though, especially since the
vendor of such a device (especially if it's real hardware) might not
set any new bit.

>
> The "in absence of the new DT binding" doesn't make that much sense.
>
> Those platforms use device-trees defined since the dawn of ages by
> actual open firmware implementations, they either have no iommu
> representation in there (Macs, the platform code hooks it all up) or
> have various properties related to the iommu but no concept of "bypass"
> in there.
>
> We can *add* a new property under some circumstances that indicates a
> bypass on a per-device basis, however that doesn't completely solve it:
>
>   - As I said above, what does the absence of that property mean ? An
> old qemu that does bypass on all virtio or a new qemu trying to tell
> you that the virtio device actually does use the iommu (or some other
> environment that isn't qemu) ?
>
>   - On things like macs, the device-tree is generated by openbios, it
> would have to have some added logic to try to figure that out, which
> means it needs to know *via different means* that some or all virtio
> devices bypass the iommu.
>
> I thus go back to my original statement, it's a LOT easier to handle if
> the device itself is self describing, indicating whether it is set to
> bypass a host iommu or not. For L1->L2, well, that wouldn't be the
> first time qemu/VFIO plays tricks with the passed through device
> configuration space...

Which leaves the special case of Xen, where even preexisting devices
don't bypass the IOMMU.  Can we keep this specific to powerpc and
sparc?  On x86, this problem is basically nonexistent, since the IOMMU
is properly self-describing.

IOW, I think that on x86 we should assume that all virtio devices
honor the IOMMU.

>
> Note that the above can be solved via some kind of compromise: The
> device self describes the ability to honor the iommu, along with the
> property (or ACPI table entry) that indicates whether or not it does.
>
> IE. We could use the revision or ProgIf field of the config space for
> example. Or something in virtio config. If it's an "old" device, we
> know it always bypass. If it's a new device, we know it only bypasses
> if the corresponding property is in. I still would have to sort out the
> openbios case for mac among others but it's at least a workable
> direction.
>
> BTW. Don't you have a similar problem on x86 that today qemu claims
> that everything honors the iommu in ACPI ?

Only on a single experimental configuration, and that can apparently
just be fixed going forward without any real problems being caused.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Andy Lutomirski
On Mon, Nov 9, 2015 at 2:58 PM, Benjamin Herrenschmidt
 wrote:
> So ...
>
> I've finally tried to sort that out for powerpc and I can't find a way
> to make that work that isn't a complete pile of stinking shit.
>
> I'm very tempted to go back to my original idea: virtio itself should
> indicate it's "bypassing ability" via the virtio config space or some
> other bit (like the ProgIf of the PCI header etc...).
>
> I don't see how I can make it work otherwise.
>
> The problem with the statement "it's a platform matter" is that:
>
>   - It's not entirely correct. It's actually a feature of a specific
> virtio implementation (qemu's) that it bypasses all the platform IOMMU
> facilities.
>
>   - The platforms (on powerpc there's at least 3 or 4 that have qemu
> emulation and support some form of PCI) don't have an existing way to
> convey the information that a device bypasses the IOMMU (if any).
>
>   - Even if we add such a mechanism (new property in the device-tree),
> we end up with a big turd: Because we need to be compatible with older
> qemus, we essentially need a quirk that will make all virtio devices
> assume that property is present. That will of course break whenever we
> try to use another implementation of virtio on powerpc which doesn't
> bypass the iommu.
>
> We don't have a way to differenciate a virtio device that does the
> bypass from one that doesn't and the backward compatibility requirement
> forces us to treat all existing virtio devices as doing bypass.

The problem here is that in some of the problematic cases the virtio
driver may not even be loaded.  If someone runs an L1 guest with an
IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
*boom* L1 crashes.  (Same if, say, DPDK gets used, I think.)

>
> The only way out of this while keeping the "platform" stuff would be to
> also bump some kind of version in the virtio config (or PCI header). I
> have no other way to differenciate between "this is an old qemu that
> doesn't do the 'bypass property' yet" from "this is a virtio device
> that doesn't bypass".
>
> Any better idea ?

I'd suggest that, in the absence of the new DT binding, we assume that
any PCI device with the virtio vendor ID is passthrough on powerpc.  I
can do this in the virtio driver, but if it's in the platform code
then vfio gets it right too (i.e. fails to load).

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] s390/dma: Allow per device dma ops

2015-11-05 Thread Andy Lutomirski

On 11/05/2015 01:33 AM, Christian Borntraeger wrote:

Am 03.11.2015 um 13:26 schrieb Cornelia Huck:

On Tue,  3 Nov 2015 12:54:39 +0100
Christian Borntraeger  wrote:


As virtio-ccw now has dma ops, we can no longer default to the PCI ones.
Make use of dev_archdata to keep the dma_ops per device. The pci devices
now use that to override the default, and the default is changed to use
the noop ops for everything that is not PCI. To compile without PCI
support we also have to enable the DMA api with virtio.


Not only with virtio, but generally, right?


Yes, will update the patch description.



Signed-off-by: Christian Borntraeger 
Reviewed-by: Joerg Roedel 
Acked-by: Sebastian Ott 
---
  arch/s390/Kconfig   | 3 ++-
  arch/s390/include/asm/device.h  | 6 +-
  arch/s390/include/asm/dma-mapping.h | 6 --
  arch/s390/pci/pci.c | 1 +
  arch/s390/pci/pci_dma.c | 4 ++--
  5 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 1d57000..04f0e02 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -113,6 +113,7 @@ config S390
select GENERIC_FIND_FIRST_BIT
select GENERIC_SMP_IDLE_THREAD
select GENERIC_TIME_VSYSCALL
+   select HAS_DMA
select HAVE_ALIGNED_STRUCT_PAGE if SLUB
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_EARLY_PFN_TO_NID
@@ -124,6 +125,7 @@ config S390
select HAVE_CMPXCHG_DOUBLE
select HAVE_CMPXCHG_LOCAL
select HAVE_DEBUG_KMEMLEAK
+   select HAVE_DMA_ATTRS
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_FTRACE_MCOUNT_RECORD
@@ -580,7 +582,6 @@ config QDIO

  menuconfig PCI
bool "PCI support"
-   select HAVE_DMA_ATTRS
select PCI_MSI
help
  Enable PCI support.


Hm. Further down in this file, there's

config HAS_DMA
 def_bool PCI
 select HAVE_DMA_API_DEBUG

Should we maybe select HAVE_DMA_API_DEBUG above, drop the HAS_DMA
config option and rely on not defining NO_DMA instead?


Hmm, yes. That would simplify things a lot.  Right now we include
lib/Kconfig (which defines HAS_DMA) and define it ourselfes in
arch/s390/Kconfig. WHoever comes first wins. Adding a select statement
would make this even more complicated.

Andy, I will simply send you a respin of this patch.



I'm slightly concerned that I'm going to screw this all up and apply the 
wrong version.  Could you resend the whole series with git format-patch 
-vN for some appropriate N (or similar)?


Thanks,
Andy


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL v4 0/3] dma and virtio prep patches

2015-11-05 Thread Andy Lutomirski
On Thu, Nov 5, 2015 at 12:08 PM, Christian Borntraeger
 wrote:
> Andy,
>
> to make it obvious which version is the latest, here is a branch
>
> The following changes since commit 6a13feb9c82803e2b815eca72fa7a9f5561d7861:
>
>   Linux 4.3 (2015-11-01 16:05:25 -0800)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux.git  dma
>
> for you to fetch changes up to fc7f9754db6ce0c12281da4055281f731d36bdee:
>
>   s390/dma: Allow per device dma ops (2015-11-05 21:02:40 +0100)

Pulled, thanks.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC 0/4] dma ops and virtio

2015-11-04 Thread Andy Lutomirski
On Wed, Nov 4, 2015 at 9:52 AM, Cornelia Huck <cornelia.h...@de.ibm.com> wrote:
> On Wed, 4 Nov 2015 15:29:23 +0100
> Cornelia Huck <cornelia.h...@de.ibm.com> wrote:
>
>> I'm currently suspecting some endianness issues, probably with the ecw
>> accesses, which is why I'd be interested in your trace information (as
>> I currently don't have a LE test setup at hand.)
>
> I think I've got it. We have sense_data as a byte array, which
> implicitly makes it BE already. When we copy to the ecws while building
> the irb, the data ends up in 32 bit values. The conversion from host
> endianness to BE now treats them as LE on your system...
>
> Could you please give the following qemu patch a try?

Tested-by: Andy Lutomirski <l...@kernel.org>

Now my test script panics for the right reason (init isn't actually an
s390 binary).  Thanks!

I'll update my branch with the latest s390 DMA stuff, and now I can
give it at least basic regression testing.  Of course, I have to find
a root filesystem, too. :)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC 0/4] dma ops and virtio

2015-11-03 Thread Andy Lutomirski
On Tue, Nov 3, 2015 at 12:14 AM, Christian Borntraeger
<borntrae...@de.ibm.com> wrote:
> Am 02.11.2015 um 21:23 schrieb Andy Lutomirski:
>> On Mon, Nov 2, 2015 at 3:16 AM, Cornelia Huck <cornelia.h...@de.ibm.com> 
>> wrote:
>>> On Fri, 30 Oct 2015 13:33:07 -0700
>>> Andy Lutomirski <l...@amacapital.net> wrote:
>>>
>>>> On Fri, Oct 30, 2015 at 1:25 AM, Cornelia Huck <cornelia.h...@de.ibm.com> 
>>>> wrote:
>>>>> On Thu, 29 Oct 2015 15:50:38 -0700
>>>>> Andy Lutomirski <l...@amacapital.net> wrote:
>>>>>
>>>>>> Progress!  After getting that sort-of-working, I figured out what was
>>>>>> wrong with my earlier command, and I got that working, too.  Now I
>>>>>> get:
>>>>>>
>>>>>> qemu-system-s390x -fsdev
>>>>>> local,id=virtfs1,path=/,security_model=none,readonly -device
>>>>>> virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio
>>>>>> -nodefaults -device sclpconsole,chardev=console -parallel none -net
>>>>>> none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on
>>>>>> -serial chardev:console -mon chardev=console -vga none -display none
>>>>>> -kernel arch/s390/boot/bzImage -append
>>>>>> 'init=/home/luto/devel/virtme/virtme/guest/virtme-init
>>>>>> psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8"
>>>>>> TERM=xterm-256color rootfstype=9p
>>>>>> rootflags=ro,version=9p2000.L,trans=virtio,access=any
>>>>>> raid=noautodetect debug'
>>>>>
>>>>> The commandline looks sane AFAICS.
>>>>>
>>>>> (...)
>>>>>
>>>>>> vrfy: device 0.0.: rc=0 pgroup=0 mpath=0 vpm=80
>>>>>> virtio_ccw 0.0.: Failed to set online: -5
>>>>>>
>>>>>> ^^^ bad news!
>>>>>
>>>>> I'd like to see where in the onlining process this fails. Could you set
>>>>> up qemu tracing for css_* and virtio_ccw_* (instructions in
>>>>> qemu/docs/tracing.txt)?
>>>>
>>>> I have a file called events that contains:
>>>>
>>>> css_*
>>>> virtio_ccw_*
>>>>
>>>> pointing -trace events= at it results in a trace- file that's 549
>>>> bytes long and contains nothing.  Are wildcards not as well-supported
>>>> as the docs suggest?
>>>
>>> Just tried it, seemed to work for me as expected. And as your messages
>>> indicate, at least some of the css tracepoints are guaranteed to be
>>> hit. Odd.
>>>
>>> Can you try the following sophisticated printf debug patch?
>>>
>>> diff --git a/hw/s390x/css.c b/hw/s390x/css.c
>>> index c033612..6a87bd6 100644
>>> --- a/hw/s390x/css.c
>>> +++ b/hw/s390x/css.c
>>> @@ -308,6 +308,8 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr 
>>> ccw_addr)
>>>  sch->ccw_no_data_cnt++;
>>>  }
>>>
>>> +fprintf(stderr, "CH DBG: %s: cmd_code=%x\n", __func__, ccw.cmd_code);
>>> +
>>>  /* Look at the command. */
>>>  switch (ccw.cmd_code) {
>>>  case CCW_CMD_NOOP:
>>> @@ -375,6 +377,7 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr 
>>> ccw_addr)
>>>  }
>>>  break;
>>>  }
>>> +fprintf(stderr, "CH DBG: %s: ret=%d\n", __func__, ret);
>>>  sch->last_cmd = ccw;
>>>  sch->last_cmd_valid = true;
>>>  if (ret == 0) {
>>>
>>>
>>>>> Which qemu version is this, btw.?
>>>>>
>>>>
>>>> git from yesterday.
>>>
>>> Hm. Might be worth trying the s390-ccw-virtio-2.4 machine instead.
>>>
>>
>> No change.
>>
>> With s390-ccw-virtio-2.4, I get:
>>
>> Initializing cgroup subsys cpuset
>> Initializing cgroup subsys cpu
>> Initializing cgroup subsys cpuacct
>> Linux version 4.3.0-rc7-8-gff230d6ec6b2
>> (l...@amaluto.corp.amacapital.net) (gcc version 5.1.1 20150618 (Red
>> Hat Cross 5.1.1-3) (GCC) ) #344 SMP Fri Oct 30 13:16:13 PDT 2015
>> setup: Linux is running under KVM in 64-bit mode
>> setup: Max memory size: 128MB
>> Zone ranges:
>>   DMA  [mem 0x-0x7fff]
>>   Normal   empty
>> Movable zone

Re: [PATCH/RFC 0/4] dma ops and virtio

2015-11-03 Thread Andy Lutomirski
On Tue, Nov 3, 2015 at 9:59 AM, Cornelia Huck  wrote:
> It's just failing very early in the setup phase. As it works for me
> with a kvm setup, I'm suspecting some error in qemu's emulation code,
> which is unfortunately not my turf.
>

That should be easy to rule out.  Can you try with -machine accel=tcg?
 I can't test with kvm for obvious reasons.

> Some more poke-around-in-the-dark ideas:
>
> - Do you get more debug out put when you switch back to s390-ccw-virtio
> (virtio-1), i.e. does cmd 83 work and is it followed by further
> commands?

I'll play with this stuff later today.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC 0/4] dma ops and virtio

2015-11-02 Thread Andy Lutomirski
On Mon, Nov 2, 2015 at 3:16 AM, Cornelia Huck <cornelia.h...@de.ibm.com> wrote:
> On Fri, 30 Oct 2015 13:33:07 -0700
> Andy Lutomirski <l...@amacapital.net> wrote:
>
>> On Fri, Oct 30, 2015 at 1:25 AM, Cornelia Huck <cornelia.h...@de.ibm.com> 
>> wrote:
>> > On Thu, 29 Oct 2015 15:50:38 -0700
>> > Andy Lutomirski <l...@amacapital.net> wrote:
>> >
>> >> Progress!  After getting that sort-of-working, I figured out what was
>> >> wrong with my earlier command, and I got that working, too.  Now I
>> >> get:
>> >>
>> >> qemu-system-s390x -fsdev
>> >> local,id=virtfs1,path=/,security_model=none,readonly -device
>> >> virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio
>> >> -nodefaults -device sclpconsole,chardev=console -parallel none -net
>> >> none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on
>> >> -serial chardev:console -mon chardev=console -vga none -display none
>> >> -kernel arch/s390/boot/bzImage -append
>> >> 'init=/home/luto/devel/virtme/virtme/guest/virtme-init
>> >> psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8"
>> >> TERM=xterm-256color rootfstype=9p
>> >> rootflags=ro,version=9p2000.L,trans=virtio,access=any
>> >> raid=noautodetect debug'
>> >
>> > The commandline looks sane AFAICS.
>> >
>> > (...)
>> >
>> >> vrfy: device 0.0.: rc=0 pgroup=0 mpath=0 vpm=80
>> >> virtio_ccw 0.0.: Failed to set online: -5
>> >>
>> >> ^^^ bad news!
>> >
>> > I'd like to see where in the onlining process this fails. Could you set
>> > up qemu tracing for css_* and virtio_ccw_* (instructions in
>> > qemu/docs/tracing.txt)?
>>
>> I have a file called events that contains:
>>
>> css_*
>> virtio_ccw_*
>>
>> pointing -trace events= at it results in a trace- file that's 549
>> bytes long and contains nothing.  Are wildcards not as well-supported
>> as the docs suggest?
>
> Just tried it, seemed to work for me as expected. And as your messages
> indicate, at least some of the css tracepoints are guaranteed to be
> hit. Odd.
>
> Can you try the following sophisticated printf debug patch?
>
> diff --git a/hw/s390x/css.c b/hw/s390x/css.c
> index c033612..6a87bd6 100644
> --- a/hw/s390x/css.c
> +++ b/hw/s390x/css.c
> @@ -308,6 +308,8 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr 
> ccw_addr)
>  sch->ccw_no_data_cnt++;
>  }
>
> +fprintf(stderr, "CH DBG: %s: cmd_code=%x\n", __func__, ccw.cmd_code);
> +
>  /* Look at the command. */
>  switch (ccw.cmd_code) {
>  case CCW_CMD_NOOP:
> @@ -375,6 +377,7 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr 
> ccw_addr)
>  }
>  break;
>  }
> +fprintf(stderr, "CH DBG: %s: ret=%d\n", __func__, ret);
>  sch->last_cmd = ccw;
>  sch->last_cmd_valid = true;
>  if (ret == 0) {
>
>
>> > Which qemu version is this, btw.?
>> >
>>
>> git from yesterday.
>
> Hm. Might be worth trying the s390-ccw-virtio-2.4 machine instead.
>

No change.

With s390-ccw-virtio-2.4, I get:

Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Initializing cgroup subsys cpuacct
Linux version 4.3.0-rc7-8-gff230d6ec6b2
(l...@amaluto.corp.amacapital.net) (gcc version 5.1.1 20150618 (Red
Hat Cross 5.1.1-3) (GCC) ) #344 SMP Fri Oct 30 13:16:13 PDT 2015
setup: Linux is running under KVM in 64-bit mode
setup: Max memory size: 128MB
Zone ranges:
  DMA  [mem 0x-0x7fff]
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x-0x07ff]
Initmem setup node 0 [mem 0x-0x07ff]
On node 0 totalpages: 32768
  DMA zone: 512 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 32768 pages, LIFO batch:7
PERCPU: Embedded 466 pages/cpu @07605000 s1868032 r8192 d32512 u1908736
pcpu-alloc: s1868032 r8192 d32512 u1908736 alloc=466*4096
pcpu-alloc: [0] 0 [0] 1
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 32256
Kernel command line:
init=/home/luto/devel/virtme/virtme/guest/virtme-init
psmouse.proto=exps "virtme_stty_con=rows 45 cols 150 iutf8"
TERM=xterm-256color rootfstype=9p
rootflags=version=9p2000.L,trans=virtio,access=any raid=noautodetect
ro debug
PID hash table entries: 512 (order: 0, 4096 bytes)
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536

Re: [PATCH/RFC 0/4] dma ops and virtio

2015-10-30 Thread Andy Lutomirski
On Fri, Oct 30, 2015 at 1:25 AM, Cornelia Huck <cornelia.h...@de.ibm.com> wrote:
> On Thu, 29 Oct 2015 15:50:38 -0700
> Andy Lutomirski <l...@amacapital.net> wrote:
>
>> Progress!  After getting that sort-of-working, I figured out what was
>> wrong with my earlier command, and I got that working, too.  Now I
>> get:
>>
>> qemu-system-s390x -fsdev
>> local,id=virtfs1,path=/,security_model=none,readonly -device
>> virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio
>> -nodefaults -device sclpconsole,chardev=console -parallel none -net
>> none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on
>> -serial chardev:console -mon chardev=console -vga none -display none
>> -kernel arch/s390/boot/bzImage -append
>> 'init=/home/luto/devel/virtme/virtme/guest/virtme-init
>> psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8"
>> TERM=xterm-256color rootfstype=9p
>> rootflags=ro,version=9p2000.L,trans=virtio,access=any
>> raid=noautodetect debug'
>
> The commandline looks sane AFAICS.
>
> (...)
>
>> vrfy: device 0.0.: rc=0 pgroup=0 mpath=0 vpm=80
>> virtio_ccw 0.0.: Failed to set online: -5
>>
>> ^^^ bad news!
>
> I'd like to see where in the onlining process this fails. Could you set
> up qemu tracing for css_* and virtio_ccw_* (instructions in
> qemu/docs/tracing.txt)?

I have a file called events that contains:

css_*
virtio_ccw_*

pointing -trace events= at it results in a trace- file that's 549
bytes long and contains nothing.  Are wildcards not as well-supported
as the docs suggest?

>
> Which qemu version is this, btw.?
>

git from yesterday.

--Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 1/6] virtio-net: Stop doing DMA from the stack

2015-10-30 Thread Andy Lutomirski
On Fri, Oct 30, 2015 at 6:55 AM, Christian Borntraeger
<borntrae...@de.ibm.com> wrote:
> Am 30.10.2015 um 02:09 schrieb Andy Lutomirski:
>> From: "Michael S. Tsirkin" <m...@redhat.com>
>>
>> Once virtio starts using the DMA API, we won't be able to safely DMA
>> from the stack.  virtio-net does a couple of config DMA requests
>> from small stack buffers -- switch to using dynamically-allocated
>> memory.
>>
>> This should have no effect on any performance-critical code paths.
>>
>> [I wrote the subject and commit message.  mst wrote the code. --luto]
>>
>> Signed-off-by: Andy Lutomirski <l...@kernel.org>
>> signed-off-by: Michael S. Tsirkin <m...@redhat.com>
>
> I still get an error when using multiqueue:
>
> #  ethtool -L eth0 combined 4
> [   33.534686] virtio_ccw 0.0.000d: DMA-API: device driver maps memory from 
> stack [addr=629e7c06]

Fixed in my branch, I think.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 2/6] virtio_ring: Support DMA APIs

2015-10-30 Thread Andy Lutomirski
On Fri, Oct 30, 2015 at 5:05 AM, Christian Borntraeger
<borntrae...@de.ibm.com> wrote:
> Am 30.10.2015 um 13:01 schrieb Cornelia Huck:
>> On Thu, 29 Oct 2015 18:09:47 -0700
>> Andy Lutomirski <l...@kernel.org> wrote:
>>
>>> virtio_ring currently sends the device (usually a hypervisor)
>>> physical addresses of its I/O buffers.  This is okay when DMA
>>> addresses and physical addresses are the same thing, but this isn't
>>> always the case.  For example, this never works on Xen guests, and
>>> it is likely to fail if a physical "virtio" device ever ends up
>>> behind an IOMMU or swiotlb.
>>>
>>> The immediate use case for me is to enable virtio on Xen guests.
>>> For that to work, we need vring to support DMA address translation
>>> as well as a corresponding change to virtio_pci or to another
>>> driver.
>>>
>>> With this patch, if enabled, virtfs survives kmemleak and
>>> CONFIG_DMA_API_DEBUG.
>>>
>>> Signed-off-by: Andy Lutomirski <l...@kernel.org>
>>> ---
>>>  drivers/virtio/Kconfig   |   2 +-
>>>  drivers/virtio/virtio_ring.c | 190 
>>> +++
>>>  tools/virtio/linux/dma-mapping.h |  17 
>>>  3 files changed, 172 insertions(+), 37 deletions(-)
>>>  create mode 100644 tools/virtio/linux/dma-mapping.h
>>
>>>  static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
>>>  {
>>> -unsigned int i;
>>> +unsigned int i, j;
>>> +u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
>>>
>>>  /* Clear data ptr. */
>>> -vq->data[head] = NULL;
>>> +vq->desc_state[head].data = NULL;
>>>
>>> -/* Put back on free list: find end */
>>> +/* Put back on free list: unmap first-level descriptors and find end */
>>>  i = head;
>>>
>>> -/* Free the indirect table */
>>> -if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, 
>>> VRING_DESC_F_INDIRECT))
>>> -kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, 
>>> vq->vring.desc[i].addr)));
>>> -
>>> -while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, 
>>> VRING_DESC_F_NEXT)) {
>>> +while (vq->vring.desc[i].flags & nextflag) {
>>> +vring_unmap_one(vq, >vring.desc[i]);
>>>  i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
>>>  vq->vq.num_free++;
>>>  }
>>>
>>> +vring_unmap_one(vq, >vring.desc[i]);
>>>  vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
>>>  vq->free_head = head;
>>> +
>>>  /* Plus final descriptor */
>>>  vq->vq.num_free++;
>>> +
>>> +/* Free the indirect table, if any, now that it's unmapped. */
>>> +if (vq->desc_state[head].indir_desc) {
>>> +struct vring_desc *indir_desc = 
>>> vq->desc_state[head].indir_desc;
>>> +u32 len = vq->vring.desc[head].len;
>>
>> This one needs to be virtio32_to_cpu(...) as well.
>
> Yes, just did the exact same change
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index f269e1c..f2249df 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -556,7 +556,7 @@ static void detach_buf(struct vring_virtqueue *vq, 
> unsigned int head)
> /* Free the indirect table, if any, now that it's unmapped. */
> if (vq->desc_state[head].indir_desc) {
> struct vring_desc *indir_desc = 
> vq->desc_state[head].indir_desc;
> -   u32 len = vq->vring.desc[head].len;
> +   u32 len = virtio32_to_cpu(vq->vq.vdev, 
> vq->vring.desc[head].len);
>
> BUG_ON(!(vq->vring.desc[head].flags &
>  cpu_to_virtio16(vq->vq.vdev, 
> VRING_DESC_F_INDIRECT)));
>
>
> now it boots.

Thanks!  I applied this to my tree.  I won't send a new version quite
yet, though, to reduce inbox load.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-29 Thread Andy Lutomirski
On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
 wrote:
>
> On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> >
> > > On Power, I generally have 2 IOMMU windows for a device, one at the
> > > bottom is remapped, and is generally used for 32-bit devices and the
> > > one at the top us setup as a bypass
> >
> > So in the normal case of decent 64-bit devices (and not in a VM),
> > they'll *already* be using the bypass region and have full access to
> > all of memory, all of the time? And you have no protection against
> > driver and firmware bugs causing stray DMA?
>
> Correct, we chose to do that for performance reasons.

Could this be mitigated using pools?  I don't know if the net code
would play along easily.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC 0/4] dma ops and virtio

2015-10-29 Thread Andy Lutomirski
On Wed, Oct 28, 2015 at 5:04 PM, Christian Borntraeger
<borntrae...@de.ibm.com> wrote:
> Am 29.10.2015 um 07:22 schrieb Andy Lutomirski:
>> On Tue, Oct 27, 2015 at 3:48 PM, Christian Borntraeger
>> <borntrae...@de.ibm.com> wrote:
>>> This is an RFC to check if I am on the right track.  There
>>> are some attempts to unify the dma ops (Christoph) as well
>>> as some attempts to make virtio use the dma API (Andy).
>>>
>>> At todays discussion at the kernel summit, we concluded that
>>> we want to use the same code on all platforms, whereever
>>> possible, so having a dummy dma_op might be the easiest
>>> solution to keep virtio-ccw as similar as possible to
>>> virtio-pci. Andy Lutomirski will rework his patchset to
>>> unconditionally use the dma ops.  We will also need a
>>> compatibility quirk for powerpc to bypass the iommu mappings
>>> on older QEMU version (which translates to all versions as
>>> of today) and per device, e.g.  via device tree.  Ben
>>> Herrenschmidt will look into that.
>>
>> The combination of these patches plus my series don't link for me
>> unless I enable PCI.  Would s390 need to select HAS_DMA from VIRTIO or
>> something similar?
>
> Well, actually this is a potential improvement for series. I could just
> make the noop dma ops default for _all_ devices unless it has a per
> device dma_ops (e.g. s390 pci) and the unconditionally set HAS_DMA.
>>
>> Also, is it possible to use QEMU to emulate an s390x?  Even just:
>>
>> qemu-system-s390x -M s390-ccw-virtio
>
> Yes, we have no interactive bios and if no boot device is there is bios
> will load a disabled wait, which will exit qemu.
>
> Make sure to compile your kernel for z900  (processor type) as qemu does not
> handle all things of newer processors.
> You can then do
> qemu-system-s390x -nographic -m 256 -kernel vmlinux -initrd 
>

Progress!  After getting that sort-of-working, I figured out what was
wrong with my earlier command, and I got that working, too.  Now I
get:

qemu-system-s390x -fsdev
local,id=virtfs1,path=/,security_model=none,readonly -device
virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio
-nodefaults -device sclpconsole,chardev=console -parallel none -net
none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on
-serial chardev:console -mon chardev=console -vga none -display none
-kernel arch/s390/boot/bzImage -append
'init=/home/luto/devel/virtme/virtme/guest/virtme-init
psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8"
TERM=xterm-256color rootfstype=9p
rootflags=ro,version=9p2000.L,trans=virtio,access=any
raid=noautodetect debug'
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Initializing cgroup subsys cpuacct
Linux version 4.3.0-rc7-00403-ga2b5cd810259-dirty
(l...@amaluto.corp.amacapital.net) (gcc version 5.1.1 20150618 (Red
Hat Cross 5.1.1-3) (GCC) ) #328 SMP Thu Oct 29 15:46:05 PDT 2015
setup: Linux is running under KVM in 64-bit mode
setup: Max memory size: 128MB
Zone ranges:
  DMA  [mem 0x-0x7fff]
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x-0x07ff]
Initmem setup node 0 [mem 0x-0x07ff]
On node 0 totalpages: 32768
  DMA zone: 512 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 32768 pages, LIFO batch:7
PERCPU: Embedded 466 pages/cpu @07605000 s1868032 r8192 d32512 u1908736
pcpu-alloc: s1868032 r8192 d32512 u1908736 alloc=466*4096
pcpu-alloc: [0] 0 [0] 1
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 32256
Kernel command line:
init=/home/luto/devel/virtme/virtme/guest/virtme-init
psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8"
TERM=xterm-256color rootfstype=9p
rootflags=ro,version=9p2000.L,trans=virtio,access=any
raid=noautodetect debug
PID hash table entries: 512 (order: 0, 4096 bytes)
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Memory: 92552K/131072K available (8229K kernel code, 798K rwdata,
3856K rodata, 2384K init, 14382K bss, 38520K reserved, 0K
cma-reserved)
Write protected kernel read-only data: 0x10 - 0xccdfff
SLUB: HWalign=256, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
Running RCU self tests
Hierarchical RCU implementation.
RCU debugfs-based tracing is enabled.
RCU lockdep checking is enabled.
Build-time adjustment of leaf fanout to 64.
RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=2.
RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=2
NR_IRQS:3
clocksource: tod: mask: 0x max_cycles:
0x3b0a9be803b0a9, max_idle_ns: 1805497147909793 ns
console [ttyS1] enabled
L

[PATCH v4 5/6] virtio_mmio: Use the DMA API

2015-10-29 Thread Andy Lutomirski
This switches to vring_create_virtqueue, simplifying the driver and
adding DMA API support.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 drivers/virtio/virtio_mmio.c | 67 ++--
 1 file changed, 15 insertions(+), 52 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index f499d9da7237..2b9fab52a3cb 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -99,12 +99,6 @@ struct virtio_mmio_vq_info {
/* the actual virtqueue */
struct virtqueue *vq;
 
-   /* the number of entries in the queue */
-   unsigned int num;
-
-   /* the virtual address of the ring queue */
-   void *queue;
-
/* the list node for the virtqueues list */
struct list_head node;
 };
@@ -322,15 +316,13 @@ static void vm_del_vq(struct virtqueue *vq)
 {
struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev);
struct virtio_mmio_vq_info *info = vq->priv;
-   unsigned long flags, size;
+   unsigned long flags;
unsigned int index = vq->index;
 
spin_lock_irqsave(_dev->lock, flags);
list_del(>node);
spin_unlock_irqrestore(_dev->lock, flags);
 
-   vring_del_virtqueue(vq);
-
/* Select and deactivate the queue */
writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
if (vm_dev->version == 1) {
@@ -340,8 +332,8 @@ static void vm_del_vq(struct virtqueue *vq)
WARN_ON(readl(vm_dev->base + VIRTIO_MMIO_QUEUE_READY));
}
 
-   size = PAGE_ALIGN(vring_size(info->num, VIRTIO_MMIO_VRING_ALIGN));
-   free_pages_exact(info->queue, size);
+   vring_del_virtqueue(vq);
+
kfree(info);
 }
 
@@ -356,8 +348,6 @@ static void vm_del_vqs(struct virtio_device *vdev)
free_irq(platform_get_irq(vm_dev->pdev, 0), vm_dev);
 }
 
-
-
 static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned 
index,
  void (*callback)(struct virtqueue *vq),
  const char *name)
@@ -365,7 +355,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device 
*vdev, unsigned index,
struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
struct virtio_mmio_vq_info *info;
struct virtqueue *vq;
-   unsigned long flags, size;
+   unsigned long flags;
+   unsigned int num;
int err;
 
if (!name)
@@ -388,66 +379,40 @@ static struct virtqueue *vm_setup_vq(struct virtio_device 
*vdev, unsigned index,
goto error_kmalloc;
}
 
-   /* Allocate pages for the queue - start with a queue as big as
-* possible (limited by maximum size allowed by device), drop down
-* to a minimal size, just big enough to fit descriptor table
-* and two rings (which makes it "alignment_size * 2")
-*/
-   info->num = readl(vm_dev->base + VIRTIO_MMIO_QUEUE_NUM_MAX);
-
-   /* If the device reports a 0 entry queue, we won't be able to
-* use it to perform I/O, and vring_new_virtqueue() can't create
-* empty queues anyway, so don't bother to set up the device.
-*/
-   if (info->num == 0) {
+   num = readl(vm_dev->base + VIRTIO_MMIO_QUEUE_NUM_MAX);
+   if (num == 0) {
err = -ENOENT;
-   goto error_alloc_pages;
-   }
-
-   while (1) {
-   size = PAGE_ALIGN(vring_size(info->num,
-   VIRTIO_MMIO_VRING_ALIGN));
-   /* Did the last iter shrink the queue below minimum size? */
-   if (size < VIRTIO_MMIO_VRING_ALIGN * 2) {
-   err = -ENOMEM;
-   goto error_alloc_pages;
-   }
-
-   info->queue = alloc_pages_exact(size, GFP_KERNEL | __GFP_ZERO);
-   if (info->queue)
-   break;
-
-   info->num /= 2;
+   goto error_new_virtqueue;
}
 
/* Create the vring */
-   vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN, 
vdev,
-true, info->queue, vm_notify, callback, name);
+   vq = vring_create_virtqueue(index, num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+true, true, vm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto error_new_virtqueue;
}
 
/* Activate the queue */
-   writel(info->num, vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
+   writel(virtqueue_get_vring_size(vq), vm_dev->base + 
VIRTIO_MMIO_QUEUE_NUM);
if (vm_dev->version == 1) {
writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN);
-   writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
+   writel(virtqueue_get_desc_

[PATCH v4 6/6] virtio_pci: Use the DMA API

2015-10-29 Thread Andy Lutomirski
This switches to vring_create_virtqueue, simplifying the driver and
adding DMA API support.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 drivers/virtio/virtio_pci_common.h |  7 -
 drivers/virtio/virtio_pci_legacy.c | 39 +++-
 drivers/virtio/virtio_pci_modern.c | 61 ++
 3 files changed, 19 insertions(+), 88 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index cd6196b513ad..1a3c689d1b9e 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -35,13 +35,6 @@ struct virtio_pci_vq_info {
/* the actual virtqueue */
struct virtqueue *vq;
 
-   /* the number of entries in the queue */
-   int num;
-
-   /* the ring queue */
-   void *queue;
-   dma_addr_t queue_dma_addr;  /* bus address */
-
/* the list node for the virtqueues list */
struct list_head node;
 
diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index b5293e5f2af4..8c4e61783441 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -119,7 +119,6 @@ static struct virtqueue *setup_vq(struct virtio_pci_device 
*vp_dev,
  u16 msix_vec)
 {
struct virtqueue *vq;
-   unsigned long size;
u16 num;
int err;
 
@@ -131,29 +130,19 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
if (!num || ioread32(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN))
return ERR_PTR(-ENOENT);
 
-   info->num = num;
info->msix_vector = msix_vec;
 
-   size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-   info->queue = dma_zalloc_coherent(_dev->pci_dev->dev, size,
- >queue_dma_addr,
- GFP_KERNEL);
-   if (info->queue == NULL)
+   /* create the vring */
+   vq = vring_create_virtqueue(index, num,
+   VIRTIO_PCI_VRING_ALIGN, _dev->vdev,
+   true, false, vp_notify, callback, name);
+   if (!vq)
return ERR_PTR(-ENOMEM);
 
/* activate the queue */
-   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+   iowrite32(virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
-   /* create the vring */
-   vq = vring_new_virtqueue(index, info->num,
-VIRTIO_PCI_VRING_ALIGN, _dev->vdev,
-true, info->queue, vp_notify, callback, name);
-   if (!vq) {
-   err = -ENOMEM;
-   goto out_activate_queue;
-   }
-
vq->priv = (void __force *)vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY;
 
if (msix_vec != VIRTIO_MSI_NO_VECTOR) {
@@ -161,18 +150,15 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
msix_vec = ioread16(vp_dev->ioaddr + VIRTIO_MSI_QUEUE_VECTOR);
if (msix_vec == VIRTIO_MSI_NO_VECTOR) {
err = -EBUSY;
-   goto out_assign;
+   goto out_deactivate;
}
}
 
return vq;
 
-out_assign:
-   vring_del_virtqueue(vq);
-out_activate_queue:
+out_deactivate:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-   dma_free_coherent(_dev->pci_dev->dev, size,
- info->queue, info->queue_dma_addr);
+   vring_del_virtqueue(vq);
return ERR_PTR(err);
 }
 
@@ -180,7 +166,6 @@ static void del_vq(struct virtio_pci_vq_info *info)
 {
struct virtqueue *vq = info->vq;
struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-   unsigned long size;
 
iowrite16(vq->index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
@@ -191,14 +176,10 @@ static void del_vq(struct virtio_pci_vq_info *info)
ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR);
}
 
-   vring_del_virtqueue(vq);
-
/* Select and deactivate the queue */
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
-   size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-   dma_free_coherent(_dev->pci_dev->dev, size,
- info->queue, info->queue_dma_addr);
+   vring_del_virtqueue(vq);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index fbe0bd1c4881..50b0cd5a501e 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -287,35 +287,6 @@ static u16 vp_config_vector(struct virtio_pci_device 
*vp_dev, u16 vector)
return vp_io

[PATCH v4 1/6] virtio-net: Stop doing DMA from the stack

2015-10-29 Thread Andy Lutomirski
From: "Michael S. Tsirkin" <m...@redhat.com>

Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack.  virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.

This should have no effect on any performance-critical code paths.

[I wrote the subject and commit message.  mst wrote the code. --luto]

Signed-off-by: Andy Lutomirski <l...@kernel.org>
signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 drivers/net/virtio_net.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d8838dedb7a4..f94ab786088f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -140,6 +140,12 @@ struct virtnet_info {
 
/* CPU hot plug notifier */
struct notifier_block nb;
+
+   /* Control VQ buffers: protected by the rtnl lock */
+   struct virtio_net_ctrl_hdr ctrl_hdr;
+   virtio_net_ctrl_ack ctrl_status;
+   u8 ctrl_promisc;
+   u8 ctrl_allmulti;
 };
 
 struct padded_vnet_hdr {
@@ -976,31 +982,30 @@ static bool virtnet_send_command(struct virtnet_info *vi, 
u8 class, u8 cmd,
 struct scatterlist *out)
 {
struct scatterlist *sgs[4], hdr, stat;
-   struct virtio_net_ctrl_hdr ctrl;
-   virtio_net_ctrl_ack status = ~0;
unsigned out_num = 0, tmp;
 
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-   ctrl.class = class;
-   ctrl.cmd = cmd;
+   vi->ctrl_status = ~0;
+   vi->ctrl_hdr.class = class;
+   vi->ctrl_hdr.cmd = cmd;
/* Add header */
-   sg_init_one(, , sizeof(ctrl));
+   sg_init_one(, >ctrl_hdr, sizeof(vi->ctrl_hdr));
sgs[out_num++] = 
 
if (out)
sgs[out_num++] = out;
 
/* Add return status. */
-   sg_init_one(, , sizeof(status));
+   sg_init_one(, >ctrl_status, sizeof(vi->ctrl_status));
sgs[out_num] = 
 
BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
if (unlikely(!virtqueue_kick(vi->cvq)))
-   return status == VIRTIO_NET_OK;
+   return vi->ctrl_status == VIRTIO_NET_OK;
 
/* Spin for a response, the kick causes an ioport write, trapping
 * into the hypervisor, so the request should be handled immediately.
@@ -1009,7 +1014,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, 
u8 class, u8 cmd,
   !virtqueue_is_broken(vi->cvq))
cpu_relax();
 
-   return status == VIRTIO_NET_OK;
+   return vi->ctrl_status == VIRTIO_NET_OK;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1156,6 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg[2];
-   u8 promisc, allmulti;
struct virtio_net_ctrl_mac *mac_data;
struct netdev_hw_addr *ha;
int uc_count;
@@ -1163,22 +1167,22 @@ static void virtnet_set_rx_mode(struct net_device *dev)
if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
return;
 
-   promisc = ((dev->flags & IFF_PROMISC) != 0);
-   allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+   vi->ctrl_promisc = ((dev->flags & IFF_PROMISC) != 0);
+   vi->ctrl_allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
 
-   sg_init_one(sg, , sizeof(promisc));
+   sg_init_one(sg, >ctrl_promisc, sizeof(vi->ctrl_promisc));
 
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(>dev, "Failed to %sable promisc mode.\n",
-promisc ? "en" : "dis");
+vi->ctrl_promisc ? "en" : "dis");
 
-   sg_init_one(sg, , sizeof(allmulti));
+   sg_init_one(sg, >ctrl_allmulti, sizeof(vi->ctrl_allmulti));
 
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(>dev, "Failed to %sable allmulti mode.\n",
-allmulti ? "en" : "dis");
+vi->ctrl_allmulti ? "en" : "dis");
 
uc_count = netdev_uc_count(dev);
mc_count = netdev_mc_count(dev);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 3/6] virtio_pci: Use the DMA API

2015-10-29 Thread Andy Lutomirski
This fixes virtio-pci on platforms and busses that have IOMMUs.  This
will break the experimental QEMU Q35 IOMMU support until QEMU is
fixed.  In exchange, it fixes physical virtio hardware as well as
virtio-pci running under Xen.

We should clean up the virtqueue API to do its own allocation and
teach virtqueue_get_avail and virtqueue_get_used to return DMA
addresses directly.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 drivers/virtio/virtio_pci_common.h |  3 ++-
 drivers/virtio/virtio_pci_legacy.c | 19 +++
 drivers/virtio/virtio_pci_modern.c | 34 --
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index b976d968e793..cd6196b513ad 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -38,8 +38,9 @@ struct virtio_pci_vq_info {
/* the number of entries in the queue */
int num;
 
-   /* the virtual address of the ring queue */
+   /* the ring queue */
void *queue;
+   dma_addr_t queue_dma_addr;  /* bus address */
 
/* the list node for the virtqueues list */
struct list_head node;
diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index 48bc9797e530..b5293e5f2af4 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
info->msix_vector = msix_vec;
 
size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-   info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+   info->queue = dma_zalloc_coherent(_dev->pci_dev->dev, size,
+ >queue_dma_addr,
+ GFP_KERNEL);
if (info->queue == NULL)
return ERR_PTR(-ENOMEM);
 
/* activate the queue */
-   iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
@@ -169,7 +171,8 @@ out_assign:
vring_del_virtqueue(vq);
 out_activate_queue:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
return ERR_PTR(err);
 }
 
@@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info)
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device 
*vp_dev)
return -ENODEV;
}
 
+   rc = dma_set_mask_and_coherent(_dev->dev, DMA_BIT_MASK(64));
+   if (rc)
+   rc = dma_set_mask_and_coherent(_dev->dev,
+   DMA_BIT_MASK(32));
+   if (rc)
+   dev_warn(_dev->dev, "Failed to enable 64-bit or 32-bit DMA. 
 Trying to continue, but this might not work.\n");
+
rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
if (rc)
return rc;
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index 8e5cf194cc0b..fbe0bd1c4881 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num)
return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES));
 }
 
-static void *alloc_virtqueue_pages(int *num)
+static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev,
+  int *num, dma_addr_t *dma_addr)
 {
void *pages;
 
/* TODO: allocate each queue chunk individually */
for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) {
-   pages = alloc_pages_exact(vring_pci_size(*num),
- GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN);
+   pages = dma_zalloc_coherent(
+   _dev->pci_dev->dev, vring_pci_size(*num),
+   dma_addr, GFP_KERNEL|__GFP_NOWARN);
if (pages)
return pages;
}
@@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num)
return NULL;
 
/* Try to get a single page. You are my only hope! */
-   return alloc_pages_exact

[PATCH v4 4/6] virtio: Add improved queue allocation API

2015-10-29 Thread Andy Lutomirski
This leaves vring_new_virtqueue alone for compatbility, but it
adds two new improved APIs:

vring_create_virtqueue: Creates a virtqueue backed by automatically
allocated coherent memory.  (Some day it this could be extended to
support non-coherent memory, too, if there ends up being a platform
on which it's worthwhile.)

__vring_new_virtqueue: Creates a virtqueue with a manually-specified
layout.  This should allow mic_virtio to work much more cleanly.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 drivers/virtio/virtio_ring.c | 164 +++
 include/linux/virtio.h   |  23 +-
 include/linux/virtio_ring.h  |  35 +
 3 files changed, 190 insertions(+), 32 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index a872eb89587f..f269e1cba00c 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -91,6 +91,11 @@ struct vring_virtqueue
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
 
+   /* DMA, allocation, and size information */
+   bool we_own_ring;
+   size_t queue_size_in_bytes;
+   dma_addr_t queue_dma_addr;
+
 #ifdef DEBUG
/* They're supposed to lock for us. */
unsigned int in_use;
@@ -821,36 +826,31 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
 }
 EXPORT_SYMBOL_GPL(vring_interrupt);
 
-struct virtqueue *vring_new_virtqueue(unsigned int index,
- unsigned int num,
- unsigned int vring_align,
- struct virtio_device *vdev,
- bool weak_barriers,
- void *pages,
- bool (*notify)(struct virtqueue *),
- void (*callback)(struct virtqueue *),
- const char *name)
+struct virtqueue *__vring_new_virtqueue(unsigned int index,
+   struct vring vring,
+   struct virtio_device *vdev,
+   bool weak_barriers,
+   bool (*notify)(struct virtqueue *),
+   void (*callback)(struct virtqueue *),
+   const char *name)
 {
-   struct vring_virtqueue *vq;
unsigned int i;
+   struct vring_virtqueue *vq;
 
-   /* We assume num is a power of 2. */
-   if (num & (num - 1)) {
-   dev_warn(>dev, "Bad virtqueue length %u\n", num);
-   return NULL;
-   }
-
-   vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
+   vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
 GFP_KERNEL);
if (!vq)
return NULL;
 
-   vring_init(>vring, num, pages, vring_align);
+   vq->vring = vring;
vq->vq.callback = callback;
vq->vq.vdev = vdev;
vq->vq.name = name;
-   vq->vq.num_free = num;
+   vq->vq.num_free = vring.num;
vq->vq.index = index;
+   vq->we_own_ring = false;
+   vq->queue_dma_addr = 0;
+   vq->queue_size_in_bytes = 0;
vq->notify = notify;
vq->weak_barriers = weak_barriers;
vq->broken = false;
@@ -871,18 +871,105 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 
/* Put everything in free lists. */
vq->free_head = 0;
-   for (i = 0; i < num-1; i++)
+   for (i = 0; i < vring.num-1; i++)
vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
-   memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
+   memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
 
return >vq;
 }
+EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
+
+struct virtqueue *vring_create_virtqueue(
+   unsigned int index,
+   unsigned int num,
+   unsigned int vring_align,
+   struct virtio_device *vdev,
+   bool weak_barriers,
+   bool may_reduce_num,
+   bool (*notify)(struct virtqueue *),
+   void (*callback)(struct virtqueue *),
+   const char *name)
+{
+   struct virtqueue *vq;
+   void *queue;
+   dma_addr_t dma_addr;
+   size_t queue_size_in_bytes;
+   struct vring vring;
+
+   /* We assume num is a power of 2. */
+   if (num & (num - 1)) {
+   dev_warn(>dev, "Bad virtqueue length %u\n", num);
+   return NULL;
+   }
+
+   /* TODO: allocate each queue chunk individually */
+   for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
+   queue = dma_zalloc_coherent(
+   vdev->dev.par

[PATCH v4 0/6] virtio core DMA API conversion

2015-10-29 Thread Andy Lutomirski
This switches virtio to use the DMA API unconditionally.  I'm sure
it breaks things, but it seems to work on x86 using virtio-pci, with
and without Xen, and using both the modern 1.0 variant and the
legacy variant.

This appears to work on native and Xen x86_64 using both modern and
legacy virtio-pci.  It also appears to work on arm and arm64.

It definitely won't work as-is on s390x, and I haven't been able to
test Christian's patches because I can't get virtio-ccw to work in
QEMU at all.  I don't know what I'm doing wrong.

It doesn't work on ppc64.  Ben, consider yourself pinged to send me
a patch :)

It doesn't work on sparc64.  I didn't realize at Kernel Summit that
sparc64 has the same problem as ppc64.

DaveM, for background, we're trying to fix virtio to use the DMA
API.  That will require that every platform that uses virtio
supplies valid DMA operations on devices that use virtio_ring.
Unfortunately, QEMU historically ignores the IOMMU on virtio
devices.

On x86, this isn't really a problem.  x86 has a nice way for the
platform to describe which devices are behind an IOMMU, and QEMU
will be adjusted accordingly.  The only thing that will break is a
recently-added experimental mode.

Ben's plan for powerpc is to add a quirk for existing virtio-pci
devices and to eventually update the devicetree stuff to allow QEMU
to tell the guest which devices use the IOMMU.

AFAICT sparc has a similar problem to powerpc.  DaveM, can you come
up with a straightforward way to get sparc's DMA API to work
correctly for virtio-pci devices?

NB: Sadly, the platforms I've successfully tested on don't include any
big-endian platforms, so there could still be lurking endian problems.

Changes from v3:
 - More big-endian fixes.
 - Added better virtio-ring APIs that handle allocation and use them in
   virtio-mmio and virtio-pci.
 - Switch to Michael's virtio-net patch.

Changes from v2:
 - Fix vring_mapping_error incorrect argument

Changes from v1:
 - Fix an endian conversion error causing a BUG to hit.
 - Fix a DMA ordering issue (swiotlb=force works now).
 - Minor cleanups.

Andy Lutomirski (5):
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API
  virtio: Add improved queue allocation API
  virtio_mmio: Use the DMA API
  virtio_pci: Use the DMA API

Michael S. Tsirkin (1):
  virtio-net: Stop doing DMA from the stack

 drivers/net/virtio_net.c   |  34 ++--
 drivers/virtio/Kconfig |   2 +-
 drivers/virtio/virtio_mmio.c   |  67 ++-
 drivers/virtio/virtio_pci_common.h |   6 -
 drivers/virtio/virtio_pci_legacy.c |  42 ++---
 drivers/virtio/virtio_pci_modern.c |  61 ++-
 drivers/virtio/virtio_ring.c   | 348 ++---
 include/linux/virtio.h |  23 ++-
 include/linux/virtio_ring.h|  35 
 tools/virtio/linux/dma-mapping.h   |  17 ++
 10 files changed, 426 insertions(+), 209 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 2/6] virtio_ring: Support DMA APIs

2015-10-29 Thread Andy Lutomirski
virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 drivers/virtio/Kconfig   |   2 +-
 drivers/virtio/virtio_ring.c | 190 +++
 tools/virtio/linux/dma-mapping.h |  17 
 3 files changed, 172 insertions(+), 37 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cab9f3f63a38..77590320d44c 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -60,7 +60,7 @@ config VIRTIO_INPUT
 
  config VIRTIO_MMIO
tristate "Platform bus driver for memory mapped virtio devices"
-   depends on HAS_IOMEM
+   depends on HAS_IOMEM && HAS_DMA
select VIRTIO
---help---
 This drivers provides support for memory mapped virtio
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..a872eb89587f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,7 +55,14 @@
 #define END_USE(vq)
 #endif
 
-struct vring_virtqueue {
+struct vring_desc_state
+{
+   void *data; /* Data for callback. */
+   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+};
+
+struct vring_virtqueue
+{
struct virtqueue vq;
 
/* Actual memory layout for this queue */
@@ -92,12 +100,71 @@ struct vring_virtqueue {
ktime_t last_add_time;
 #endif
 
-   /* Tokens for callbacks. */
-   void *data[];
+   /* Per-descriptor state. */
+   struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/*
+ * The DMA ops on various arches are rather gnarly right now, and
+ * making all of the arch DMA ops work on the vring device itself
+ * is a mess.  For now, we use the parent device for DMA ops.
+ */
+struct device *vring_dma_dev(const struct vring_virtqueue *vq)
+{
+   return vq->vq.vdev->dev.parent;
+}
+
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+  struct scatterlist *sg,
+  enum dma_data_direction direction)
+{
+   /*
+* We can't use dma_map_sg, because we don't use scatterlists in
+* the way it expects (we don't guarantee that the scatterlist
+* will exist for the lifetime of the mapping).
+*/
+   return dma_map_page(vring_dma_dev(vq),
+   sg_page(sg), sg->offset, sg->length,
+   direction);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+  void *cpu_addr, size_t size,
+  enum dma_data_direction direction)
+{
+   return dma_map_single(vring_dma_dev(vq),
+ cpu_addr, size, direction);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+   struct vring_desc *desc)
+{
+   u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
 static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 unsigned int total_sg, gfp_t gfp)
 {
@

Re: [PATCH v2 1/3] virtio_net: Stop doing DMA from the stack

2015-10-29 Thread Andy Lutomirski
On Wed, Oct 28, 2015 at 12:07 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> How about this instead? Less code, more robust.
>
> Warning: untested.  If you do like this approach, Tested-by would be
> appreciated.

I like it.

Tested-by: Andy Lutomirski <l...@kernel.org>

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-10-29 Thread Andy Lutomirski
On Thu, Oct 29, 2015 at 6:09 PM, Andy Lutomirski <l...@kernel.org> wrote:
> This switches virtio to use the DMA API unconditionally.  I'm sure
> it breaks things, but it seems to work on x86 using virtio-pci, with
> and without Xen, and using both the modern 1.0 variant and the
> legacy variant.

...

> Andy Lutomirski (5):
>   virtio_ring: Support DMA APIs
>   virtio_pci: Use the DMA API
>   virtio: Add improved queue allocation API
>   virtio_mmio: Use the DMA API
>   virtio_pci: Use the DMA API

Ugh.  The two virtio_pci patches should be squashed together.  I'll do
that for v5, but I'm not going to send it until there's more feedback.

FWIW, I'm collecting this stuff here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_dma

That branch includes this series (with the squash) and the s390
patches.  I'll keep it up to date, since it seems silly to declare it
stable enough to stop rebasing yet.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 2/3] virtio_ring: Support DMA APIs

2015-10-28 Thread Andy Lutomirski
virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 drivers/virtio/Kconfig   |   2 +-
 drivers/virtio/virtio_ring.c | 187 +++
 tools/virtio/linux/dma-mapping.h |  17 
 3 files changed, 169 insertions(+), 37 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cab9f3f63a38..77590320d44c 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -60,7 +60,7 @@ config VIRTIO_INPUT
 
  config VIRTIO_MMIO
tristate "Platform bus driver for memory mapped virtio devices"
-   depends on HAS_IOMEM
+   depends on HAS_IOMEM && HAS_DMA
select VIRTIO
---help---
 This drivers provides support for memory mapped virtio
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..6962ea37ade0 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,7 +55,14 @@
 #define END_USE(vq)
 #endif
 
-struct vring_virtqueue {
+struct vring_desc_state
+{
+   void *data; /* Data for callback. */
+   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+};
+
+struct vring_virtqueue
+{
struct virtqueue vq;
 
/* Actual memory layout for this queue */
@@ -92,12 +100,71 @@ struct vring_virtqueue {
ktime_t last_add_time;
 #endif
 
-   /* Tokens for callbacks. */
-   void *data[];
+   /* Per-descriptor state. */
+   struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/*
+ * The DMA ops on various arches are rather gnarly right now, and
+ * making all of the arch DMA ops work on the vring device itself
+ * is a mess.  For now, we use the parent device for DMA ops.
+ */
+struct device *vring_dma_dev(const struct vring_virtqueue *vq)
+{
+   return vq->vq.vdev->dev.parent;
+}
+
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+  struct scatterlist *sg,
+  enum dma_data_direction direction)
+{
+   /*
+* We can't use dma_map_sg, because we don't use scatterlists in
+* the way it expects (we don't guarantee that the scatterlist
+* will exist for the lifetime of the mapping).
+*/
+   return dma_map_page(vring_dma_dev(vq),
+   sg_page(sg), sg->offset, sg->length,
+   direction);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+  void *cpu_addr, size_t size,
+  enum dma_data_direction direction)
+{
+   return dma_map_single(vring_dma_dev(vq),
+ cpu_addr, size, direction);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+   struct vring_desc *desc)
+{
+   u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
 static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 unsigned int total_sg, gfp_t gfp)
 {
@

[PATCH v3 3/3] virtio_pci: Use the DMA API

2015-10-28 Thread Andy Lutomirski
This fixes virtio-pci on platforms and busses that have IOMMUs.  This
will break the experimental QEMU Q35 IOMMU support until QEMU is
fixed.  In exchange, it fixes physical virtio hardware as well as
virtio-pci running under Xen.

We should clean up the virtqueue API to do its own allocation and
teach virtqueue_get_avail and virtqueue_get_used to return DMA
addresses directly.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 drivers/virtio/virtio_pci_common.h |  3 ++-
 drivers/virtio/virtio_pci_legacy.c | 19 +++
 drivers/virtio/virtio_pci_modern.c | 34 --
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index b976d968e793..cd6196b513ad 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -38,8 +38,9 @@ struct virtio_pci_vq_info {
/* the number of entries in the queue */
int num;
 
-   /* the virtual address of the ring queue */
+   /* the ring queue */
void *queue;
+   dma_addr_t queue_dma_addr;  /* bus address */
 
/* the list node for the virtqueues list */
struct list_head node;
diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index 48bc9797e530..b5293e5f2af4 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
info->msix_vector = msix_vec;
 
size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-   info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+   info->queue = dma_zalloc_coherent(_dev->pci_dev->dev, size,
+ >queue_dma_addr,
+ GFP_KERNEL);
if (info->queue == NULL)
return ERR_PTR(-ENOMEM);
 
/* activate the queue */
-   iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
@@ -169,7 +171,8 @@ out_assign:
vring_del_virtqueue(vq);
 out_activate_queue:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
return ERR_PTR(err);
 }
 
@@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info)
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device 
*vp_dev)
return -ENODEV;
}
 
+   rc = dma_set_mask_and_coherent(_dev->dev, DMA_BIT_MASK(64));
+   if (rc)
+   rc = dma_set_mask_and_coherent(_dev->dev,
+   DMA_BIT_MASK(32));
+   if (rc)
+   dev_warn(_dev->dev, "Failed to enable 64-bit or 32-bit DMA. 
 Trying to continue, but this might not work.\n");
+
rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
if (rc)
return rc;
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index 8e5cf194cc0b..fbe0bd1c4881 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num)
return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES));
 }
 
-static void *alloc_virtqueue_pages(int *num)
+static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev,
+  int *num, dma_addr_t *dma_addr)
 {
void *pages;
 
/* TODO: allocate each queue chunk individually */
for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) {
-   pages = alloc_pages_exact(vring_pci_size(*num),
- GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN);
+   pages = dma_zalloc_coherent(
+   _dev->pci_dev->dev, vring_pci_size(*num),
+   dma_addr, GFP_KERNEL|__GFP_NOWARN);
if (pages)
return pages;
}
@@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num)
return NULL;
 
/* Try to get a single page. You are my only hope! */
-   return alloc_pages_exact

[PATCH v3 1/3] virtio_net: Stop doing DMA from the stack

2015-10-28 Thread Andy Lutomirski
From: Andy Lutomirski <l...@amacapital.net>

Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack.  virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.

This should have no effect on any performance-critical code paths.

Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: virtualizat...@lists.linux-foundation.org
Reviewed-by: Joerg Roedel <jroe...@suse.de>
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---

Hi Michael and DaveM-

This is a prerequisite for the virtio DMA fixing project.  It works
as a standalone patch, though.  Would it make sense to apply it to
an appropriate networking tree now?

(This is unchanged from v2.)

drivers/net/virtio_net.c | 53 
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d8838dedb7a4..4f10f8a58811 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, 
u8 class, u8 cmd,
 struct scatterlist *out)
 {
struct scatterlist *sgs[4], hdr, stat;
-   struct virtio_net_ctrl_hdr ctrl;
-   virtio_net_ctrl_ack status = ~0;
+
+   struct {
+   struct virtio_net_ctrl_hdr ctrl;
+   virtio_net_ctrl_ack status;
+   } *buf;
+
unsigned out_num = 0, tmp;
+   bool ret;
 
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-   ctrl.class = class;
-   ctrl.cmd = cmd;
+   buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
+   if (!buf)
+   return false;
+   buf->status = ~0;
+
+   buf->ctrl.class = class;
+   buf->ctrl.cmd = cmd;
/* Add header */
-   sg_init_one(, , sizeof(ctrl));
+   sg_init_one(, >ctrl, sizeof(buf->ctrl));
sgs[out_num++] = 
 
if (out)
sgs[out_num++] = out;
 
/* Add return status. */
-   sg_init_one(, , sizeof(status));
+   sg_init_one(, >status, sizeof(buf->status));
sgs[out_num] = 
 
BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
-   if (unlikely(!virtqueue_kick(vi->cvq)))
-   return status == VIRTIO_NET_OK;
+   if (unlikely(!virtqueue_kick(vi->cvq))) {
+   ret = (buf->status == VIRTIO_NET_OK);
+   goto out;
+   }
 
/* Spin for a response, the kick causes an ioport write, trapping
 * into the hypervisor, so the request should be handled immediately.
@@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info 
*vi, u8 class, u8 cmd,
   !virtqueue_is_broken(vi->cvq))
cpu_relax();
 
-   return status == VIRTIO_NET_OK;
+   ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+   kfree(buf);
+   return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg[2];
-   u8 promisc, allmulti;
+   u8 *cmdbyte;
struct virtio_net_ctrl_mac *mac_data;
struct netdev_hw_addr *ha;
int uc_count;
@@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
return;
 
-   promisc = ((dev->flags & IFF_PROMISC) != 0);
-   allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+   cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+   if (!cmdbyte)
+   return;
 
-   sg_init_one(sg, , sizeof(promisc));
+   sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
 
+   *cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(>dev, "Failed to %sable promisc mode.\n",
-promisc ? "en" : "dis");
-
-   sg_init_one(sg, , sizeof(allmulti));
+*cmdbyte ? "en" : "dis");
 
+   *cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(>dev, "Failed to %sable allmulti mode.\n",
-allmulti ? "en" : "dis");
+*cmdbyte ? "en" : "dis");
+
+   kfree(cmdbyte);
 
uc_count = netdev_uc_count(dev);
mc_count = netdev_mc_count(dev);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Andy Lutomirski
This switches virtio to use the DMA API unconditionally.  I'm sure
it breaks things, but it seems to work on x86 using virtio-pci, with
and without Xen, and using both the modern 1.0 variant and the
legacy variant.

Changes from v2:
 - Fix really embarrassing bug.  This version actually works.

Changes from v1:
 - Fix an endian conversion error causing a BUG to hit.
 - Fix a DMA ordering issue (swiotlb=force works now).
 - Minor cleanups.

Andy Lutomirski (3):
  virtio_net: Stop doing DMA from the stack
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API

 drivers/net/virtio_net.c   |  53 +++
 drivers/virtio/Kconfig |   2 +-
 drivers/virtio/virtio_pci_common.h |   3 +-
 drivers/virtio/virtio_pci_legacy.c |  19 +++-
 drivers/virtio/virtio_pci_modern.c |  34 +--
 drivers/virtio/virtio_ring.c   | 187 ++---
 tools/virtio/linux/dma-mapping.h   |  17 
 7 files changed, 246 insertions(+), 69 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 11:53 PM, David Woodhouse <dw...@infradead.org> wrote:
> On Tue, 2015-10-27 at 23:38 -0700, Andy Lutomirski wrote:
>>
>> Changes from v2:
>>  - Fix really embarrassing bug.  This version actually works.
>
> So embarrassing you didn't want to tell us what it was? ...

Shhh, it's a secret!

I somehow managed to test-boot a different kernel than I thought I was booting.

>
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -292,7 +292,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> vq, desc, total_sg * sizeof(struct vring_desc),
> DMA_TO_DEVICE);
>
> -   if (vring_mapping_error(vq, vq->vring.desc[head].addr))
> +   if (vring_mapping_error(vq, addr))
> goto unmap_release;
>
> vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev, 
> VRING_DESC_F_INDIRECT);
>
> That wasn't going to be the reason for Christian's failure, was it?
>

Not obviously, but it's possible.  Now that I'm staring at it, I have
some more big-endian issues, so there'll be a v4.  I'll also play with
Michael's thing.  Expect a long delay, though -- my flight's about to
leave.

The readme notwithstanding, virtme (https://github.com/amluto/virtme)
actually has s390x support, so I can try to debug when I get home.
I'm not about to try doing this on a laptop :)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Andy Lutomirski
On Wed, Oct 28, 2015 at 9:12 AM, Michael S. Tsirkin  wrote:
> On Wed, Oct 28, 2015 at 11:32:34PM +0900, David Woodhouse wrote:
>> > I don't have a problem with extending DMA API to address
>> > more usecases.
>>
>> No, this isn't an extension. This is fixing a bug, on certain platforms
>> where the DMA API has currently done the wrong thing.
>>
>> We have historically worked around that bug by introducing *another*
>> bug, which is not to *use* the DMA API in the virtio driver.
>>
>> Sure, we can co-ordinate those two bug-fixes. But let's not talk about
>> them as anything other than bug-fixes.
>
> It was pretty practical not to use it. All virtio devices at the time
> without exception bypassed the IOMMU, so it was a question of omitting a
> couple of function calls in virtio versus hacking on DMA implementation
> on multiple platforms. We have more policy options now, so I agree it's
> time to revisit this.
>
> But for me, the most important thing is that we do coordinate.
>
>> > > Drivers use DMA API. No more talky.
>> >
>> > Well for virtio they don't ATM. And 1:1 mapping makes perfect sense
>> > for the wast majority of users, so I can't switch them over
>> > until the DMA API actually addresses all existing usecases.
>>
>> That's still not your business; it's the platform's. And there are
>> hardware implementations of the virtio protocols on real PCI cards. And
>> we have the option of doing IOMMU translation for the virtio devices
>> even in a virtual machine. Just don't get involved.
>>
>> --
>> dwmw2
>>
>>
>
> I'm involved anyway, it's possible not to put all the code in the virtio
> subsystem in guest though.  But I suspect we'll need to find a way for
> non-linux drivers within guest to work correctly too, and they might
> have trouble poking at things at the system level.  So possibly virtio
> subsystem will have to tell platform "this device wants to bypass IOMMU"
> and then DMA API does the right thing.
>

After some discussion at KS, no one came up with an example where it's
necessary, and the patches to convert virtqueue to use the DMA API are
much nicer when they convert it unconditionally.

The two interesting cases we thought of were PPC and x86's emulated
Q35 IOMMU.  PPC will look in to architecting a devicetree-based way to
indicate passthrough status and will add quirks for the existing
virtio devices.  Everyone seems to agree that x86's emulated Q35 thing
is just buggy right now and should be taught to use the existing ACPI
mechanism for enumerating passthrough devices.

I'll send a new version of the series soon.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC 0/4] dma ops and virtio

2015-10-28 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 3:48 PM, Christian Borntraeger
<borntrae...@de.ibm.com> wrote:
> This is an RFC to check if I am on the right track.  There
> are some attempts to unify the dma ops (Christoph) as well
> as some attempts to make virtio use the dma API (Andy).
>
> At todays discussion at the kernel summit, we concluded that
> we want to use the same code on all platforms, whereever
> possible, so having a dummy dma_op might be the easiest
> solution to keep virtio-ccw as similar as possible to
> virtio-pci. Andy Lutomirski will rework his patchset to
> unconditionally use the dma ops.  We will also need a
> compatibility quirk for powerpc to bypass the iommu mappings
> on older QEMU version (which translates to all versions as
> of today) and per device, e.g.  via device tree.  Ben
> Herrenschmidt will look into that.

The combination of these patches plus my series don't link for me
unless I enable PCI.  Would s390 need to select HAS_DMA from VIRTIO or
something similar?

Also, is it possible to use QEMU to emulate an s390x?  Even just:

qemu-system-s390x -M s390-ccw-virtio

exits immediately.  (I build from QEMU git today.)  I don't know what
options to pass to get -M s390 to do anything useful, and AFAICT it's
considered deprecated.

Help?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] virtio_net: Stop doing DMA from the stack

2015-10-27 Thread Andy Lutomirski
From: Andy Lutomirski <l...@amacapital.net>

Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack.  virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.

This should have no effect on any performance-critical code paths.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 drivers/net/virtio_net.c | 53 
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d8838dedb7a4..4f10f8a58811 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, 
u8 class, u8 cmd,
 struct scatterlist *out)
 {
struct scatterlist *sgs[4], hdr, stat;
-   struct virtio_net_ctrl_hdr ctrl;
-   virtio_net_ctrl_ack status = ~0;
+
+   struct {
+   struct virtio_net_ctrl_hdr ctrl;
+   virtio_net_ctrl_ack status;
+   } *buf;
+
unsigned out_num = 0, tmp;
+   bool ret;
 
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-   ctrl.class = class;
-   ctrl.cmd = cmd;
+   buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
+   if (!buf)
+   return false;
+   buf->status = ~0;
+
+   buf->ctrl.class = class;
+   buf->ctrl.cmd = cmd;
/* Add header */
-   sg_init_one(, , sizeof(ctrl));
+   sg_init_one(, >ctrl, sizeof(buf->ctrl));
sgs[out_num++] = 
 
if (out)
sgs[out_num++] = out;
 
/* Add return status. */
-   sg_init_one(, , sizeof(status));
+   sg_init_one(, >status, sizeof(buf->status));
sgs[out_num] = 
 
BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
-   if (unlikely(!virtqueue_kick(vi->cvq)))
-   return status == VIRTIO_NET_OK;
+   if (unlikely(!virtqueue_kick(vi->cvq))) {
+   ret = (buf->status == VIRTIO_NET_OK);
+   goto out;
+   }
 
/* Spin for a response, the kick causes an ioport write, trapping
 * into the hypervisor, so the request should be handled immediately.
@@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info 
*vi, u8 class, u8 cmd,
   !virtqueue_is_broken(vi->cvq))
cpu_relax();
 
-   return status == VIRTIO_NET_OK;
+   ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+   kfree(buf);
+   return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg[2];
-   u8 promisc, allmulti;
+   u8 *cmdbyte;
struct virtio_net_ctrl_mac *mac_data;
struct netdev_hw_addr *ha;
int uc_count;
@@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
return;
 
-   promisc = ((dev->flags & IFF_PROMISC) != 0);
-   allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+   cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+   if (!cmdbyte)
+   return;
 
-   sg_init_one(sg, , sizeof(promisc));
+   sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
 
+   *cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(>dev, "Failed to %sable promisc mode.\n",
-promisc ? "en" : "dis");
-
-   sg_init_one(sg, , sizeof(allmulti));
+*cmdbyte ? "en" : "dis");
 
+   *cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(>dev, "Failed to %sable allmulti mode.\n",
-allmulti ? "en" : "dis");
+*cmdbyte ? "en" : "dis");
+
+   kfree(cmdbyte);
 
uc_count = netdev_uc_count(dev);
mc_count = netdev_mc_count(dev);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] virtio DMA API core stuff

2015-10-27 Thread Andy Lutomirski
This switches virtio to use the DMA API unconditionally.  I'm sure
it breaks things, but it seems to work on x86 using virtio-pci, with
and without Xen, and using both the modern 1.0 variant and the
legacy variant.

Andy Lutomirski (3):
  virtio_net: Stop doing DMA from the stack
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API

 drivers/net/virtio_net.c   |  53 +++
 drivers/virtio/Kconfig |   2 +-
 drivers/virtio/virtio_pci_common.h |   3 +-
 drivers/virtio/virtio_pci_legacy.c |  19 +++-
 drivers/virtio/virtio_pci_modern.c |  34 ---
 drivers/virtio/virtio_ring.c   | 179 ++---
 tools/virtio/linux/dma-mapping.h   |  17 
 7 files changed, 241 insertions(+), 66 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Andy Lutomirski
From: Andy Lutomirski <l...@amacapital.net>

virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.

Signed-off-by: Andy Lutomirski <l...@amacapital.net>
---
 drivers/virtio/Kconfig   |   2 +-
 drivers/virtio/virtio_ring.c | 179 +++
 tools/virtio/linux/dma-mapping.h |  17 
 3 files changed, 164 insertions(+), 34 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cab9f3f63a38..77590320d44c 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -60,7 +60,7 @@ config VIRTIO_INPUT
 
  config VIRTIO_MMIO
tristate "Platform bus driver for memory mapped virtio devices"
-   depends on HAS_IOMEM
+   depends on HAS_IOMEM && HAS_DMA
select VIRTIO
---help---
 This drivers provides support for memory mapped virtio
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..911fbaa7a260 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,7 +55,14 @@
 #define END_USE(vq)
 #endif
 
-struct vring_virtqueue {
+struct vring_desc_state
+{
+   void *data; /* Data for callback. */
+   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+};
+
+struct vring_virtqueue
+{
struct virtqueue vq;
 
/* Actual memory layout for this queue */
@@ -92,12 +100,71 @@ struct vring_virtqueue {
ktime_t last_add_time;
 #endif
 
-   /* Tokens for callbacks. */
-   void *data[];
+   /* Per-descriptor state. */
+   struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/*
+ * The DMA ops on various arches are rather gnarly right now, and
+ * making all of the arch DMA ops work on the vring device itself
+ * is a mess.  For now, we use the parent device for DMA ops.
+ */
+struct device *vring_dma_dev(const struct vring_virtqueue *vq)
+{
+   return vq->vq.vdev->dev.parent;
+}
+
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+  struct scatterlist *sg,
+  enum dma_data_direction direction)
+{
+   /*
+* We can't use dma_map_sg, because we don't use scatterlists in
+* the way it expects (we don't guarantee that the scatterlist
+* will exist for the lifetime of the mapping).
+*/
+   return dma_map_page(vring_dma_dev(vq),
+   sg_page(sg), sg->offset, sg->length,
+   direction);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+  void *cpu_addr, size_t size,
+  enum dma_data_direction direction)
+{
+   return dma_map_single(vring_dma_dev(vq),
+ cpu_addr, size, direction);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+   struct vring_desc *desc)
+{
+   u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
 static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 

[PATCH 3/3] virtio_pci: Use the DMA API

2015-10-27 Thread Andy Lutomirski
From: Andy Lutomirski <l...@amacapital.net>

This fixes virtio-pci on platforms and busses that have IOMMUs.  This
will break the experimental QEMU Q35 IOMMU support until QEMU is
fixed.  In exchange, it fixes physical virtio hardware as well as
virtio-pci running under Xen.

We should clean up the virtqueue API to do its own allocation and
teach virtqueue_get_avail and virtqueue_get_used to return DMA
addresses directly.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 drivers/virtio/virtio_pci_common.h |  3 ++-
 drivers/virtio/virtio_pci_legacy.c | 19 +++
 drivers/virtio/virtio_pci_modern.c | 34 --
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index b976d968e793..cd6196b513ad 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -38,8 +38,9 @@ struct virtio_pci_vq_info {
/* the number of entries in the queue */
int num;
 
-   /* the virtual address of the ring queue */
+   /* the ring queue */
void *queue;
+   dma_addr_t queue_dma_addr;  /* bus address */
 
/* the list node for the virtqueues list */
struct list_head node;
diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index 48bc9797e530..b5293e5f2af4 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
info->msix_vector = msix_vec;
 
size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-   info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+   info->queue = dma_zalloc_coherent(_dev->pci_dev->dev, size,
+ >queue_dma_addr,
+ GFP_KERNEL);
if (info->queue == NULL)
return ERR_PTR(-ENOMEM);
 
/* activate the queue */
-   iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
@@ -169,7 +171,8 @@ out_assign:
vring_del_virtqueue(vq);
 out_activate_queue:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
return ERR_PTR(err);
 }
 
@@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info)
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device 
*vp_dev)
return -ENODEV;
}
 
+   rc = dma_set_mask_and_coherent(_dev->dev, DMA_BIT_MASK(64));
+   if (rc)
+   rc = dma_set_mask_and_coherent(_dev->dev,
+   DMA_BIT_MASK(32));
+   if (rc)
+   dev_warn(_dev->dev, "Failed to enable 64-bit or 32-bit DMA. 
 Trying to continue, but this might not work.\n");
+
rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
if (rc)
return rc;
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index 8e5cf194cc0b..fbe0bd1c4881 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num)
return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES));
 }
 
-static void *alloc_virtqueue_pages(int *num)
+static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev,
+  int *num, dma_addr_t *dma_addr)
 {
void *pages;
 
/* TODO: allocate each queue chunk individually */
for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) {
-   pages = alloc_pages_exact(vring_pci_size(*num),
- GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN);
+   pages = dma_zalloc_coherent(
+   _dev->pci_dev->dev, vring_pci_size(*num),
+   dma_addr, GFP_KERNEL|__GFP_NOWARN);
if (pages)
return pages;
}
@@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num)
return NULL;
 
/* Try to get a single page.

Re: [PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 7:27 PM, Christian Borntraeger
<borntrae...@de.ibm.com> wrote:
> Am 28.10.2015 um 10:17 schrieb Andy Lutomirski:
> @@ -423,27 +522,42 @@ EXPORT_SYMBOL_GPL(virtqueue_kick);
>>
>>  static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
>>  {
>> - unsigned int i;
>> + unsigned int i, j;
>> + u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
>>
>>   /* Clear data ptr. */
>> - vq->data[head] = NULL;
>> + vq->desc_state[head].data = NULL;
>>
>> - /* Put back on free list: find end */
>> + /* Put back on free list: unmap first-level descriptors and find end */
>>   i = head;
>>
>> - /* Free the indirect table */
>> - if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, 
>> VRING_DESC_F_INDIRECT))
>> - kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, 
>> vq->vring.desc[i].addr)));
>> -
>> - while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, 
>> VRING_DESC_F_NEXT)) {
>> + while (vq->vring.desc[i].flags & nextflag) {
>> + vring_unmap_one(vq, >vring.desc[i]);
>>   i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
>>   vq->vq.num_free++;
>>   }
>>
>> + vring_unmap_one(vq, >vring.desc[i]);
>>   vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
>>   vq->free_head = head;
>> +
>>   /* Plus final descriptor */
>>   vq->vq.num_free++;
>> +
>> + /* Free the indirect table, if any, now that it's unmapped. */
>> + if (vq->desc_state[head].indir_desc) {
>> + struct vring_desc *indir_desc = 
>> vq->desc_state[head].indir_desc;
>> + u32 len = vq->vring.desc[head].len;
>> +
>> + BUG_ON(!(vq->vring.desc[head].flags & VRING_DESC_F_INDIRECT));
>> + BUG_ON(len == 0 || len % sizeof(struct vring_desc));
>> +
>> + for (j = 0; j < len / sizeof(struct vring_desc); j++)
>> + vring_unmap_one(vq, _desc[j]);
>> +
>> + kfree(vq->desc_state[head].indir_desc);
>> + vq->desc_state[head].indir_desc = NULL;
>> + }
>>  }
>
> something seems to be broken with indirect descriptors with that change

You're big endian, right?  I had an endian handling bug, and I'll fix it in v2.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] context_tracking: remove duplicate enabled check

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 6:39 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> All calls to context_tracking_enter and context_tracking_exit
> are already checking context_tracking_is_enabled, except the
> context_tracking_user_enter and context_tracking_user_exit
> functions left in for the benefit of assembly calls.
>
> Pull the check up to those functions, by making them simple
> wrappers around the user_enter and user_exit inline functions.

This makes my brain hurt.  Assuming that this survives a boot with
CONFIG_CONTEXT_TRACKING_FORCE=y and CONFIG_PROVE_LOCKING=y (with the
implied CONFIG_PROVE_RCU), then:

Acked-by: Andy Lutomirski <l...@kernel.org>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] context_tracking: avoid irq_save/irq_restore on guest entry and exit

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 6:39 PM, Paolo Bonzini  wrote:
> guest_enter and guest_exit must be called with interrupts disabled,
> since they take the vtime_seqlock with write_seq{lock,unlock}.
> Therefore, it is not necessary to check for exceptions, nor to
> save/restore the IRQ state, when context tracking functions are
> called by guest_enter and guest_exit.
>
> Split the body of context_tracking_entry and context_tracking_exit
> out to __-prefixed functions, and use them from KVM.
>
> Rik van Riel has measured this to speed up a tight vmentry/vmexit
> loop by about 2%.

Looks generally sensible.  I'm not familiar enough with the code to
call it reviewed-by while sitting on the airport shuttle.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] virtio_net: Stop doing DMA from the stack

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 7:07 PM, Joerg Roedel <jroe...@suse.de> wrote:
> On Tue, Oct 27, 2015 at 06:17:08PM -0700, Andy Lutomirski wrote:
>> From: Andy Lutomirski <l...@amacapital.net>
>>
>> Once virtio starts using the DMA API, we won't be able to safely DMA
>> from the stack.  virtio-net does a couple of config DMA requests
>> from small stack buffers -- switch to using dynamically-allocated
>> memory.
>>
>> This should have no effect on any performance-critical code paths.
>>
>> Signed-off-by: Andy Lutomirski <l...@kernel.org>
>> ---
>>  drivers/net/virtio_net.c | 53 
>> 
>>  1 file changed, 36 insertions(+), 17 deletions(-)
>
> Reviewed-by: Joerg Roedel <jroe...@suse.de>

Should I just send this to netdev now for 4.4?  We could get this one
out of the way, maybe.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/3] virtio DMA API core stuff

2015-10-27 Thread Andy Lutomirski
This switches virtio to use the DMA API unconditionally.  I'm sure
it breaks things, but it seems to work on x86 using virtio-pci, with
and without Xen, and using both the modern 1.0 variant and the
legacy variant.

Changes from v1:
 - Fix an endian conversion error causing a BUG to hit.
 - Fix a DMA ordering issue (swiotlb=force works now).
 - Minor cleanups.

Andy Lutomirski (3):
  virtio_net: Stop doing DMA from the stack
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API

 drivers/net/virtio_net.c   |  53 +++
 drivers/virtio/Kconfig |   2 +-
 drivers/virtio/virtio_pci_common.h |   3 +-
 drivers/virtio/virtio_pci_legacy.c |  19 +++-
 drivers/virtio/virtio_pci_modern.c |  34 +--
 drivers/virtio/virtio_ring.c   | 187 ++---
 tools/virtio/linux/dma-mapping.h   |  17 
 7 files changed, 246 insertions(+), 69 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 7:06 PM, Joerg Roedel <jroe...@suse.de> wrote:
> Hi Andy,
>
> On Tue, Oct 27, 2015 at 06:17:09PM -0700, Andy Lutomirski wrote:
>> From: Andy Lutomirski <l...@amacapital.net>
>>
>> virtio_ring currently sends the device (usually a hypervisor)
>> physical addresses of its I/O buffers.  This is okay when DMA
>> addresses and physical addresses are the same thing, but this isn't
>> always the case.  For example, this never works on Xen guests, and
>> it is likely to fail if a physical "virtio" device ever ends up
>> behind an IOMMU or swiotlb.
>
> The overall code looks good, but I havn't seen and dma_sync* calls.
> When swiotlb=force is in use this would break.
>
>> + vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, 
>> vring_map_single(
>> + vq,
>> + desc, total_sg * sizeof(struct vring_desc),
>> + DMA_TO_DEVICE));
>

Are you talking about a dma_sync call on the descriptor ring itself?
Isn't dma_alloc_coherent supposed to make that unnecessary?  I should
move the allocation into the virtqueue code.

The docs suggest that I might need to "flush the processor's write
buffers before telling devices to read that memory".  I'm not sure how
to do that.

> Nit: I think readability could be improved here by using a temporary
> variable for the return value of vring_map_single().
>

I'll do something like that for v2.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 7:21 PM, Joerg Roedel <jroe...@suse.de> wrote:
> On Tue, Oct 27, 2015 at 07:13:56PM -0700, Andy Lutomirski wrote:
>> On Tue, Oct 27, 2015 at 7:06 PM, Joerg Roedel <jroe...@suse.de> wrote:
>> > Hi Andy,
>> >
>> > On Tue, Oct 27, 2015 at 06:17:09PM -0700, Andy Lutomirski wrote:
>> >> From: Andy Lutomirski <l...@amacapital.net>
>> >>
>> >> virtio_ring currently sends the device (usually a hypervisor)
>> >> physical addresses of its I/O buffers.  This is okay when DMA
>> >> addresses and physical addresses are the same thing, but this isn't
>> >> always the case.  For example, this never works on Xen guests, and
>> >> it is likely to fail if a physical "virtio" device ever ends up
>> >> behind an IOMMU or swiotlb.
>> >
>> > The overall code looks good, but I havn't seen and dma_sync* calls.
>> > When swiotlb=force is in use this would break.
>> >
>> >> + vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, 
>> >> vring_map_single(
>> >> + vq,
>> >> + desc, total_sg * sizeof(struct vring_desc),
>> >> + DMA_TO_DEVICE));
>> >
>>
>> Are you talking about a dma_sync call on the descriptor ring itself?
>> Isn't dma_alloc_coherent supposed to make that unnecessary?  I should
>> move the allocation into the virtqueue code.
>>
>> The docs suggest that I might need to "flush the processor's write
>> buffers before telling devices to read that memory".  I'm not sure how
>> to do that.
>
> The write buffers should be flushed by the dma-api functions if
> necessary.  For dma_alloc_coherent allocations you don't need to call
> dma_sync*, but for the map_single/map_page/map_sg ones, as these might
> be bounce-buffered.

I think that all the necessary barriers are already there.  I had a
nasty bug that swiotlb=force exposed, though, which I've fixed.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] x86: context_tracking: avoid irq_save/irq_restore on kernel entry and exit

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 6:39 PM, Paolo Bonzini  wrote:
> x86 always calls user_enter and user_exit with interrupt disabled.
> Therefore, it is not necessary to check for exceptions, nor to
> save/restore the IRQ state, when context tracking functions are
> called by guest_enter and guest_exit.
>
> Use the previously introduced __context_tracking_entry and
> __context_tracking_exit.

x86 isn't ready for this yet.  We could do a quick-and-dirty fix with
explicit IRQs-on-and-off much protected by the static key, or we could
just wait until I finish the syscall cleanup.  I favor the latter, but
you're all welcome to do the former and I'll review it.

BTW, Frederic, I have a static key patch coming that I think you'll
like.  I'll send it tomorrow once I'm in front of a real computer.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Andy Lutomirski
virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 drivers/virtio/Kconfig   |   2 +-
 drivers/virtio/virtio_ring.c | 187 +++
 tools/virtio/linux/dma-mapping.h |  17 
 3 files changed, 169 insertions(+), 37 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cab9f3f63a38..77590320d44c 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -60,7 +60,7 @@ config VIRTIO_INPUT
 
  config VIRTIO_MMIO
tristate "Platform bus driver for memory mapped virtio devices"
-   depends on HAS_IOMEM
+   depends on HAS_IOMEM && HAS_DMA
select VIRTIO
---help---
 This drivers provides support for memory mapped virtio
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..e71bf0ca59a2 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,7 +55,14 @@
 #define END_USE(vq)
 #endif
 
-struct vring_virtqueue {
+struct vring_desc_state
+{
+   void *data; /* Data for callback. */
+   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+};
+
+struct vring_virtqueue
+{
struct virtqueue vq;
 
/* Actual memory layout for this queue */
@@ -92,12 +100,71 @@ struct vring_virtqueue {
ktime_t last_add_time;
 #endif
 
-   /* Tokens for callbacks. */
-   void *data[];
+   /* Per-descriptor state. */
+   struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/*
+ * The DMA ops on various arches are rather gnarly right now, and
+ * making all of the arch DMA ops work on the vring device itself
+ * is a mess.  For now, we use the parent device for DMA ops.
+ */
+struct device *vring_dma_dev(const struct vring_virtqueue *vq)
+{
+   return vq->vq.vdev->dev.parent;
+}
+
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+  struct scatterlist *sg,
+  enum dma_data_direction direction)
+{
+   /*
+* We can't use dma_map_sg, because we don't use scatterlists in
+* the way it expects (we don't guarantee that the scatterlist
+* will exist for the lifetime of the mapping).
+*/
+   return dma_map_page(vring_dma_dev(vq),
+   sg_page(sg), sg->offset, sg->length,
+   direction);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+  void *cpu_addr, size_t size,
+  enum dma_data_direction direction)
+{
+   return dma_map_single(vring_dma_dev(vq),
+ cpu_addr, size, direction);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+   struct vring_desc *desc)
+{
+   u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
 static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 unsigned int total_sg, gfp_t gfp)
 {
@

[PATCH v2 1/3] virtio_net: Stop doing DMA from the stack

2015-10-27 Thread Andy Lutomirski
From: Andy Lutomirski <l...@amacapital.net>

Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack.  virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.

This should have no effect on any performance-critical code paths.

Cc: net...@vger.kernel.org
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: virtualizat...@lists.linux-foundation.org
Reviewed-by: Joerg Roedel <jroe...@suse.de>
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---

Hi Michael and DaveM-

This is a prerequisite for the virtio DMA fixing project.  It works
as a standalone patch, though.  Would it make sense to apply it to
an appropriate networking tree now?

 drivers/net/virtio_net.c | 53 
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d8838dedb7a4..4f10f8a58811 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, 
u8 class, u8 cmd,
 struct scatterlist *out)
 {
struct scatterlist *sgs[4], hdr, stat;
-   struct virtio_net_ctrl_hdr ctrl;
-   virtio_net_ctrl_ack status = ~0;
+
+   struct {
+   struct virtio_net_ctrl_hdr ctrl;
+   virtio_net_ctrl_ack status;
+   } *buf;
+
unsigned out_num = 0, tmp;
+   bool ret;
 
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-   ctrl.class = class;
-   ctrl.cmd = cmd;
+   buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
+   if (!buf)
+   return false;
+   buf->status = ~0;
+
+   buf->ctrl.class = class;
+   buf->ctrl.cmd = cmd;
/* Add header */
-   sg_init_one(, , sizeof(ctrl));
+   sg_init_one(, >ctrl, sizeof(buf->ctrl));
sgs[out_num++] = 
 
if (out)
sgs[out_num++] = out;
 
/* Add return status. */
-   sg_init_one(, , sizeof(status));
+   sg_init_one(, >status, sizeof(buf->status));
sgs[out_num] = 
 
BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
-   if (unlikely(!virtqueue_kick(vi->cvq)))
-   return status == VIRTIO_NET_OK;
+   if (unlikely(!virtqueue_kick(vi->cvq))) {
+   ret = (buf->status == VIRTIO_NET_OK);
+   goto out;
+   }
 
/* Spin for a response, the kick causes an ioport write, trapping
 * into the hypervisor, so the request should be handled immediately.
@@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info 
*vi, u8 class, u8 cmd,
   !virtqueue_is_broken(vi->cvq))
cpu_relax();
 
-   return status == VIRTIO_NET_OK;
+   ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+   kfree(buf);
+   return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg[2];
-   u8 promisc, allmulti;
+   u8 *cmdbyte;
struct virtio_net_ctrl_mac *mac_data;
struct netdev_hw_addr *ha;
int uc_count;
@@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
return;
 
-   promisc = ((dev->flags & IFF_PROMISC) != 0);
-   allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+   cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+   if (!cmdbyte)
+   return;
 
-   sg_init_one(sg, , sizeof(promisc));
+   sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
 
+   *cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(>dev, "Failed to %sable promisc mode.\n",
-promisc ? "en" : "dis");
-
-   sg_init_one(sg, , sizeof(allmulti));
+*cmdbyte ? "en" : "dis");
 
+   *cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(>dev, "Failed to %sable allmulti mode.\n",
-allmulti ? "en" : "dis");
+*cmdbyte ? "en" : "dis");
+
+   kfree(cmdbyte);
 
uc_count = netdev_uc_count(dev);
mc_count = netdev_mc_count(dev);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/3] virtio_pci: Use the DMA API

2015-10-27 Thread Andy Lutomirski
From: Andy Lutomirski <l...@amacapital.net>

This fixes virtio-pci on platforms and busses that have IOMMUs.  This
will break the experimental QEMU Q35 IOMMU support until QEMU is
fixed.  In exchange, it fixes physical virtio hardware as well as
virtio-pci running under Xen.

We should clean up the virtqueue API to do its own allocation and
teach virtqueue_get_avail and virtqueue_get_used to return DMA
addresses directly.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---

drivers/virtio/virtio_pci_common.h |  3 ++-
 drivers/virtio/virtio_pci_legacy.c | 19 +++
 drivers/virtio/virtio_pci_modern.c | 34 --
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index b976d968e793..cd6196b513ad 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -38,8 +38,9 @@ struct virtio_pci_vq_info {
/* the number of entries in the queue */
int num;
 
-   /* the virtual address of the ring queue */
+   /* the ring queue */
void *queue;
+   dma_addr_t queue_dma_addr;  /* bus address */
 
/* the list node for the virtqueues list */
struct list_head node;
diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index 48bc9797e530..b5293e5f2af4 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
info->msix_vector = msix_vec;
 
size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-   info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+   info->queue = dma_zalloc_coherent(_dev->pci_dev->dev, size,
+ >queue_dma_addr,
+ GFP_KERNEL);
if (info->queue == NULL)
return ERR_PTR(-ENOMEM);
 
/* activate the queue */
-   iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
@@ -169,7 +171,8 @@ out_assign:
vring_del_virtqueue(vq);
 out_activate_queue:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
return ERR_PTR(err);
 }
 
@@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info)
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device 
*vp_dev)
return -ENODEV;
}
 
+   rc = dma_set_mask_and_coherent(_dev->dev, DMA_BIT_MASK(64));
+   if (rc)
+   rc = dma_set_mask_and_coherent(_dev->dev,
+   DMA_BIT_MASK(32));
+   if (rc)
+   dev_warn(_dev->dev, "Failed to enable 64-bit or 32-bit DMA. 
 Trying to continue, but this might not work.\n");
+
rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
if (rc)
return rc;
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index 8e5cf194cc0b..fbe0bd1c4881 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num)
return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES));
 }
 
-static void *alloc_virtqueue_pages(int *num)
+static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev,
+  int *num, dma_addr_t *dma_addr)
 {
void *pages;
 
/* TODO: allocate each queue chunk individually */
for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) {
-   pages = alloc_pages_exact(vring_pci_size(*num),
- GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN);
+   pages = dma_zalloc_coherent(
+   _dev->pci_dev->dev, vring_pci_size(*num),
+   dma_addr, GFP_KERNEL|__GFP_NOWARN);
if (pages)
return pages;
}
@@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num)
return NULL;
 
/* Try to get a single page.

Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Andy Lutomirski
On Thu, Oct 1, 2015 at 3:33 PM, Dave Hansen  wrote:
> On 10/01/2015 01:39 PM, Kees Cook wrote:
>> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar  wrote:
>>> So could we try to add an (opt-in) kernel option that enables this 
>>> transparently
>>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>>> user-space changes and syscalls necessary?
>>
>> I would like this very much. :)
>
> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
> if I boot with this, though.

Somebody really ought to rework things so that a crash in init prints
out a normal indication of the unhandled signal and optionally leaves
everything else running.

Also...

EPT seems to have separate R, W, and X flags.  I wonder if it would
make sense to add a KVM paravirt feature that maps the entire guest
physical space an extra time at a monstrous offset with R cleared in
the EPT and passes through a #PF or other notification (KVM-specific
thing? #VE?) on a read fault.

This wouldn't even need a whole duplicate paging hierarchy -- it would
just duplicate the EPT PML4 entries, so it would add exactly zero
runtime memory usage.

The guest would use it by treating the high bit of the physical
address as a "may read" bit.

This reminds me -- we should probably wire up X86_TRAP_VE with a stub
that OOPSes until someone figures out some more useful thing to do.
We're probably not doing anyone any favors by unconditionally
promoting them to double-faults.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-30 Thread Andy Lutomirski
On Wed, Sep 30, 2015 at 7:01 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> * Peter Zijlstra <pet...@infradead.org> wrote:
>
>> On Mon, Sep 21, 2015 at 09:36:15AM -0700, Linus Torvalds wrote:
>> > On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mi...@kernel.org> wrote:
>> > >
>> > > Linus, what's your preference?
>> >
>> > So quite frankly, is there any reason we don't just implement
>> > native_read_msr() as just
>> >
>> >unsigned long long native_read_msr(unsigned int msr)
>> >{
>> >   int err;
>> >   unsigned long long val;
>> >
>> >   val = native_read_msr_safe(msr, );
>> >   WARN_ON_ONCE(err);
>> >   return val;
>> >}
>> >
>> > Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
>> > done with it. I don't see the downside.
>> >
>> > How many msr reads are so critical that the function call
>> > overhead would matter? Get rid of the inline version of the _safe()
>> > thing too, and put that thing there too.
>>
>> There are a few in the perf code, and esp. on cores without a stack engine 
>> the
>> call overhead is noticeable. Also note that the perf MSRs are generally
>> optimized MSRs and less slow (we cannot say fast, they're still MSRs) than
>> regular MSRs.
>
> These could still be open coded in an inlined fashion, like the scheduler 
> usage.
>

We could have a raw_rdmsr for those.

OTOH, I'm still not 100% convinced that this warn-but-don't-die
behavior is worth the effort.  This isn't a frequent source of bugs to
my knowledge, and we don't try to recover from incorrect cr writes,
out-of-bounds MMIO, etc, so do we really gain much by rigging a
recovery mechanism for rdmsr and wrmsr failures for code that doesn't
use the _safe variants?

--Andy

> Thanks,
>
> Ingo



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Andy Lutomirski
On Tue, Sep 29, 2015 at 10:50 AM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>>
>> Does anyone know what happens if you stick a non-accessed segment in
>> the GDT, map the GDT RO, and access it?
>
> You should get a #PF, as you guess, but go ahead and test it if you
> want to make sure.
>

Then I think that, if we do this, the patch series should first make
it percpu and fixmapped but RW and then flip it RO as a separate patch
in case we need to revert the actual RO bit.  I don't want to break
Wine or The Witcher 2 because of this, and we might need various
fixups.  I really hope that no one uses get_thread_area to check
whether TLS has been accessed.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Andy Lutomirski
On Sep 29, 2015 2:01 AM, "Ingo Molnar"  wrote:
>
>
> * Denys Vlasenko  wrote:
>
> > On 09/28/2015 09:58 AM, Ingo Molnar wrote:
> > >
> > > * Denys Vlasenko  wrote:
> > >
> > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote:
> > >>> NAK.  We really should map the GDT read-only on all 64 bit systems,
> > >>> since we can't hide the address from SLDT.  Same with the IDT.
> > >>
> > >> Sorry, I don't understand your point.
> > >
> > > So the problem is that right now the SGDT instruction (which is 
> > > unprivileged)
> > > leaks the real address of the kernel image:
> > >
> > >  fomalhaut:~> ./sgdt
> > >  SGDT: 88303fd89000 / 007f
> > >
> > > that '88303fd89000' is a kernel address.
> >
> > Thank you.
> > I do know that SGDT and friends are unprivileged on x86
> > and thus they allow userspace (and guest kernels in paravirt)
> > learn things they don't need to know.
> >
> > I don't see how making GDT page-aligned and page-sized
> > changes anything in this regard. SGDT will still work,
> > and still leak GDT address.
>
> Well, as I try to explain it in the other part of my mail, doing so enables 
> us to
> remap the GDT to a less security sensitive virtual address that does not leak 
> the
> kernel's randomized address:
>
> > > Your observation in the changelog and your patch:
> > >
> >  It is page-sized because of paravirt. [...]
> > >
> > > ... conflicts with the intention to mark (remap) the primary GDT address 
> > > read-only
> > > on native kernels as well.
> > >
> > > So what we should do instead is to use the page alignment properly and 
> > > remap the
> > > GDT to a read-only location, and load that one.
> >
> > If we'd have a small GDT (i.e. what my patch does), we still can remap the
> > entire page which contains small GDT, and simply don't care that some other 
> > data
> > is also visible through that RO page.
>
> That's generally considered fragile: suppose an attacker has a limited 
> information
> leak that can read absolute addresses with system privilege but he doesn't 
> know
> the kernel's randomized base offset. With a 'partial page' mapping there 
> could be
> function pointers near the GDT, part of the page the GDT happens to be on, 
> that
> leak this information.
>
> (Same goes for crypto keys or other critical information (like canary 
> information,
> salts, etc.) accidentally ending up nearby.)
>
> Arguably it's a bit tenuous, but when playing remapping games it's generally
> considered good to be page aligned and page sized, with zero padding.
>
> > > This would have a couple of advantages:
> > >
> > >  - This would give kernel address randomization more teeth on x86.
> > >
> > >  - An additional advantage would be that rootkits overwriting the GDT 
> > > would have
> > >a bit more work to do.
> > >
> > >  - A third advantage would be that for NUMA systems we could 'mirror' the 
> > > GDT into
> > >node-local memory and load those. This makes GDT load cache-misses a 
> > > bit less
> > >expensive.
> >
> > GDT is per-cpu. Isn't per-cpu memory already NUMA-local?
>
> Indeed it is:
>
> fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; done
> SGDT: 88103fa09000 / 007f
> SGDT: 88103fa29000 / 007f
> SGDT: 88103fa29000 / 007f
> SGDT: 88103fa49000 / 007f
> SGDT: 88103fa49000 / 007f
> SGDT: 88103fa49000 / 007f
> SGDT: 88103fa29000 / 007f
> SGDT: 88103fa69000 / 007f
>
> I confused it with the IDT, which is still global.
>
> This also means that the GDT in itself does not leak kernel addresses at the
> moment, except it leaks the layout of the percpu area.
>
> So my suggestion would be to:
>
>  - make the GDT unconditionally page aligned and sized, then remap it to a
>read-only address unconditionally as well, like we do it for the IDT.

Does anyone know what happens if you stick a non-accessed segment in
the GDT, map the GDT RO, and access it?  The docs are extremely vague
on the interplay between segmentation and paging on the segmentation
structures themselves.  My guess is that it causes #PF.  This might
break set_thread_area users unless we change set_thread_area to force
the accessed bit on.

There's a possible worse failure mode: if someone pokes an un-accessed
segment into SS or CS using sigreturn, then it's within the realm of
possibility that IRET would generate #PF (hey Intel and AMD, please
document this!).  I don't think that would be rootable, but at the
very least we'd want to make sure it doesn't OOPS by either making it
impossible or adding an explicit test to sigreturn.c.

hpa pointed out in another thread that the GDT *must* be writable on
32-bit kernels because we use a task gate for NMI and jumping through
a task gate writes to the GDT.

On another note, SGDT is considerably faster than LSL, at least on
Sandy Bridge.  The vdso might be able to take advantage of that for
getcpu.

--Andy
--
To unsubscribe from this list: send the line 

Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Andy Lutomirski
On Tue, Sep 29, 2015 at 11:18 AM, H. Peter Anvin  wrote:
> SGDT would be easy to use, and it is logical that it is faster since it reads 
> an internal register.  SIDT does too but unlike the GDT has a secondary limit 
> (it can never be larger than 4096 bytes) and so all limits in the range 
> 4095-65535 are exactly equivalent.
>

Using the IDT limit would have been a great ideal if Intel hadn't
decided to clobber it on every VM exit.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rdmsr_safe in Linux PV (under Xen) gets an #GP:Re: [Fedora-xen] Running fedora xen on top of KVM?

2015-09-22 Thread Andy Lutomirski
On Tue, Sep 22, 2015 at 11:23 AM, Konrad Rzeszutek Wilk
<konrad.w...@oracle.com> wrote:
> On Sun, Sep 20, 2015 at 09:49:04PM -0700, Andy Lutomirski wrote:
>> On Fri, Sep 18, 2015 at 12:04 PM, Borislav Petkov <b...@alien8.de> wrote:
>> > On Fri, Sep 18, 2015 at 08:20:46AM -0700, Andy Lutomirski wrote:
>> >> In any event, Borislav, you must have typed rdmsr_safe for a reason :)
>> >
>> > Wasn't me:
>> >
>> > 6c62aa4a3c12 ("x86: make amd.c have 64bit support code")
>> >
>> > I think the error handling of rdmsrl_safe() was needed to do the pfn
>> > games which are done in the if-clause.
>>
>> I just tried it.  rdmsrl_safe and friends definitely work fine in that
>> code.  I think that Linux's Xen startup code is buggy and fails to set
>> up early exception handling.
>>
>> Try this (horribly whitespace damaged):
>>
>>  static void __init early_identify_cpu(struct cpuinfo_x86 *c)
>>  {
>> +   u64 tmp;
>>  #ifdef CONFIG_X86_64
>> c->x86_clflush_size = 64;
>> c->x86_phys_bits = 36;
>> @@ -752,6 +753,9 @@ static void __init early_identify_cpu(struct cpuinfo_x86 
>> *c)
>> c->cpu_index = 0;
>> filter_cpuid_features(c, false);
>>
>> +   pr_err("trying to crash\n");
>> +   rdmsrl_safe(0x12345678, );
>> +
>>
>> It works fine.  I bet it crashes on a Xen guest, though.  I assume
>> that Xen just works in most cases by luck.
>
> (d31) mapping kernel into physical memory
> (d31) about to get started...
> (XEN) traps.c:3151: GPF (): 82d0801a31ed -> 82d08023c77b
> (XEN) traps.c:459:d31v0 Unhandled general protection fault fault/trap [#13] 
> on VCPU 0 [ec=]
> (XEN) domain_crash_sync called from entry.S: fault at 82d080238213 
> create_bounce_frame+0x12b/0x13a
> (XEN) Domain 31 (vcpu#0) crashed on cpu#35:
> (XEN) [ Xen-4.5.0  x86_64  debug=n  Not tainted ]
> (XEN) CPU:35
> (XEN) RIP:e033:[]
> (XEN) RFLAGS: 0246   EM: 1   CONTEXT: pv guest
> (XEN) rax:    rbx: 81c03e64   rcx: 12345678
> (XEN) rdx: 81c03de8   rsi: 81c03dec   rdi: 12345278
> (XEN) rbp: 81c03e48   rsp: 81c03dd0   r8:  7420676e69797274
> (XEN) r9:  6873617263206f74   r10:    r11: 
> (XEN) r12: 12345678   r13: 81c03f00   r14: 
> (XEN) r15:    cr0: 8005003b   cr4: 001526f0
> (XEN) cr3: 0014e8c97000   cr2: 
> (XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033
> (XEN) Guest stack trace from rsp=81c03dd0:
> (XEN)12345678   81041b64
> (XEN)0001e030 00010046 81c03e18 e02b
> (XEN)81041b5d 81c03e48 00811809 
> (XEN)01a0 0100 82009000 81c03e68
> (XEN)81d211ea   81c03ed8
> (XEN)81d1be59 81c03ed8 811892ab 0010
> (XEN)81c03ee8 81c03ea8 697a696c61697469 81f15442
> (XEN) 81db3900  
> (XEN) 81c03f28 81d10f0a 
> (XEN)   
> (XEN)   81c03f38
> (XEN)81d10603 81c03ff8 81d15f5c 
> (XEN)   
> (XEN)   
> (XEN)   
> (XEN)   
> (XEN) ffd83a031f898b75 22400800 0001
> (XEN)  00010102464c457f 
> (XEN)0001003e0003 0940 0040 12a0
> (XEN)00380040 001100120044 00050001 
> [root@ovs107 ~]#
>
> (gdb) x/20i 0x81041b64
>0x81041b64:  rdmsr
>

Exactly.  I think that Xen is missing code to wire up early_fixup_exception.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-21 Thread Andy Lutomirski
On Mon, Sep 21, 2015 at 9:49 AM, Arjan van de Ven  wrote:
> On 9/21/2015 9:36 AM, Linus Torvalds wrote:
>>
>> On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar  wrote:
>>>
>>>
>>> Linus, what's your preference?
>>
>>
>> So quite frankly, is there any reason we don't just implement
>> native_read_msr() as just
>>
>> unsigned long long native_read_msr(unsigned int msr)
>> {
>>int err;
>>unsigned long long val;
>>
>>val = native_read_msr_safe(msr, );
>>WARN_ON_ONCE(err);
>>return val;
>> }
>>
>> Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
>> done with it. I don't see the downside.
>>
>> How many msr reads are so critical that the function call
>> overhead would matter?
>
>
> if anything qualifies it'd be switch_to() and friends.

And maybe the KVM user return notifier.  Unfortunately, switch_to
might gain another two MSR accesses at some point if we decide to fix
the bugs in there.  Sigh.

>
> note that I'm not entirely happy about the notion of "safe" MSRs.
> They're safe as in "won't fault".
> Reading random MSRs isn't a generic safe operation though, but the name sort
> of gives people
> the impression that it is. Even with _safe variants, you still need to KNOW
> the MSR exists (by means
> of CPUID or similar) unfortunately.
>

I tend to agree.

Anyway, the fully out-of-line approach isn't obviously a bad idea, and
it simplifies the whole mess (we can drop most of the paravirt
patches, too).  I'll give it a try and see what happens.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-21 Thread Andy Lutomirski
On Mon, Sep 21, 2015 at 9:36 AM, Linus Torvalds
 wrote:
> On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar  wrote:
>>
>> Linus, what's your preference?
>
> So quite frankly, is there any reason we don't just implement
> native_read_msr() as just
>
>unsigned long long native_read_msr(unsigned int msr)
>{
>   int err;
>   unsigned long long val;
>
>   val = native_read_msr_safe(msr, );
>   WARN_ON_ONCE(err);
>   return val;
>}
>
> Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
> done with it. I don't see the downside.

In the interest of sanity, I want to drop the "native_", too, since
there appear to be few or no good use cases for native_read_msr as
such.  I'm tempted to add new functions read_msr and write_msr that
forward to rdmsrl_safe and wrmsrl_safe.

It looks like the msr helpers are every bit as bad as the TSC helpers
used to be :(

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-20 Thread Andy Lutomirski
This demotes an OOPS and likely panic due to a failed non-"safe" MSR
access to a WARN_ON_ONCE and a return of zero (in the RDMSR case).
We still write a pr_info entry unconditionally for debugging.

To be clear, this type of failure should *not* happen.  This patch
exists to minimize the chance of nasty undebuggable failures due on
systems that used to work due to a now-fixed CONFIG_PARAVIRT=y bug.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 arch/x86/kernel/traps.c | 55 +
 1 file changed, 55 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 346eec73f7db..f82987643e32 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -437,6 +437,58 @@ exit_trap:
do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
 }
 
+static bool paper_over_kernel_gpf(struct pt_regs *regs)
+{
+   /*
+* Try to decode the opcode that failed.  So far, we only care
+* about boring two-byte unprefixed opcodes, so we don't need
+* the full instruction decoder machinery.
+*/
+   u16 opcode;
+
+   if (probe_kernel_read(, (const void *)regs->ip, sizeof(opcode)))
+   return false;
+
+   if (opcode == 0x320f) {
+   /* RDMSR */
+   pr_info("bad kernel RDMSR from non-existent MSR 0x%x",
+   (unsigned int)regs->cx);
+   if (!panic_on_oops) {
+   WARN_ON_ONCE(true);
+
+   /*
+* Pretend that RDMSR worked and returned zero.  We
+* chose zero because zero seems less likely to
+* cause further malfunctions than any other value.
+*/
+   regs->ax = 0;
+   regs->dx = 0;
+   regs->ip += 2;
+   return true;
+   } else {
+   /* Don't fix it up. */
+   return false;
+   }
+   } else if (opcode == 0x300f) {
+   /* WRMSR */
+   pr_info("bad kernel WRMSR writing 0x%08x%08x to MSR 0x%x",
+   (unsigned int)regs->dx, (unsigned int)regs->ax,
+   (unsigned int)regs->cx);
+   if (!panic_on_oops) {
+   WARN_ON_ONCE(true);
+
+   /* Pretend it worked and carry on. */
+   regs->ip += 2;
+   return true;
+   } else {
+   /* Don't fix it up. */
+   return false;
+   }
+   }
+
+   return false;
+}
+
 dotraplinkage void
 do_general_protection(struct pt_regs *regs, long error_code)
 {
@@ -456,6 +508,9 @@ do_general_protection(struct pt_regs *regs, long error_code)
if (fixup_exception(regs))
return;
 
+   if (paper_over_kernel_gpf(regs))
+   return;
+
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_GP;
if (notify_die(DIE_GPF, "general protection fault", regs, 
error_code,
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] x86/msr: Set the return value to zero when native_rdmsr_safe fails

2015-09-20 Thread Andy Lutomirski
This will cause unchecked native_rdmsr_safe failures to return
deterministic results.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---
 arch/x86/include/asm/msr.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 77d8b284e4a7..9eda52205d42 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -73,7 +73,10 @@ static inline unsigned long long 
native_read_msr_safe(unsigned int msr,
asm volatile("2: rdmsr ; xor %[err],%[err]\n"
 "1:\n\t"
 ".section .fixup,\"ax\"\n\t"
-"3:  mov %[fault],%[err] ; jmp 1b\n\t"
+"3: mov %[fault],%[err]\n\t"
+"xorl %%eax, %%eax\n\t"
+"xorl %%edx, %%edx\n\t"
+"jmp 1b\n\t"
 ".previous\n\t"
 _ASM_EXTABLE(2b, 3b)
 : [err] "=r" (*err), EAX_EDX_RET(val, low, high)
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/2] x86/msr: MSR access failure changes

2015-09-20 Thread Andy Lutomirski
This applies on top of my earlier paravirt MSR series.

Changes from v1:
 - Return zero instead of poison on bad RDMSRs.

Andy Lutomirski (2):
  x86/msr: Carry on after a non-"safe" MSR access fails without
!panic_on_oops
  x86/msr: Set the return value to zero when native_rdmsr_safe fails

 arch/x86/include/asm/msr.h |  5 -
 arch/x86/kernel/traps.c| 55 ++
 2 files changed, 59 insertions(+), 1 deletion(-)

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-20 Thread Andy Lutomirski
On Sep 20, 2015 5:15 PM, "Linus Torvalds" <torva...@linux-foundation.org> wrote:
>
> On Sun, Sep 20, 2015 at 5:02 PM, Andy Lutomirski <l...@kernel.org> wrote:
> > This demotes an OOPS and likely panic due to a failed non-"safe" MSR
> > access to a WARN_ON_ONCE and a return of zero (in the RDMSR case).
> > We still write a pr_info entry unconditionally for debugging.
>
> No, this is wrong.
>
> If you really want to do something like this, then just make all MSR
> reads safe. So the only difference between "safe" and "unsafe" is that
> the unsafe version just doesn't check the return value, and silently
> just returns zero for reads (or writes nothing).
>
> To quote Obi-Wan: "Use the exception table, Luke".
>
> Because decoding instructions is just too ugly. We'll do it for CPU
> errata where we might have to do it for user space code too (ie the
> AMD prefetch mess), but for code that _we_ control? Hell no.
>
> So NAK on this.

My personal preference is to just not do this at all.  A couple people
disagree.  If we make the unsafe variants not oops, then I think we
want to have the nice loud warning, since these issues are bugs if
they happen.

We could certainly use the exception table for this, but it'll result
in bigger core, since each MSR access will need an exception table
entry and an associated fixup to call some helper that warns and sets
the result to zero.

I'd be happy to implement that, but only if it'll be applied.
Otherwise I'd rather just drop this patch and keep the rest of the
series.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rdmsr_safe in Linux PV (under Xen) gets an #GP:Re: [Fedora-xen] Running fedora xen on top of KVM?

2015-09-20 Thread Andy Lutomirski
On Fri, Sep 18, 2015 at 12:04 PM, Borislav Petkov <b...@alien8.de> wrote:
> On Fri, Sep 18, 2015 at 08:20:46AM -0700, Andy Lutomirski wrote:
>> In any event, Borislav, you must have typed rdmsr_safe for a reason :)
>
> Wasn't me:
>
> 6c62aa4a3c12 ("x86: make amd.c have 64bit support code")
>
> I think the error handling of rdmsrl_safe() was needed to do the pfn
> games which are done in the if-clause.

I just tried it.  rdmsrl_safe and friends definitely work fine in that
code.  I think that Linux's Xen startup code is buggy and fails to set
up early exception handling.

Try this (horribly whitespace damaged):

 static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 {
+   u64 tmp;
 #ifdef CONFIG_X86_64
c->x86_clflush_size = 64;
c->x86_phys_bits = 36;
@@ -752,6 +753,9 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
c->cpu_index = 0;
filter_cpuid_features(c, false);

+   pr_err("trying to crash\n");
+   rdmsrl_safe(0x12345678, );
+

It works fine.  I bet it crashes on a Xen guest, though.  I assume
that Xen just works in most cases by luck.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rdmsr_safe in Linux PV (under Xen) gets an #GP:Re: [Fedora-xen] Running fedora xen on top of KVM?

2015-09-18 Thread Andy Lutomirski
On Fri, Sep 18, 2015 at 6:54 AM, Borislav Petkov <b...@alien8.de> wrote:
> On Thu, Sep 17, 2015 at 01:23:31PM -0700, Andy Lutomirski wrote:
>> Cc: Borislav.  Is TSEG guaranteed to exist?  Can we defer that until
>
> Why not? It is the tseg base address.
>
> I think this is kvm injecting a #GP as it is not ignoring this MSR.
> Presumably modprobing kvm with "ignore_msrs=1" or so should hide the
> issue IIUC...

But the issue should be fixed, no?

Does anyone know why this affects Xen on KVM but not native Linux on
KVM?  Or is there something weird about the particular KVM
configuration?

In any event, Borislav, you must have typed rdmsr_safe for a reason :)
 Either rdmsr_safe in that context is broken on Xen and not native
(which is entirely possible -- I didn't check carefully) or the
rdmsr_safe isn't "safe" on any type of kernel.  Persumably either the
kernel or KVM (or both) should be fixed.

Given that we can handle fixups in the decompressor, surely it
wouldn't be so hard to make early GPF fixups work in the main kernel.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   >