[tip:x86/urgent] x86/hyper-v: Fix definition of HV_MAX_FLUSH_REP_COUNT

2019-02-28 Thread tip-bot for Lan Tianyu
Commit-ID:  9cd05ad2910b55238e3c720c99ad896dc538301b
Gitweb: https://git.kernel.org/tip/9cd05ad2910b55238e3c720c99ad896dc538301b
Author: Lan Tianyu 
AuthorDate: Mon, 25 Feb 2019 22:31:14 +0800
Committer:  Thomas Gleixner 
CommitDate: Thu, 28 Feb 2019 11:58:29 +0100

x86/hyper-v: Fix definition of HV_MAX_FLUSH_REP_COUNT

The max flush rep count of the HvFlushGuestPhysicalAddressList hypercall
equals the number of entries of union hv_gpa_page_range that can be
populated into the input parameter page.

The code lacks parentheses around PAGE_SIZE - 2 * sizeof(u64), which results
in bogus computations. Add them.

Fixes: cc4edae4b924 ("x86/hyper-v: Add HvFlushGuestAddressList hypercall support")
Signed-off-by: Lan Tianyu 
Signed-off-by: Thomas Gleixner 
Cc: k...@microsoft.com
Cc: haiya...@microsoft.com
Cc: sthem...@microsoft.com
Cc: sas...@kernel.org
Cc: b...@alien8.de
Cc: h...@zytor.com
Cc: gre...@linuxfoundation.org
Cc: de...@linuxdriverproject.org
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20190225143114.5149-1-tianyu@microsoft.com
---
 arch/x86/include/asm/hyperv-tlfs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 705dafc2d11a..2bdbbbcfa393 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -841,7 +841,7 @@ union hv_gpa_page_range {
  * count is equal with how many entries of union hv_gpa_page_range can
  * be populated into the input parameter page.
  */
-#define HV_MAX_FLUSH_REP_COUNT (PAGE_SIZE - 2 * sizeof(u64) /  \
+#define HV_MAX_FLUSH_REP_COUNT ((PAGE_SIZE - 2 * sizeof(u64)) /\
sizeof(union hv_gpa_page_range))
 
 struct hv_guest_mapping_flush_list {


Re: [Resend Patch] KVM/x86: Fix wrong macro references of X86_CR0_PG_BIT and X86_CR4_PAE_BIT in kvm_valid_sregs()

2018-01-18 Thread Lan Tianyu
Hi Eric:
Great thanks for your review.
On Thu, Jan 18, 2018 at 10:39:04AM -0800, Eric Biggers wrote:
> On Tue, Jan 16, 2018 at 05:34:07PM +0800, Tianyu Lan wrote:
> > kvm_valid_sregs() should use X86_CR0_PG and X86_CR4_PAE to check bit
> > status rather than X86_CR0_PG_BIT and X86_CR4_PAE_BIT. This patch is
> > to fix it.
> > 
> > Fixes: f29810335965a(KVM/x86: Check input paging mode when cs.l is set)
> > Reported-by: Jeremi Piotrowski 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Signed-off-by: Tianyu Lan 
> > ---
> > Sorry for noise. Missed kvm maillist.
> > 
> >  arch/x86/kvm/x86.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1cec2c6..c53298d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7496,13 +7496,13 @@ EXPORT_SYMBOL_GPL(kvm_task_switch);
> >  
> >  int kvm_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
> >  {
> > -   if ((sregs->efer & EFER_LME) && (sregs->cr0 & X86_CR0_PG_BIT)) {
> > +   if ((sregs->efer & EFER_LME) && (sregs->cr0 & X86_CR0_PG)) {
> > /*
> >  * When EFER.LME and CR0.PG are set, the processor is in
> >  * 64-bit mode (though maybe in a 32-bit code segment).
> >  * CR4.PAE and EFER.LMA must be set.
> >  */
> > -   if (!(sregs->cr4 & X86_CR4_PAE_BIT)
> > +   if (!(sregs->cr4 & X86_CR4_PAE)
> > || !(sregs->efer & EFER_LMA))
> > return -EINVAL;
> > } else {
> > -- 
> > 2.7.4
> > 
> 
> I came across this too and was just about to send the exact same patch.
> It looks good to me as long as the bits it's supposed to be checking were
> correct in the first place.  Patch title could maybe be shortened a bit,
> e.g. "KVM/x86: Fix references to CR0.PG and CR4.PAE in kvm_valid_sregs()".
> The "Fixes:" line is also formatted incorrectly.

That will be better; I will update.

> 
> Thanks,
> 
> Eric


[PATCH V2] KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel.

2017-12-22 Thread Lan Tianyu
Syzbot reports a crash in kvm_irqfd_assign() caused by a use-after-free,
because nothing prevents kvm_irqfd_assign() and kvm_irqfd_deassign() from
running in parallel for one specific eventfd. When the assign path hasn't
finished after the irqfd has been added to the kvm->irqfds.items list,
another thread may deassign the eventfd and free the struct
kvm_kernel_irqfd, so the assign path keeps using a struct kvm_kernel_irqfd
already freed by the deassign path. To avoid this, keep the irqfd
reference under kvm->irq_srcu protection after the irqfd is added to the
kvm->irqfds.items list, and call synchronize_srcu() in irqfd_shutdown() to
make sure the irqfd has been fully initialized in the assign path.

Reported-by: Dmitry Vyukov 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Dmitry Vyukov 
Signed-off-by: Tianyu Lan 
---
 virt/kvm/eventfd.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index a334399fafec..0abc2e9ddafb 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -119,8 +119,12 @@ irqfd_shutdown(struct work_struct *work)
 {
struct kvm_kernel_irqfd *irqfd =
container_of(work, struct kvm_kernel_irqfd, shutdown);
+   struct kvm *kvm = irqfd->kvm;
u64 cnt;
 
+   /* Make sure irqfd has been initialized in the assign path. */
+   synchronize_srcu(&kvm->irq_srcu);
+
/*
 * Synchronize with the wait-queue and unhook ourselves to prevent
 * further events.
@@ -387,7 +391,6 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 
idx = srcu_read_lock(&kvm->irq_srcu);
irqfd_update(kvm, irqfd);
-   srcu_read_unlock(&kvm->irq_srcu, idx);
 
list_add_tail(&irqfd->list, &kvm->irqfds.items);
 
@@ -421,6 +424,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
}
 #endif
 
+   srcu_read_unlock(&kvm->irq_srcu, idx);
return 0;
 
 fail:
-- 
2.15.1.433.g936d1b9.dirty



Re: [PATCH linux-next] KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()

2017-12-20 Thread Lan Tianyu
On 2017-12-21 08:30, Paolo Bonzini wrote:
> On 21/12/2017 01:24, Eric Biggers wrote:
>> From: Eric Biggers 
>>
>> Due to a bad merge resolution between commit f29810335965 ("KVM/x86:
>> Check input paging mode when cs.l is set") and commit b4ef9d4e8cb8
>> ("KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_sregs"),
>> there is a case in kvm_arch_vcpu_ioctl_set_sregs() where vcpu_put() is
>> not called after vcpu_get().  Fix it.
>>
>> Reported-by: syzbot 
>> Signed-off-by: Eric Biggers 
>> ---
>>  arch/x86/kvm/x86.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index ea3a98196753..f4e8b5089b28 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -7624,7 +7624,7 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu 
>> *vcpu,
>>  goto out;
>>  
>>  if (kvm_valid_sregs(vcpu, sregs))
>> -return -EINVAL;
>> +goto out;
>>  
>>  apic_base_msr.data = sregs->apic_base;
>>  apic_base_msr.host_initiated = true;
>>
> 
> Thanks very much Eric, that was fast!  Adding Stephen and the linux-next
> mailing list to Cc.  Adding the kvm/master tree has already paid off.
> 
> Paolo
> 

Hi Paolo:
Should we check the input sregs before loading the vcpu? If the input
sregs are invalid, loading the vcpu is redundant.
-- 
Best regards
Tianyu Lan


Re: [PATCH] KVM/Eventfd: Avoid crash when assign and deassign same eventfd in parallel.

2017-12-20 Thread Lan Tianyu
On 2017-12-19 18:41, Paolo Bonzini wrote:
> On 19/12/2017 07:35, Lan Tianyu wrote:
>> On 2017-12-18 16:50, Paolo Bonzini wrote:
>>> On 18/12/2017 09:30, David Hildenbrand wrote:
>>>> The ugly thing in kvm_irqfd_assign() is that we access irqfd without
>>>> holding a lock. I think that should rather be fixed than working around
>>>> that issue. (e.g. lock() -> lookup again -> verify still in list ->
>>>> unlock())
>>>
>>> I wonder if it's even simpler:
>>>
>>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>>> index f2ac53ab8243..17ed298bd66f 100644
>>> --- a/virt/kvm/eventfd.c
>>> +++ b/virt/kvm/eventfd.c
>>> @@ -387,7 +387,6 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd 
>>> *args)
>>>  
>>> idx = srcu_read_lock(&kvm->irq_srcu);
>>> irqfd_update(kvm, irqfd);
>>> -   srcu_read_unlock(&kvm->irq_srcu, idx);
>>>  
>>> list_add_tail(&irqfd->list, &kvm->irqfds.items);
>>>  
>>> @@ -420,10 +419,12 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd 
>>> *args)
>>> irqfd->consumer.token, ret);
>>> }
>>>  #endif
>>> +   srcu_read_unlock(&kvm->irq_srcu, idx);
>>>  
>>> return 0;
>>>  
>>>  fail:
>>> +   /* irq_srcu is *not* held here.  */
>>> if (irqfd->resampler)
>>> irqfd_resampler_shutdown(irqfd);
>>>  
>>>
>>
>> Hi Paolo:
>>  This patch still can't prevent struct irqfd from being freed in
>> irqfd_shutdown() while the irqfd is still being assigned. Once the irqfd
>> is added to the kvm->irqfds.items list, the deassign path can look it up
>> and free it.
> 
> You're right, that also has to be protected by SRCU.  So a new bool is
> needed (probably "active" more than "initialized") in order to replace
> list_del_init with list_del_rcu.
> 

Adding a new flag is sufficient to fix the crash issue. All list
operations on the kvm->irqfds.items list are already protected by
kvm->irqfds.lock. Do you mean we should use list_add_rcu()/list_del_rcu()
for kvm->irqfds.items?

-- 
Best regards
Tianyu Lan


Re: [PATCH] KVM/Eventfd: Avoid crash when assign and deassign same eventfd in parallel.

2017-12-18 Thread Lan Tianyu
On 2017-12-18 16:50, Paolo Bonzini wrote:
> On 18/12/2017 09:30, David Hildenbrand wrote:
>> The ugly thing in kvm_irqfd_assign() is that we access irqfd without
>> holding a lock. I think that should rather be fixed than working around
>> that issue. (e.g. lock() -> lookup again -> verify still in list ->
>> unlock())
> 
> I wonder if it's even simpler:
> 
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index f2ac53ab8243..17ed298bd66f 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -387,7 +387,6 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>  
>   idx = srcu_read_lock(&kvm->irq_srcu);
>   irqfd_update(kvm, irqfd);
> - srcu_read_unlock(&kvm->irq_srcu, idx);
>  
>   list_add_tail(&irqfd->list, &kvm->irqfds.items);
>  
> @@ -420,10 +419,12 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd 
> *args)
>   irqfd->consumer.token, ret);
>   }
>  #endif
> + srcu_read_unlock(&kvm->irq_srcu, idx);
>  
>   return 0;
>  
>  fail:
> + /* irq_srcu is *not* held here.  */
>   if (irqfd->resampler)
>   irqfd_resampler_shutdown(irqfd);
>  
> 

Hi Paolo:
    This patch still can't prevent struct irqfd from being freed in
irqfd_shutdown() while the irqfd is still being assigned. Once the irqfd
is added to the kvm->irqfds.items list, the deassign path can look it up
and free it.
-- 
Best regards
Tianyu Lan


Re: [PATCH] KVM/Eventfd: Avoid crash when assign and deassign same eventfd in parallel.

2017-12-18 Thread Lan Tianyu
Hi David:
Thanks for your review.

On 2017-12-18 16:30, David Hildenbrand wrote:
> On 18.12.2017 00:40, Lan Tianyu wrote:
>> Syzbot reports a crash in kvm_irqfd_assign() caused by a use-after-free,
>> because nothing prevents kvm_irqfd_assign() and kvm_irqfd_deassign() from
>> running in parallel for the same eventfd. When the assign path hasn't
>> finished after the irqfd has been added to the kvm->irqfds.items list,
>> another thread may deassign the eventfd and free the struct
>> kvm_kernel_irqfd, so the assign path keeps using memory already freed by
>> the deassign path. To avoid this, add an "initialized" flag to struct
>> kvm_kernel_irqfd and check it before deactivating the irqfd. If the irqfd
>> is still being initialized, the deassign path returns -EFAULT.
>>
>> Reported-by: Dmitry Vyukov 
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Cc: Dmitry Vyukov 
>> Cc: Wanpeng Li 
>> Signed-off-by: Tianyu Lan 
>> ---
>>  include/linux/kvm_irqfd.h |  1 +
>>  virt/kvm/eventfd.c| 11 +--
>>  2 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
>> index 76c2fbc..be6b254 100644
>> --- a/include/linux/kvm_irqfd.h
>> +++ b/include/linux/kvm_irqfd.h
>> @@ -66,6 +66,7 @@ struct kvm_kernel_irqfd {
>>  struct work_struct shutdown;
>>  struct irq_bypass_consumer consumer;
>>  struct irq_bypass_producer *producer;
>> +u8 initialized:1;
>>  };
>>  
>>  #endif /* __LINUX_KVM_IRQFD_H */
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index a334399..80f06e6 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -421,6 +421,7 @@ int  __attribute__((weak)) kvm_arch_update_irqfd_routing(
>>  }
>>  #endif
>>  
>> +irqfd->initialized = 1;
> 
> The ugly thing in kvm_irqfd_assign() is that we access irqfd without
> holding a lock. I think that should rather be fixed than working around
> that issue. (e.g. lock() -> lookup again -> verify still in list ->
> unlock())

The new lock would have to be held across the whole assign path;
otherwise we would need to look up the irqfds list repeatedly, right?

At first, I tried to use a mutex between the assign and deassign paths,
but the assign path already takes several locks, and adding a new one
might introduce a deadlock. So I used a flag check instead of a new lock.

> 
>>  return 0;
>>  
>>  fail:
>> @@ -525,6 +526,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>>  {
> 
> Which tool are you using to generate diffs? git format-patch?

Yes, I used git version 1.8.3.1 :)
This also confused me. I will try a newer version.

> 
> Mentioning, because the indicated function here  (kvm_irqfd_deassign)
> 
>>  struct kvm_kernel_irqfd *irqfd, *tmp;
>>  struct eventfd_ctx *eventfd;
>> +int ret = 0;
>>  
>>  eventfd = eventfd_ctx_fdget(args->fd);
>>  if (IS_ERR(eventfd))
>> @@ -543,7 +545,12 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>>  write_seqcount_begin(&irqfd->irq_entry_sc);
>>  irqfd->irq_entry.type = 0;
>>  write_seqcount_end(&irqfd->irq_entry_sc);
>> -irqfd_deactivate(irqfd);
>> +
>> +if (irqfd->initialized)
>> +irqfd_deactivate(irqfd);
>> +else
>> +ret = -EFAULT;
>> +
>>  }
>>  }
>>  
>> @@ -557,7 +564,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>>   */
> 
> and here are wrong and misleading (kvm_irqfd_deassig). (and just noticed
> also in the second hunk)
> 
>>  flush_workqueue(irqfd_cleanup_wq);
>>  
>> -return 0;
>> +return ret;
>>  }
>>  
>>  int
>>
> 
> 


-- 
Best regards
Tianyu Lan


[PATCH] KVM/Eventfd: Avoid crash when assign and deassign same eventfd in parallel.

2017-12-17 Thread Lan Tianyu
Syzbot reports a crash in kvm_irqfd_assign() caused by a use-after-free,
because nothing prevents kvm_irqfd_assign() and kvm_irqfd_deassign() from
running in parallel for the same eventfd. When the assign path hasn't
finished after the irqfd has been added to the kvm->irqfds.items list,
another thread may deassign the eventfd and free the struct
kvm_kernel_irqfd, so the assign path keeps using memory already freed by
the deassign path. To avoid this, add an "initialized" flag to struct
kvm_kernel_irqfd and check it before deactivating the irqfd. If the irqfd
is still being initialized, the deassign path returns -EFAULT.

Reported-by: Dmitry Vyukov 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Dmitry Vyukov 
Cc: Wanpeng Li 
Signed-off-by: Tianyu Lan 
---
 include/linux/kvm_irqfd.h |  1 +
 virt/kvm/eventfd.c| 11 +--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
index 76c2fbc..be6b254 100644
--- a/include/linux/kvm_irqfd.h
+++ b/include/linux/kvm_irqfd.h
@@ -66,6 +66,7 @@ struct kvm_kernel_irqfd {
struct work_struct shutdown;
struct irq_bypass_consumer consumer;
struct irq_bypass_producer *producer;
+   u8 initialized:1;
 };
 
 #endif /* __LINUX_KVM_IRQFD_H */
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index a334399..80f06e6 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -421,6 +421,7 @@ int  __attribute__((weak)) kvm_arch_update_irqfd_routing(
}
 #endif
 
+   irqfd->initialized = 1;
return 0;
 
 fail:
@@ -525,6 +526,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 {
struct kvm_kernel_irqfd *irqfd, *tmp;
struct eventfd_ctx *eventfd;
+   int ret = 0;
 
eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
@@ -543,7 +545,12 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
write_seqcount_begin(&irqfd->irq_entry_sc);
irqfd->irq_entry.type = 0;
write_seqcount_end(&irqfd->irq_entry_sc);
-   irqfd_deactivate(irqfd);
+
+   if (irqfd->initialized)
+   irqfd_deactivate(irqfd);
+   else
+   ret = -EFAULT;
+
}
}
 
@@ -557,7 +564,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 */
flush_workqueue(irqfd_cleanup_wq);
 
-   return 0;
+   return ret;
 }
 
 int
-- 
1.8.3.1



Re: Re: KASAN: use-after-free Write in irq_bypass_register_consumer

2017-12-16 Thread Lan, Tianyu

The root cause is that kvm_irqfd_assign() and kvm_irqfd_deassign() can
run in parallel for the same eventfd, although they must not. Some data
structures (e.g., irqfd->consumer) get corrupted because the irqfd may be
freed in the deassign path before it is used in the assign path, and
other data may be used in the deassign path before it is initialized.
The syzbot test hit such a case. Adding a mutex between
kvm_irqfd_assign() and kvm_irqfd_deassign() can fix the issue. Will send
a patch to fix it.

On 12/16/2017 12:53 PM, Tianyu Lan wrote:

I reproduced the issue. Will have a look.

--
Best regards
Tianyu Lan

2017-12-15 18:14 GMT+08:00 syzbot :

syzkaller has found reproducer for the following crash on
82bcf1def3b5f1251177ad47c44f7e17af039b4b
git://git.cmpxchg.org/linux-mmots.git/master
compiler: gcc (GCC) 7.1.1 20170620
.config is attached
Raw console output is attached.
C reproducer is attached
syzkaller reproducer is attached. See https://goo.gl/kgGztJ
for information about syzkaller reproducers


==
BUG: KASAN: use-after-free in __list_add include/linux/list.h:64 [inline]
BUG: KASAN: use-after-free in list_add include/linux/list.h:79 [inline]
BUG: KASAN: use-after-free in irq_bypass_register_consumer+0x4b4/0x500
virt/lib/irqbypass.c:217
Write of size 8 at addr 8801cdf51180 by task syzkaller436086/15031

CPU: 1 PID: 15031 Comm: syzkaller436086 Not tainted 4.15.0-rc2-mm1+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x257 lib/dump_stack.c:53
  print_address_description+0x73/0x250 mm/kasan/report.c:252
  kasan_report_error mm/kasan/report.c:351 [inline]
  kasan_report+0x25b/0x340 mm/kasan/report.c:409
  __asan_report_store8_noabort+0x17/0x20 mm/kasan/report.c:435
  __list_add include/linux/list.h:64 [inline]
  list_add include/linux/list.h:79 [inline]
  irq_bypass_register_consumer+0x4b4/0x500 virt/lib/irqbypass.c:217
  kvm_irqfd_assign arch/x86/kvm/../../../virt/kvm/eventfd.c:417 [inline]
  kvm_irqfd+0x137f/0x1d50 arch/x86/kvm/../../../virt/kvm/eventfd.c:572
  kvm_vm_ioctl+0x1079/0x1c40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2992
  vfs_ioctl fs/ioctl.c:46 [inline]
  do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:686
  SYSC_ioctl fs/ioctl.c:701 [inline]
  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
  entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x44d379
RSP: 002b:7fc5ff9a9d08 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 7fc5ff9aa700 RCX: 0044d379
RDX: 20080fe0 RSI: 4020ae76 RDI: 0005
RBP: 007ff900 R08: 7fc5ff9aa700 R09: 7fc5ff9aa700
R10: 7fc5ff9aa700 R11: 0246 R12: 
R13: 007ff8ff R14: 7fc5ff9aa9c0 R15: 

Allocated by task 15031:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
  set_track mm/kasan/kasan.c:459 [inline]
  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
  kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3614
  kmalloc include/linux/slab.h:516 [inline]
  kzalloc include/linux/slab.h:705 [inline]
  kvm_irqfd_assign arch/x86/kvm/../../../virt/kvm/eventfd.c:296 [inline]
  kvm_irqfd+0x16c/0x1d50 arch/x86/kvm/../../../virt/kvm/eventfd.c:572
  kvm_vm_ioctl+0x1079/0x1c40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2992
  vfs_ioctl fs/ioctl.c:46 [inline]
  do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:686
  SYSC_ioctl fs/ioctl.c:701 [inline]
  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
  entry_SYSCALL_64_fastpath+0x1f/0x96

Freed by task 1402:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
  set_track mm/kasan/kasan.c:459 [inline]
  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
  __cache_free mm/slab.c:3492 [inline]
  kfree+0xca/0x250 mm/slab.c:3807
  irqfd_shutdown+0x13c/0x1a0 arch/x86/kvm/../../../virt/kvm/eventfd.c:148
  process_one_work+0xbfd/0x1bc0 kernel/workqueue.c:2113
  worker_thread+0x223/0x1990 kernel/workqueue.c:2247
  kthread+0x37a/0x440 kernel/kthread.c:238
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:524

The buggy address belongs to the object at 8801cdf51000
  which belongs to the cache kmalloc-512 of size 512
The buggy address is located 384 bytes inside of
  512-byte region [8801cdf51000, 8801cdf51200)
The buggy address belongs to the page:
page:d08a0d63 count:1 mapcount:0 mapping:d54c7be6 index:0x0
flags: 0x2fffc000100(slab)
raw: 02fffc000100 8801cdf51000  00010006
raw: ea00073a7660 ea000737f3a0 8801dac00940 
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  8801cdf51080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  8801cdf51100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

8801cdf51180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

^
  8801cdf51200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  8801cdf51280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Re: [PATCH V2] KVM/x86: Check input paging mode when cs.l is set

2017-12-14 Thread Lan, Tianyu



On 12/14/2017 7:41 PM, Paolo Bonzini wrote:

On 14/12/2017 04:55, Lan Tianyu wrote:

+* When EFER.LME and CR0.PG are set, CR4.PAE and EFER.LMA
+* must be set.
+*/
+   if ((sregs->efer & EFER_LME) && (sregs->cr0 & X86_CR0_PG_BIT)) {
+   if (!(sregs->cr4 & X86_CR4_PAE_BIT))
+   return -EINVAL;
+   if (!(sregs->efer & EFER_LMA))
+   return -EINVAL;
+   } else if (sregs->efer & EFER_LMA)


This can just be "(sregs->efer & EFER_LMA) || sregs->cs.l", making the
next "if" redundant.  Even better written as

if ((sregs->efer & EFER_LME) && (sregs->cr0 & X86_CR0_PG_BIT)) {
/*
 * When EFER.LME and CR0.PG are set, the processor is in
 * 64-bit mode (though maybe in a 32-bit code segment).
 * CR4.PAE and EFER.LMA must be set.
 */
if (... || ...)
return -EINVAL;
} else {
/*
 * Not in 64-bit mode: EFER.LMA is clear and the code
 * segment cannot be 64-bit.
 */
if (... || ...)
return -EINVAL;
}

Paolo


OK, thanks for your guidance. I just sent v3; please have a look.


[PATCH V3] KVM/x86: Check input paging mode when cs.l is set

2017-12-14 Thread Lan Tianyu
Reported by syzkaller:
WARNING: CPU: 0 PID: 27962 at arch/x86/kvm/emulate.c:5631 
x86_emulate_insn+0x557/0x15f0 [kvm]
Modules linked in: kvm_intel kvm [last unloaded: kvm]
CPU: 0 PID: 27962 Comm: syz-executor Tainted: GB   W
4.15.0-rc2-next-20171208+ #32
Hardware name: Intel Corporation S1200SP/S1200SP, BIOS 
S1200SP.86B.01.03.0006.040720161253 04/07/2016
RIP: 0010:x86_emulate_insn+0x557/0x15f0 [kvm]
RSP: 0018:8807234476d0 EFLAGS: 00010282
RAX:  RBX: 88072d0237a0 RCX: a0065c4d
RDX: 1100e5a046f9 RSI: 0003 RDI: 88072d0237c8
RBP: 880723447728 R08: 88072d02 R09: a008d240
R10: 0002 R11: ed00e7d87db3 R12: 88072d0237c8
R13: 88072d023870 R14: 88072d0238c2 R15: a008d080
FS:  7f8a68666700() GS:88080220() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 2009506c CR3: 00071fec4005 CR4: 003626f0
Call Trace:
 x86_emulate_instruction+0x3bc/0xb70 [kvm]
 ? reexecute_instruction.part.162+0x130/0x130 [kvm]
 vmx_handle_exit+0x46d/0x14f0 [kvm_intel]
 ? trace_event_raw_event_kvm_entry+0xe7/0x150 [kvm]
 ? handle_vmfunc+0x2f0/0x2f0 [kvm_intel]
 ? wait_lapic_expire+0x25/0x270 [kvm]
 vcpu_enter_guest+0x720/0x1ef0 [kvm]
 ...

When CS.L is set, the vcpu should run in 64-bit paging mode.
The current kvm set_sregs path doesn't perform such a check when
userspace provides sreg values, which leads to unexpected behavior.
This patch adds checks for CS.L, EFER.LME, EFER.LMA and
CR4.PAE when sreg inputs are received from userspace in order to
avoid unexpected behavior.

Suggested-by: Paolo Bonzini 
Reported-by: Dmitry Vyukov 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Dmitry Vyukov 
Cc: Jim Mattson 
Signed-off-by: Tianyu Lan 
---
Change since v2:
   Change coding style.
Change since v1:
   Add check for EFER.LME, EFER.LMA and CR4.PAE.
---
 arch/x86/kvm/x86.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e1e617..9eb1217 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7485,6 +7485,29 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 
tss_selector, int idt_index,
 }
 EXPORT_SYMBOL_GPL(kvm_task_switch);
 
+int kvm_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+{
+   if ((sregs->efer & EFER_LME) && (sregs->cr0 & X86_CR0_PG_BIT)) {
+   /*
+* When EFER.LME and CR0.PG are set, the processor is in
+* 64-bit mode (though maybe in a 32-bit code segment).
+* CR4.PAE and EFER.LMA must be set.
+*/
+   if (!(sregs->cr4 & X86_CR4_PAE_BIT)
+   || !(sregs->efer & EFER_LMA))
+   return -EINVAL;
+   } else {
+   /*
+* Not in 64-bit mode: EFER.LMA is clear and the code
+* segment cannot be 64-bit.
+*/
+   if (sregs->efer & EFER_LMA || sregs->cs.l)
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
  struct kvm_sregs *sregs)
 {
@@ -7497,6 +7520,9 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
(sregs->cr4 & X86_CR4_OSXSAVE))
return -EINVAL;
 
+   if (kvm_valid_sregs(vcpu, sregs))
+   return -EINVAL;
+
apic_base_msr.data = sregs->apic_base;
apic_base_msr.host_initiated = true;
if (kvm_set_apic_base(vcpu, &apic_base_msr))
-- 
1.8.3.1



[PATCH V2] KVM/x86: Check input paging mode when cs.l is set

2017-12-14 Thread Lan Tianyu
Reported by syzkaller:
WARNING: CPU: 0 PID: 27962 at arch/x86/kvm/emulate.c:5631 
x86_emulate_insn+0x557/0x15f0 [kvm]
Modules linked in: kvm_intel kvm [last unloaded: kvm]
CPU: 0 PID: 27962 Comm: syz-executor Tainted: GB   W
4.15.0-rc2-next-20171208+ #32
Hardware name: Intel Corporation S1200SP/S1200SP, BIOS 
S1200SP.86B.01.03.0006.040720161253 04/07/2016
RIP: 0010:x86_emulate_insn+0x557/0x15f0 [kvm]
RSP: 0018:8807234476d0 EFLAGS: 00010282
RAX:  RBX: 88072d0237a0 RCX: a0065c4d
RDX: 1100e5a046f9 RSI: 0003 RDI: 88072d0237c8
RBP: 880723447728 R08: 88072d02 R09: a008d240
R10: 0002 R11: ed00e7d87db3 R12: 88072d0237c8
R13: 88072d023870 R14: 88072d0238c2 R15: a008d080
FS:  7f8a68666700() GS:88080220() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 2009506c CR3: 00071fec4005 CR4: 003626f0
Call Trace:
 x86_emulate_instruction+0x3bc/0xb70 [kvm]
 ? reexecute_instruction.part.162+0x130/0x130 [kvm]
 vmx_handle_exit+0x46d/0x14f0 [kvm_intel]
 ? trace_event_raw_event_kvm_entry+0xe7/0x150 [kvm]
 ? handle_vmfunc+0x2f0/0x2f0 [kvm_intel]
 ? wait_lapic_expire+0x25/0x270 [kvm]
 vcpu_enter_guest+0x720/0x1ef0 [kvm]
 ...

When CS.L is set, the vcpu should run in 64-bit paging mode.
The current kvm set_sregs path doesn't perform such a check when
userspace provides sreg values, which leads to unexpected behavior.
This patch adds checks for CS.L, EFER.LME, EFER.LMA and
CR4.PAE when sreg inputs are received from userspace in order to
avoid unexpected behavior.

Suggested-by: Paolo Bonzini 
Reported-by: Dmitry Vyukov 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Dmitry Vyukov 
Cc: Jim Mattson 
Signed-off-by: Tianyu Lan 
---
Change since v1:
   Add checks for EFER.LME, EFER.LMA and CR4.PAE to ensure that if
EFER.LME=1 and CR0.PG=1, then EFER.LMA and CR4.PAE are also 1.

---
 arch/x86/kvm/x86.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e1e617..8aca950 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7485,6 +7485,30 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 
tss_selector, int idt_index,
 }
 EXPORT_SYMBOL_GPL(kvm_task_switch);
 
+int kvm_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+{
+   /*
+* When EFER.LME and CR0.PG are set, CR4.PAE and EFER.LMA
+* must be set.
+*/
+   if ((sregs->efer & EFER_LME) && (sregs->cr0 & X86_CR0_PG_BIT)) {
+   if (!(sregs->cr4 & X86_CR4_PAE_BIT))
+   return -EINVAL;
+   if (!(sregs->efer & EFER_LMA))
+   return -EINVAL;
+   } else if (sregs->efer & EFER_LMA)
+   return -EINVAL;
+
+   /* When CS.L is set, vcpu should run in 64-bit mode. */
+   if (sregs->cs.l)
+   if (!((sregs->cr0 & X86_CR0_PG_BIT) &&
+ (sregs->cr4 & X86_CR4_PAE_BIT) &&
+ (sregs->efer & EFER_LME)))
+   return -EINVAL;
+
+   return 0;
+}
+
 int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
  struct kvm_sregs *sregs)
 {
@@ -7497,6 +7521,9 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
(sregs->cr4 & X86_CR4_OSXSAVE))
return -EINVAL;
 
+   if (kvm_valid_sregs(vcpu, sregs))
+   return -EINVAL;
+
apic_base_msr.data = sregs->apic_base;
apic_base_msr.host_initiated = true;
if (kvm_set_apic_base(vcpu, &apic_base_msr))
-- 
1.8.3.1



Re: [PATCH 1/1] KVM/x86: Check input paging mode when cs.l is set

2017-12-13 Thread Lan Tianyu
On 2017-12-13 20:20, Paolo Bonzini wrote:
> On 13/12/2017 05:17, Lan Tianyu wrote:
>> Reported by syzkaller:
>> WARNING: CPU: 0 PID: 27962 at arch/x86/kvm/emulate.c:5631 
>> x86_emulate_insn+0x557/0x15f0 [kvm]
>> Modules linked in: kvm_intel kvm [last unloaded: kvm]
>> CPU: 0 PID: 27962 Comm: syz-executor Tainted: GB   W
>> 4.15.0-rc2-next-20171208+ #32
>> Hardware name: Intel Corporation S1200SP/S1200SP, BIOS 
>> S1200SP.86B.01.03.0006.040720161253 04/07/2016
>> RIP: 0010:x86_emulate_insn+0x557/0x15f0 [kvm]
>> RSP: 0018:8807234476d0 EFLAGS: 00010282
>> RAX:  RBX: 88072d0237a0 RCX: a0065c4d
>> RDX: 1100e5a046f9 RSI: 0003 RDI: 88072d0237c8
>> RBP: 880723447728 R08: 88072d02 R09: a008d240
>> R10: 0002 R11: ed00e7d87db3 R12: 88072d0237c8
>> R13: 88072d023870 R14: 88072d0238c2 R15: a008d080
>> FS:  7f8a68666700() GS:88080220() 
>> knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 2009506c CR3: 00071fec4005 CR4: 003626f0
>> Call Trace:
>>  x86_emulate_instruction+0x3bc/0xb70 [kvm]
>>  ? reexecute_instruction.part.162+0x130/0x130 [kvm]
>>  vmx_handle_exit+0x46d/0x14f0 [kvm_intel]
>>  ? trace_event_raw_event_kvm_entry+0xe7/0x150 [kvm]
>>  ? handle_vmfunc+0x2f0/0x2f0 [kvm_intel]
>>  ? wait_lapic_expire+0x25/0x270 [kvm]
>>  vcpu_enter_guest+0x720/0x1ef0 [kvm]
>>  ...
>>
>> When cs.l is set, the vcpu should run in 64-bit paging mode.
>> The current kvm set_sregs path doesn't perform such a check when userspace
>> provides sreg values, which leads to unexpected behavior. This patch
>> adds such a check.
>>
>> Suggested-by: Paolo Bonzini 
>> Reported-by: Dmitry Vyukov 
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Cc: Dmitry Vyukov 
>> Cc: Jim Mattson 
>> Signed-off-by: Tianyu Lan 
>> Signed-off-by: Lan Tianyu 
>> ---
>>  arch/x86/kvm/x86.c | 15 +++
>>  1 file changed, 15 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 1e1e617..07d6d6c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -7485,6 +7485,18 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 
>> tss_selector, int idt_index,
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_task_switch);
>>  
>> +int kvm_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
>> +{
>> +/* When cs.l is set, vcpu should run in 64-bit mode */
>> +if (sregs->cs.l)
>> +if (!((sregs->cr0 & X86_CR0_PG_BIT) &&
>> +  (sregs->cr4 & X86_CR4_PAE_BIT) &&
>> +  (sregs->efer & EFER_LME)))
>> +return -EINVAL;
>> +
>> +return 0;
>> +}
>> +
>>  int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>>struct kvm_sregs *sregs)
>>  {
>> @@ -7497,6 +7509,9 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu 
>> *vcpu,
>>  (sregs->cr4 & X86_CR4_OSXSAVE))
>>  return -EINVAL;
>>  
>> +if (kvm_valid_sregs(vcpu, sregs))
>> +return -EINVAL;
>> +
>>  apic_base_msr.data = sregs->apic_base;
>>  apic_base_msr.host_initiated = true;
>>  if (kvm_set_apic_base(vcpu, &apic_base_msr))
>>
> 
> Thanks, this is the right direction.  However, there are many more
> checks that are missing and can trigger more unexpected behavior.

Yes, we may add these checks step by step.

>  In
> particular:
> 
> - if EFER.LME=1 and CR0.PG=1, EFER.LMA and CR4.PAE must be 1
> 
> - otherwise, EFER.LMA and CS.L must be 0

I will add these in the new version. Thanks.

-- 
Best regards
Tianyu Lan


[PATCH 1/1] KVM/x86: Check input paging mode when cs.l is set

2017-12-13 Thread Lan Tianyu
Reported by syzkaller:
WARNING: CPU: 0 PID: 27962 at arch/x86/kvm/emulate.c:5631 
x86_emulate_insn+0x557/0x15f0 [kvm]
Modules linked in: kvm_intel kvm [last unloaded: kvm]
CPU: 0 PID: 27962 Comm: syz-executor Tainted: GB   W
4.15.0-rc2-next-20171208+ #32
Hardware name: Intel Corporation S1200SP/S1200SP, BIOS 
S1200SP.86B.01.03.0006.040720161253 04/07/2016
RIP: 0010:x86_emulate_insn+0x557/0x15f0 [kvm]
RSP: 0018:8807234476d0 EFLAGS: 00010282
RAX:  RBX: 88072d0237a0 RCX: a0065c4d
RDX: 1100e5a046f9 RSI: 0003 RDI: 88072d0237c8
RBP: 880723447728 R08: 88072d02 R09: a008d240
R10: 0002 R11: ed00e7d87db3 R12: 88072d0237c8
R13: 88072d023870 R14: 88072d0238c2 R15: a008d080
FS:  7f8a68666700() GS:88080220() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 2009506c CR3: 00071fec4005 CR4: 003626f0
Call Trace:
 x86_emulate_instruction+0x3bc/0xb70 [kvm]
 ? reexecute_instruction.part.162+0x130/0x130 [kvm]
 vmx_handle_exit+0x46d/0x14f0 [kvm_intel]
 ? trace_event_raw_event_kvm_entry+0xe7/0x150 [kvm]
 ? handle_vmfunc+0x2f0/0x2f0 [kvm_intel]
 ? wait_lapic_expire+0x25/0x270 [kvm]
 vcpu_enter_guest+0x720/0x1ef0 [kvm]
 ...

When cs.l is set, the vcpu should run in 64-bit paging mode.
The current kvm set_sregs function doesn't have such a check when
userspace inputs sreg values, which leads to unexpected behavior.
This patch adds such a check.

Suggested-by: Paolo Bonzini 
Reported-by: Dmitry Vyukov 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Dmitry Vyukov 
Cc: Jim Mattson 
Signed-off-by: Tianyu Lan 
Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/x86.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e1e617..07d6d6c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7485,6 +7485,18 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 
tss_selector, int idt_index,
 }
 EXPORT_SYMBOL_GPL(kvm_task_switch);
 
+int kvm_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+{
+   /* When cs.l is set, vcpu should run in 64-bit mode */
+   if (sregs->cs.l)
+   if (!((sregs->cr0 & X86_CR0_PG) &&
+ (sregs->cr4 & X86_CR4_PAE) &&
+ (sregs->efer & EFER_LME)))
+   return -EINVAL;
+
+   return 0;
+}
+
 int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
  struct kvm_sregs *sregs)
 {
@@ -7497,6 +7509,9 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
(sregs->cr4 & X86_CR4_OSXSAVE))
return -EINVAL;
 
+   if (kvm_valid_sregs(vcpu, sregs))
+   return -EINVAL;
+
apic_base_msr.data = sregs->apic_base;
apic_base_msr.host_initiated = true;
if (kvm_set_apic_base(vcpu, &apic_base_msr))
-- 
1.8.3.1



Re: Re: WARNING in x86_emulate_insn

2017-12-12 Thread Lan Tianyu
On 2017年12月12日 06:45, Paolo Bonzini wrote:
> On 08/12/2017 09:28, Tianyu Lan wrote:
>> I find this is a pop instruction emulation issue. According to the SDM
>> Vol. 2, chapter "INSTRUCTION SET REFERENCE", "POP—Pop a Value from the
>> Stack":
>>
>> Protected Mode Exceptions
>> #GP(0) If attempt is made to load SS register with NULL segment selector.
> 
> This is not what the testcase is testing; this is already covered by 
> __load_segment_descriptor:
> 
> if (null_selector) {
> if (seg == VCPU_SREG_CS || seg == VCPU_SREG_TR)
> goto exception;
> 
> if (seg == VCPU_SREG_SS) {
> if (ctxt->mode != X86EMUL_MODE_PROT64 || rpl != cpl)
> goto exception;
>   ...
>   }

Yes, __load_segment_descriptor() does such a check, but I find that
em_pop() doesn't load the SS segment, and SS isn't loaded before calling
em_pop() in the test case. Should this be fixed?

> 
> Is there a path that can return X86EMUL_PROPAGATE_FAULT without setting
> ctxt->exception.vector and/or without going through emulate_exception?
> 
> I don't think it's possible to write a test in kvm-unit-tests, because the
> state has "impossible" segment descriptor cache contents.

Sent out a fix patch for the issue. Please have a look. Thanks.
https://marc.info/?l=kvm&m=151306208214733&w=2

> 
> Paolo
> 
>> This test case hits it, but the current code doesn't check this case.
>> The following patch can fix the issue.
>>
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index abe74f7..e2ac5cc 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -1844,6 +1844,9 @@ static int emulate_pop(struct x86_emulate_ctxt *ctxt,
>> int rc;
>> struct segmented_address addr;
>>
>> +   if (!get_segment_selector(ctxt, VCPU_SREG_SS))
>> +   return emulate_gp(ctxt, 0);
>> +
>> addr.ea = reg_read(ctxt, VCPU_REGS_RSP) & stack_mask(ctxt);
>> addr.seg = VCPU_SREG_SS;
>> rc = segmented_read(ctxt, addr, dest, len);
> 


-- 
Best regards
Tianyu Lan


Re: Re: [PATCH] KVM/Emulate: Mask linear address with actual address width in order to avoid conflict with UNMAPPED_GVA

2017-12-11 Thread Lan Tianyu
On 2017年12月12日 14:42, Wanpeng Li wrote:
> 2017-12-12 7:23 GMT+08:00 Lan Tianyu :
>>
>> Reported by syzkaller:
>> WARNING: CPU: 0 PID: 27962 at arch/x86/kvm/emulate.c:5631 
>> x86_emulate_insn+0x557/0x15f0 [kvm]
>> Modules linked in: kvm_intel kvm [last unloaded: kvm]
>> CPU: 0 PID: 27962 Comm: syz-executor Tainted: GB   W
>> 4.15.0-rc2-next-20171208+ #32
>> Hardware name: Intel Corporation S1200SP/S1200SP, BIOS 
>> S1200SP.86B.01.03.0006.040720161253 04/07/2016
>> RIP: 0010:x86_emulate_insn+0x557/0x15f0 [kvm]
>> RSP: 0018:8807234476d0 EFLAGS: 00010282
>> RAX:  RBX: 88072d0237a0 RCX: a0065c4d
>> RDX: 1100e5a046f9 RSI: 0003 RDI: 88072d0237c8
>> RBP: 880723447728 R08: 88072d02 R09: a008d240
>> R10: 0002 R11: ed00e7d87db3 R12: 88072d0237c8
>> R13: 88072d023870 R14: 88072d0238c2 R15: a008d080
>> FS:  7f8a68666700() GS:88080220() 
>> knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 2009506c CR3: 00071fec4005 CR4: 003626f0
>> Call Trace:
>>  x86_emulate_instruction+0x3bc/0xb70 [kvm]
>>  ? reexecute_instruction.part.162+0x130/0x130 [kvm]
>>  vmx_handle_exit+0x46d/0x14f0 [kvm_intel]
>>  ? trace_event_raw_event_kvm_entry+0xe7/0x150 [kvm]
>>  ? handle_vmfunc+0x2f0/0x2f0 [kvm_intel]
>>  ? wait_lapic_expire+0x25/0x270 [kvm]
>>  vcpu_enter_guest+0x720/0x1ef0 [kvm]
>>  ...
>>
>> Syzkaller tests the KVM emulation code path by setting RSP to 0x
>> and running in 64-bit non-paging mode. It triggers a warning that the
>> exception vector is > 0x1f (it is initialized to 0xff in
>> x86_emulate_instruction()) during emulation of a pop instruction. This
>> is because the pop emulation callback em_pop() returns
>> X86EMUL_PROPAGATE_FAULT without populating the exception vector. The
>> POP emulation code accesses RSP with linear address 0x, and the
>> KVM MMU returns the GVA as the GPA in non-paging mode. In this case,
>> GPA 0x (~0) conflicts with the error code UNMAPPED_GVA, which is
>> defined as (~(u64)0). The caller vcpu_mmio_gva_to_gpa() treats the
>> returned address as an error, which also causes
>> emulator_read_write_onepage() to return X86EMUL_PROPAGATE_FAULT
>> without a valid exception vector.
>>
>> This patch masks the linear address with the actual address width to
>> fix the issue, since a linear address doesn't occupy all 64 bits even
>> in 5-level paging mode.
>>
>> Reported-by: Dmitry Vyukov 
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Cc: Dmitry Vyukov 
>> Cc: Jim Mattson 
>> Cc: Wanpeng Li 
>> Signed-off-by: Lan Tianyu 
> 
> Tested-by: Wanpeng Li 
> 

Thanks a lot.

-- 
Best regards
Tianyu Lan


[PATCH] KVM/Emulate: Mask linear address with actual address width in order to avoid conflict with UNMAPPED_GVA

2017-12-11 Thread Lan Tianyu

Reported by syzkaller:
WARNING: CPU: 0 PID: 27962 at arch/x86/kvm/emulate.c:5631 
x86_emulate_insn+0x557/0x15f0 [kvm]
Modules linked in: kvm_intel kvm [last unloaded: kvm]
CPU: 0 PID: 27962 Comm: syz-executor Tainted: GB   W
4.15.0-rc2-next-20171208+ #32
Hardware name: Intel Corporation S1200SP/S1200SP, BIOS 
S1200SP.86B.01.03.0006.040720161253 04/07/2016
RIP: 0010:x86_emulate_insn+0x557/0x15f0 [kvm]
RSP: 0018:8807234476d0 EFLAGS: 00010282
RAX:  RBX: 88072d0237a0 RCX: a0065c4d
RDX: 1100e5a046f9 RSI: 0003 RDI: 88072d0237c8
RBP: 880723447728 R08: 88072d02 R09: a008d240
R10: 0002 R11: ed00e7d87db3 R12: 88072d0237c8
R13: 88072d023870 R14: 88072d0238c2 R15: a008d080
FS:  7f8a68666700() GS:88080220() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 2009506c CR3: 00071fec4005 CR4: 003626f0
Call Trace:
 x86_emulate_instruction+0x3bc/0xb70 [kvm]
 ? reexecute_instruction.part.162+0x130/0x130 [kvm]
 vmx_handle_exit+0x46d/0x14f0 [kvm_intel]
 ? trace_event_raw_event_kvm_entry+0xe7/0x150 [kvm]
 ? handle_vmfunc+0x2f0/0x2f0 [kvm_intel]
 ? wait_lapic_expire+0x25/0x270 [kvm]
 vcpu_enter_guest+0x720/0x1ef0 [kvm]
 ...

Syzkaller tests the KVM emulation code path by setting RSP to 0x
and running in 64-bit non-paging mode. It triggers a warning that the
exception vector is > 0x1f (it is initialized to 0xff in
x86_emulate_instruction()) during emulation of a pop instruction. This is
because the pop emulation callback em_pop() returns
X86EMUL_PROPAGATE_FAULT without populating the exception vector. The POP
emulation code accesses RSP with linear address 0x, and the KVM MMU
returns the GVA as the GPA in non-paging mode. In this case, GPA
0x (~0) conflicts with the error code UNMAPPED_GVA, which is defined
as (~(u64)0). The caller vcpu_mmio_gva_to_gpa() treats the returned
address as an error, which also causes emulator_read_write_onepage() to
return X86EMUL_PROPAGATE_FAULT without a valid exception vector.

This patch masks the linear address with the actual address width to fix
the issue, since a linear address doesn't occupy all 64 bits even in
5-level paging mode.

Reported-by: Dmitry Vyukov 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Dmitry Vyukov 
Cc: Jim Mattson 
Cc: Wanpeng Li 
Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/emulate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index abe74f7..9b21412 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -697,8 +697,8 @@ static __always_inline int __linearize(struct 
x86_emulate_ctxt *ctxt,
*max_size = 0;
switch (mode) {
case X86EMUL_MODE_PROT64:
-   *linear = la;
va_bits = ctxt_virt_addr_bits(ctxt);
+   *linear = la & ((1ul << va_bits) - 1);
if (get_canonical(la, va_bits) != la)
goto bad;
 
-- 
1.8.3.1



Re: Re: WARNING in x86_emulate_insn

2017-12-08 Thread Lan, Tianyu


On 12/8/2017 5:27 PM, Wanpeng Li wrote:

2017-12-08 16:28 GMT+08:00 Tianyu Lan :

Hi Jim&Wanpeng:
  Thanks for your help.

2017-12-08 5:25 GMT+08:00 Jim Mattson :

Try disabling the module parameter, "unrestricted_guest." Make sure
that the module parameter, "emulate_invalid_guest_state" is enabled.
This combination allows userspace to feed invalid guest state into the
in-kernel emulator.


Yes, you are right. I need to disable unrestricted_guest to reproduce the issue.


I can observe ctxt->exception.vector == 0xff which triggers Dmitry's
report. Do you figure out the reason?



Yes, this is caused by an emulation callback returning an error code
without emulating an exception or setting the exception vector.
ctxt->exception.vector defaults to 0xff in the emulate-instruction code
path.


Re: [PATCH] KVM/VMX: Avoid CR3 VMEXIT during guest real mode when "unrestricted guest" is supported.

2017-08-16 Thread Lan Tianyu
On 2017年08月16日 17:25, Paolo Bonzini wrote:
> On 16/08/2017 03:58, Lan Tianyu wrote:
>> These CR3 VMEXITs were introduced for platforms without "unrestricted guest"
>> support, to set the EPT identity table as guest CR3 in guest real mode,
>> because these platforms don't support EPT real mode (CR0.PE and CR0.PG
>> must be set to 1). But these VMEXITs are redundant on platforms with
>> "unrestricted guest" support.
> 
> This is true, but I'm not sure it's a good idea to make things more
> complex...  It is working code and is not a bottleneck anywhere.

Yes, this is just code cleanup and will not affect functionality.
I thought this change was missed when unrestricted guest support was
introduced.

-- 
Best regards
Tianyu Lan
> 
> Paolo
> 
>> Signed-off-by: Lan Tianyu 
>> ---
>>  arch/x86/kvm/vmx.c | 22 +-
>>  1 file changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 9b21b12..46dcf50 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -4221,18 +4221,20 @@ static void ept_update_paging_mode_cr0(unsigned long 
>> *hw_cr0,
>>  vmx_decache_cr3(vcpu);
>>  if (!(cr0 & X86_CR0_PG)) {
>>  /* From paging/starting to nonpaging */
>> -vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
>> - vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) |
>> - (CPU_BASED_CR3_LOAD_EXITING |
>> -  CPU_BASED_CR3_STORE_EXITING));
>> +if (!enable_unrestricted_guest)
>> +vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
>> + vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) |
>> + (CPU_BASED_CR3_LOAD_EXITING |
>> +  CPU_BASED_CR3_STORE_EXITING));
>>  vcpu->arch.cr0 = cr0;
>>  vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
>>  } else if (!is_paging(vcpu)) {
>>  /* From nonpaging to paging */
>> -vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
>> - vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) &
>> - ~(CPU_BASED_CR3_LOAD_EXITING |
>> -   CPU_BASED_CR3_STORE_EXITING));
>> +if (!enable_unrestricted_guest)
>> +vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
>> + vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) &
>> + ~(CPU_BASED_CR3_LOAD_EXITING |
>> +   CPU_BASED_CR3_STORE_EXITING));
>>  vcpu->arch.cr0 = cr0;
>>  vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
>>  }
>> @@ -4311,7 +4313,9 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, 
>> unsigned long cr3)
>>  }
>>  
>>  vmx_flush_tlb(vcpu);
>> -vmcs_writel(GUEST_CR3, guest_cr3);
>> +
>> +if (!enable_unrestricted_guest || !enable_ept)
>> +vmcs_writel(GUEST_CR3, guest_cr3);
>>  }
>>  
>>  static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>
> 




Re: [PATCH] KVM/VMX: Avoid CR3 VMEXIT during guest real mode when "unrestricted guest" is supported.

2017-08-16 Thread Lan Tianyu
On 2017年08月16日 21:26, Radim Krčmář wrote:
> 2017-08-15 21:58-0400, Lan Tianyu:
>> These CR3 VMEXITs were introduced for platforms without "unrestricted guest"
>> support, to set the EPT identity table as guest CR3 in guest real mode,
>> because these platforms don't support EPT real mode (CR0.PE and CR0.PG
>> must be set to 1). But these VMEXITs are redundant on platforms with
>> "unrestricted guest" support.
>>
>> Signed-off-by: Lan Tianyu 
>> ---
>>  arch/x86/kvm/vmx.c | 22 +-
>>  1 file changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> @@ -4311,7 +4313,9 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, 
>> unsigned long cr3)
>>  }
>>  
>>  vmx_flush_tlb(vcpu);
>> -vmcs_writel(GUEST_CR3, guest_cr3);
>> +
>> +if (!enable_unrestricted_guest || !enable_ept)
>> +vmcs_writel(GUEST_CR3, guest_cr3);
> 
> This looks wrong -- it would prevent update GUEST_CR3 outside of
> non-root mode with enable_unrestricted_guest.
> 

OK. Do you mean nested mode? I didn't consider that case.
I thought there were three cases here:

1) Shadow paging mode (enable_ept=0)

2) EPT mode without unrestricted guest mode
   (ept=1, enable_unrestricted_guest = 0)

3) EPT mode with unrestricted guest mode
   (ept=1, enable_unrestricted_guest = 1)

From my understanding, only (1) and (2) need to update guest CR3.
If nested mode still needs to update guest CR3, we can add
is_guest_mode() to the if condition. Another choice is to just skip
setting guest CR3 for case (3). The condition could be changed to

if (!(enable_unrestricted_guest && enable_ept))
	vmcs_writel(GUEST_CR3, guest_cr3);





-- 
Best regards
Tianyu Lan


[PATCH] KVM/VMX: Avoid CR3 VMEXIT during guest real mode when "unrestricted guest" is supported.

2017-08-16 Thread Lan Tianyu
These CR3 VMEXITs were introduced for platforms without "unrestricted guest"
support, to set the EPT identity table as guest CR3 in guest real mode,
because these platforms don't support EPT real mode (CR0.PE and CR0.PG
must be set to 1). But these VMEXITs are redundant on platforms with
"unrestricted guest" support.

Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/vmx.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9b21b12..46dcf50 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4221,18 +4221,20 @@ static void ept_update_paging_mode_cr0(unsigned long 
*hw_cr0,
vmx_decache_cr3(vcpu);
if (!(cr0 & X86_CR0_PG)) {
/* From paging/starting to nonpaging */
-   vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
-vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) |
-(CPU_BASED_CR3_LOAD_EXITING |
- CPU_BASED_CR3_STORE_EXITING));
+   if (!enable_unrestricted_guest)
+   vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
+vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) |
+(CPU_BASED_CR3_LOAD_EXITING |
+ CPU_BASED_CR3_STORE_EXITING));
vcpu->arch.cr0 = cr0;
vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
} else if (!is_paging(vcpu)) {
/* From nonpaging to paging */
-   vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
-vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) &
-~(CPU_BASED_CR3_LOAD_EXITING |
-  CPU_BASED_CR3_STORE_EXITING));
+   if (!enable_unrestricted_guest)
+   vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
+vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) &
+~(CPU_BASED_CR3_LOAD_EXITING |
+  CPU_BASED_CR3_STORE_EXITING));
vcpu->arch.cr0 = cr0;
vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
}
@@ -4311,7 +4313,9 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned 
long cr3)
}
 
vmx_flush_tlb(vcpu);
-   vmcs_writel(GUEST_CR3, guest_cr3);
+
+   if (!enable_unrestricted_guest || !enable_ept)
+   vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
 static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
-- 
1.8.3.1



[PATCH V2] KVM/x86: Increase max vcpu number to 8192

2017-08-15 Thread Lan Tianyu
For the HPC use case, a huge VM is created with as many vcpus as the host
has cpus, which requires support for more vcpus in a single VM. This patch
increases the max vcpu number from 288 to 8192, which is the current
default maximum CPU count for the Linux kernel.

Signed-off-by: Lan Tianyu 
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 87ac4fb..7946394 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -35,7 +35,7 @@
 #include 
 #include 
 
-#define KVM_MAX_VCPUS 288
+#define KVM_MAX_VCPUS 8192
 #define KVM_SOFT_MAX_VCPUS 240
 #define KVM_MAX_VCPU_ID 1023
 #define KVM_USER_MEM_SLOTS 509
-- 
1.8.3.1



Re: [PATCH] KVM/x86: Increase max vcpu number to 352

2017-08-15 Thread Lan Tianyu
On 2017年08月15日 22:10, Konrad Rzeszutek Wilk wrote:
> On Tue, Aug 15, 2017 at 11:00:04AM +0800, Lan Tianyu wrote:
>> On 2017年08月12日 03:35, Konrad Rzeszutek Wilk wrote:
>>> On Fri, Aug 11, 2017 at 03:00:20PM +0200, Radim Krčmář wrote:
>>>> 2017-08-11 10:11+0200, David Hildenbrand:
>>>>> On 11.08.2017 09:49, Lan Tianyu wrote:
>>>>>> Hi Konrad:
>>>>>>  Thanks for your review.
>>>>>>
>>>>>> On 2017年08月11日 01:50, Konrad Rzeszutek Wilk wrote:
>>>>>>> On Thu, Aug 10, 2017 at 06:00:59PM +0800, Lan Tianyu wrote:
>>>>>>>> Intel Xeon phi chip will support 352 logical threads. For HPC usage
>>>>>>>> case, it will create a huge VM with vcpu number as same as host cpus. 
>>>>>>>> This
>>>>>>>> patch is to increase max vcpu number to 352.
>>>>>>>
>>>>>>> Why not 1024 or 4096?
>>>>>>
>>>>>> This is on demand. We can set a higher number since KVM already has
>>>>>> x2apic and vIOMMU interrupt remapping support.
>>>>>>
>>>>>>>
>>>>>>> Are there any issues with increasing the value from 288 to 352 right 
>>>>>>> now?
>>>>>>
>>>>>> No found.
>>>>
>>>> Yeah, the only issue until around 2^20 (when we reach the maximum of
>>>> logical x2APIC addressing) should be the size of per-VM arrays when only
>>>> few VCPUs are going to be used.
>>>
>>> Migration with 352 CPUs all being busy dirtying memory and also poking
>>> at various I/O ports (say all of them dirtying the VGA) is no problem?
>>
>> This depends on what kind of workload is running during migration. I
>> think this may affect service down time since there maybe a lot of dirty
>> memory data to transfer after stopping vcpus. This also depends on how
>> user sets "migrate_set_downtime" for qemu. But I think increasing vcpus
>> will break migration function.
> 
> OK, so let me take a step back.
> 
> I see this nice 'supported' CPU count that is exposed in kvm module.
> 
> Then there is QEMU throwing out a warning if you crank up the CPU count
> above that number.
> 
> Red Hat's web-pages talk about CPU count as well.
> 
> And I am assuming all of those are around what has been tested and
> what has shown to work. And one of those test-cases surely must
> be migration.
> 

Sorry, that was a typo. I originally meant that increasing the vcpu count
shouldn't break the migration function and would just affect service
downtime. If there were such an issue, we should fix it.


> Ergo, if the vCPU count increase will break migration, then it is
> a regression.
> 
> Or a fix/work needs to be done to support a higher CPU count for
> migrating?
> 
> 
> Is my understanding incorrect?

You are right.

> 
>>
>>>
>>>
>>>>
>>>>>>> Also perhaps this should be made in an Kconfig entry?
>>>>>>
>>>>>> That will be anther option but I find different platforms will define
>>>>>> different MAX_VCPU. If we introduce a generic Kconfig entry, different
>>>>>> platforms should have different range.
>>>
>>>
>>> By different platforms you mean q35 vs the older one, and such?
>>
>> I meant x86, arm, sparc and other vendors' code define different max
>> vcpu number.
> 
> Right, and?

If we introduce a generic Kconfig entry for max vcpus across all vendors,
it would need a different max vcpu range for each vendor.




-- 
Best regards
Tianyu Lan


Re: [PATCH] KVM/x86: Increase max vcpu number to 352

2017-08-14 Thread Lan Tianyu
On 2017年08月12日 03:35, Konrad Rzeszutek Wilk wrote:
> On Fri, Aug 11, 2017 at 03:00:20PM +0200, Radim Krčmář wrote:
>> 2017-08-11 10:11+0200, David Hildenbrand:
>>> On 11.08.2017 09:49, Lan Tianyu wrote:
>>>> Hi Konrad:
>>>>Thanks for your review.
>>>>
>>>> On 2017年08月11日 01:50, Konrad Rzeszutek Wilk wrote:
>>>>> On Thu, Aug 10, 2017 at 06:00:59PM +0800, Lan Tianyu wrote:
>>>>>> Intel Xeon phi chip will support 352 logical threads. For HPC usage
>>>>>> case, it will create a huge VM with vcpu number as same as host cpus. 
>>>>>> This
>>>>>> patch is to increase max vcpu number to 352.
>>>>>
>>>>> Why not 1024 or 4096?
>>>>
>>>> This is on demand. We can set a higher number since KVM already has
>>>> x2apic and vIOMMU interrupt remapping support.
>>>>
>>>>>
>>>>> Are there any issues with increasing the value from 288 to 352 right now?
>>>>
>>>> No found.
>>
>> Yeah, the only issue until around 2^20 (when we reach the maximum of
>> logical x2APIC addressing) should be the size of per-VM arrays when only
>> few VCPUs are going to be used.
> 
> Migration with 352 CPUs all being busy dirtying memory and also poking
> at various I/O ports (say all of them dirtying the VGA) is no problem?

This depends on what kind of workload is running during migration. I
think this may affect service down time since there maybe a lot of dirty
memory data to transfer after stopping vcpus. This also depends on how
user sets "migrate_set_downtime" for qemu. But I think increasing vcpus
will break migration function.

> 
> 
>>
>>>>> Also perhaps this should be made in an Kconfig entry?
>>>>
>>>> That will be anther option but I find different platforms will define
>>>> different MAX_VCPU. If we introduce a generic Kconfig entry, different
>>>> platforms should have different range.
> 
> 
> By different platforms you mean q35 vs the older one, and such?

I meant x86, arm, sparc and other vendors' code define different max
vcpu number.

> Not whether the underlaying accelerator is tcg, Xen, KVM, or bHyve?
> 
> What I was trying to understand whether it makes even sense for
> the platforms to have such limits in the first place - and instead the
> accelerators should be the ones setting it?
> 
> 
>>>>
>>>> Radim & Paolo, Could you give some input? In qemu thread, we will set
>>>> max vcpu to 8192 for x86 VM. In KVM, The length of vcpu pointer array in
>>>> struct kvm and dest_vcpu_bitmap in kvm_irq_delivery_to_apic() are
>>>> specified by KVM_MAX_VCPUS. Should we keep align with Qemu?
>>
>> That would be great.
>>
>>> commit 682f732ecf7396e9d6fe24d44738966699fae6c0
>>> Author: Radim Krčmář 
>>> Date:   Tue Jul 12 22:09:29 2016 +0200
>>>
>>> KVM: x86: bump MAX_VCPUS to 288
>>>
>>> 288 is in high demand because of Knights Landing CPU.
>>> We cannot set the limit to 640k, because that would be wasting space.
>>>
>>> I think we want to keep it small as long as possible. I remember a patch
>>> series from Radim which would dynamically allocate memory for these
>>> arrays (using a new VM creation ioctl, specifying the max # of vcpus).
>>> Wonder what happened to that (I remember requesting a simply remalloc
>>> instead of a new VM creation ioctl :] ).
>>
>> Eh, I forgot about them ...  I didn't like the dynamic allocation as we
>> would need to protect the memory, which would result in a much bigger
>> changeset, or fragile macros.
>>
>> I can't recall the disgust now, so I'll send a RFC with the dynamic
>> version to see how it turned out.
>>
>> Thanks.


-- 
Best regards
Tianyu Lan


Re: [PATCH] KVM/x86: Increase max vcpu number to 352

2017-08-11 Thread Lan Tianyu
Hi Konrad:
Thanks for your review.

On 2017年08月11日 01:50, Konrad Rzeszutek Wilk wrote:
> On Thu, Aug 10, 2017 at 06:00:59PM +0800, Lan Tianyu wrote:
>> Intel Xeon phi chip will support 352 logical threads. For HPC usage
>> case, it will create a huge VM with vcpu number as same as host cpus. This
>> patch is to increase max vcpu number to 352.
> 
> Why not 1024 or 4096?

This is on demand. We can set a higher number since KVM already has
x2apic and vIOMMU interrupt remapping support.

> 
> Are there any issues with increasing the value from 288 to 352 right now?

None found.

> 
> Also perhaps this should be made in an Kconfig entry?

That would be another option, but I find that different platforms define
different MAX_VCPU values. If we introduce a generic Kconfig entry,
different platforms would need different ranges.

Radim & Paolo, could you give some input? In the qemu thread, we will set
the max vcpu count to 8192 for x86 VMs. In KVM, the lengths of the vcpu
pointer array in struct kvm and of dest_vcpu_bitmap in
kvm_irq_delivery_to_apic() are determined by KVM_MAX_VCPUS. Should we keep
aligned with Qemu?

-- 
Best regards
Tianyu Lan


[PATCH] KVM/x86: Increase max vcpu number to 352

2017-08-10 Thread Lan Tianyu
The Intel Xeon Phi chip will support 352 logical threads. For the HPC use
case, a huge VM is created with as many vcpus as the host has cpus. This
patch increases the max vcpu number to 352.

Signed-off-by: Lan Tianyu 
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 87ac4fb..2cdc043 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -35,7 +35,7 @@
 #include 
 #include 
 
-#define KVM_MAX_VCPUS 288
+#define KVM_MAX_VCPUS 352
 #define KVM_SOFT_MAX_VCPUS 240
 #define KVM_MAX_VCPU_ID 1023
 #define KVM_USER_MEM_SLOTS 509
-- 
1.8.3.1



[PATCH 5/7] KVM: Replace smp_mb() with smp_load_acquire() in the kvm_flush_remote_tlbs()

2016-03-12 Thread Lan Tianyu
smp_load_acquire() is enough here, and it's cheaper than smp_mb().
Add a comment about reusing the memory barrier of
kvm_make_all_cpus_request() to keep ordering between modifications to the
page tables and the read of mode.

Signed-off-by: Lan Tianyu 
---
 virt/kvm/kvm_main.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ec5aa8d..39ebee9a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -191,9 +191,23 @@ bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned 
int req)
 #ifndef CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL
 void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
-   long dirty_count = kvm->tlbs_dirty;
+   /*
+* Read tlbs_dirty before setting KVM_REQ_TLB_FLUSH in
+* kvm_make_all_cpus_request.
+*/
+   long dirty_count = smp_load_acquire(&kvm->tlbs_dirty);
 
-   smp_mb();
+   /*
+* We want to publish modifications to the page tables before reading
+* mode. Pairs with a memory barrier in arch-specific code.
+* - x86: smp_mb__after_srcu_read_unlock in vcpu_enter_guest
+* and smp_mb in walk_shadow_page_lockless_begin/end.
+* - powerpc: smp_mb in kvmppc_prepare_to_enter.
+*
+* There is already an smp_mb__after_atomic() before
+* kvm_make_all_cpus_request() reads vcpu->mode. We reuse that
+* barrier here.
+*/
if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
++kvm->stat.remote_tlb_flush;
cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[PATCH 6/7] KVM/x86: update the comment of memory barrier in the vcpu_enter_guest()

2016-03-12 Thread Lan Tianyu
The barrier also orders the write to mode against any reads of the page
tables, so update the comment.

Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/x86.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bcbce0f..4bdb4e6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6589,8 +6589,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
 
-   /* We should set ->mode before check ->requests,
-* see the comment in make_all_cpus_request.
+   /*
+* We should set ->mode before check ->requests,
+* Please see the comment in kvm_make_all_cpus_request.
+* This also orders the write to mode from any reads
+* to the page tables done while the VCPU is running.
+* Please see the comment in kvm_flush_remote_tlbs.
 */
smp_mb__after_srcu_read_unlock();
 
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[PATCH 4/7] KVM/x86: Call smp_wmb() before increasing tlbs_dirty

2016-03-12 Thread Lan Tianyu
Update the spte before increasing tlbs_dirty to make sure no TLB flush is
lost after the spte is zapped. This pairs with the barrier in
kvm_flush_remote_tlbs().

Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/paging_tmpl.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e159a81..d34475e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -949,6 +949,12 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
return 0;
 
if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
+   /*
+* Update spte before increasing tlbs_dirty to make sure
+* no tlb flush is lost after spte is zapped, see the
+* comments in kvm_flush_remote_tlbs().
+*/
+   smp_wmb();
vcpu->kvm->tlbs_dirty++;
continue;
}
@@ -964,6 +970,11 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
 
if (gfn != sp->gfns[i]) {
drop_spte(vcpu->kvm, &sp->spt[i]);
+   /*
+* The same as above where we are doing
+* prefetch_invalid_gpte().
+*/
+   smp_wmb();
vcpu->kvm->tlbs_dirty++;
continue;
}
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[PATCH 7/7] KVM/PPC: update the comment of memory barrier in the kvmppc_prepare_to_enter()

2016-03-12 Thread Lan Tianyu
The barrier also orders the write to mode against any reads of the page
tables, so update the comment.

Signed-off-by: Lan Tianyu 
---
 arch/powerpc/kvm/powerpc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 19aa59b..6a68730 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -96,6 +96,9 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
 * so we don't miss a request because the requester sees
 * OUTSIDE_GUEST_MODE and assumes we'll be checking requests
 * before next entering the guest (and thus doesn't IPI).
+* This also orders the write to mode from any reads
+* to the page tables done while the VCPU is running.
+* Please see the comment in kvm_flush_remote_tlbs.
 */
smp_mb();
 
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[PATCH 1/7] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-12 Thread Lan Tianyu
There is already a barrier inside kvm_flush_remote_tlbs() which makes
sure everyone sees our modifications to the page tables and sees changes
to vcpu->mode here. So remove the smp_mb() in kvm_mmu_commit_zap_page()
and update the comment.

Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/mmu.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2463de0..5e795af 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2390,14 +2390,12 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
return;
 
/*
-* wmb: make sure everyone sees our modifications to the page tables
-* rmb: make sure we see changes to vcpu->mode
-*/
-   smp_mb();
-
-   /*
-* Wait for all vcpus to exit guest mode and/or lockless shadow
-* page table walks.
+* We need to make sure everyone sees our modifications to
+* the page tables and see changes to vcpu->mode here. The barrier
+* in the kvm_flush_remote_tlbs() helps us to achieve these. This pairs
+* with vcpu_enter_guest and walk_shadow_page_lockless_begin/end.
+* In addition, wait for all vcpus to exit guest mode and/or lockless
+* shadow page table walks.
 */
kvm_flush_remote_tlbs(kvm);
 
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[PATCH 2/7] KVM/x86: Replace smp_mb() with smp_store_mb/release() in the walk_shadow_page_lockless_begin/end()

2016-03-12 Thread Lan Tianyu
Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/mmu.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5e795af..d1ee68c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -632,12 +632,12 @@ static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
 * kvm_flush_remote_tlbs() IPI to all active vcpus.
 */
local_irq_disable();
-   vcpu->mode = READING_SHADOW_PAGE_TABLES;
+
/*
 * Make sure a following spte read is not reordered ahead of the write
 * to vcpu->mode.
 */
-   smp_mb();
+   smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);
 }
 
 static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
@@ -647,8 +647,7 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 * reads to sptes.  If it does, kvm_commit_zap_page() can see us
 * OUTSIDE_GUEST_MODE and proceed to free the shadow page table.
 */
-   smp_mb();
-   vcpu->mode = OUTSIDE_GUEST_MODE;
+   smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);
local_irq_enable();
 }
 
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
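[Editorial sketch] The semantics of the two primitives this patch introduces can be approximated in userspace C11 (this is not the kernel's definition; the function names are invented for the demo): smp_store_mb(var, val) is roughly a store followed by a full barrier, while smp_store_release() is a store with release ordering.

```c
#include <assert.h>
#include <stdatomic.h>

enum { OUTSIDE_GUEST_MODE, READING_SHADOW_PAGE_TABLES };

static _Atomic int vcpu_mode = OUTSIDE_GUEST_MODE;

/* ~walk_shadow_page_lockless_begin(): later spte reads must not be
 * reordered before the write to mode, hence the full fence, which is
 * what smp_store_mb() provides (store + smp_mb()). */
void lockless_begin(void)
{
    atomic_store_explicit(&vcpu_mode, READING_SHADOW_PAGE_TABLES,
                          memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* ~the smp_mb() half */
}

/* ~walk_shadow_page_lockless_end(): earlier spte reads must complete
 * before mode becomes OUTSIDE_GUEST_MODE, which release ordering gives,
 * so the weaker smp_store_release() suffices on the exit side. */
void lockless_end(void)
{
    atomic_store_explicit(&vcpu_mode, OUTSIDE_GUEST_MODE,
                          memory_order_release);
}

int current_mode(void)
{
    return atomic_load_explicit(&vcpu_mode, memory_order_relaxed);
}
```

The asymmetry (full fence on entry, release on exit) mirrors the patch: the begin side needs a store-load barrier, which release ordering alone cannot provide.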



[PATCH 3/7] KVM: Replace smp_mb() with smp_mb__after_atomic() in the kvm_make_all_cpus_request()

2016-03-12 Thread Lan Tianyu
Signed-off-by: Lan Tianyu 
---
 virt/kvm/kvm_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1eae052..ec5aa8d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -170,8 +170,8 @@ bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req)
kvm_make_request(req, vcpu);
cpu = vcpu->cpu;
 
-   /* Set ->requests bit before we read ->mode */
-   smp_mb();
+   /* Set ->requests bit before we read ->mode. */
+   smp_mb__after_atomic();
 
if (cpus != NULL && cpu != -1 && cpu != me &&
  kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
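[Editorial sketch] The rationale for smp_mb__after_atomic() is that the request bit is set with an atomic RMW, which on x86 already implies a full barrier, so the explicit barrier is free there; on architectures where it is not, smp_mb__after_atomic() expands to smp_mb(). A userspace C11 analogue of the pattern (names invented for the demo, not kernel code):

```c
#include <assert.h>
#include <stdatomic.h>

#define KVM_REQ_TLB_FLUSH 0

static atomic_uint requests;
static _Atomic int mode;     /* 0 = OUTSIDE_GUEST_MODE */

/* Set the ->requests bit, then read ->mode; the RMW plus the fence make
 * sure the bit is visible before we decide whether an IPI is needed. */
int need_ipi(void)
{
    atomic_fetch_or(&requests, 1u << KVM_REQ_TLB_FLUSH);  /* set request bit */
    atomic_thread_fence(memory_order_seq_cst);            /* ~smp_mb__after_atomic() */
    return atomic_load_explicit(&mode, memory_order_relaxed) != 0;
}
```

If the vcpu is seen OUTSIDE_GUEST_MODE after the bit is set, no IPI is needed: the vcpu is guaranteed to check requests before re-entering the guest.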



[PATCH 0/7] KVM/X86/PPC: Clear up kvm mmu memory barriers and update related comments

2016-03-12 Thread Lan Tianyu
This series is to clear up kvm mmu memory barriers.
1) Remove redundant barrier (PATCH 1)
2) Replace original barrier functions with preferable ones (PATCH 2, 3, 5)
3) Fix unpaired barriers (PATCH 4)
4) Update or add barrier related comments (PATCH 6, 7)

Lan Tianyu (7):
  KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()
  KVM/x86: Replace smp_mb() with smp_store_mb/release() in the
walk_shadow_page_lockless_begin/end()
  KVM: Replace smp_mb() with smp_mb__after_atomic() in the
kvm_make_all_cpus_request()
  KVM/x86: Call smp_wmb() before increasing tlbs_dirty
  KVM: Replace smp_mb() with smp_load_acquire() in the
kvm_flush_remote_tlbs()
  KVM/x86: update the comment of memory barrier in the
vcpu_enter_guest()
  KVM/PPC:  update the comment of memory barrier in the
kvmppc_prepare_to_enter()

 arch/powerpc/kvm/powerpc.c |  3 +++
 arch/x86/kvm/mmu.c | 23 ++-
 arch/x86/kvm/paging_tmpl.h | 11 +++
 arch/x86/kvm/x86.c |  8 ++--
 virt/kvm/kvm_main.c| 22 ++
 5 files changed, 48 insertions(+), 19 deletions(-)

-- 
1.8.4.rc0.1.g8f6a3e5.dirty



Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-10 Thread Lan Tianyu
On 2016-03-10 23:26, Paolo Bonzini wrote:
> 
> 
> On 10/03/2016 15:40, Xiao Guangrong wrote:
>>  long dirty_count = kvm->tlbs_dirty;
>>
>> +/*
>> + * read tlbs_dirty before doing tlb flush to make sure no tlb
>> + * request is lost.
>> + */
>>  smp_mb();
>> +
>>  if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
>>  ++kvm->stat.remote_tlb_flush;
>>  cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
>>
>>
>> Any comment?
> 
> Compared to smp_load_acquire(), smp_mb() adds an ordering between stores
> and loads.  Is the
> 
> The load of kvm->tlbs_dirty should then be
> 
>   /*
>* Read tlbs_dirty before setting KVM_REQ_TLB_FLUSH in
>* kvm_make_all_cpus_request.  This
>*/
>   long dirty_count = smp_load_acquire(kvm->tlbs_dirty);
> 
> Tianyu, I think Xiao provided the information that I was missing.  Would
> you like to prepare the patch?

Paolo:
Sure. I will do that.

Guangrong:
Thanks a lot for your input.

-- 
Best regards
Tianyu Lan
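[Editorial sketch] The smp_load_acquire()-plus-cmpxchg protocol Paolo proposes above can be sketched as a single-threaded userspace C11 analogue (not kernel code; function names are invented for the demo). The key property is that the cmpxchg only clears the count actually observed before the flush, so an increment that races in between is never lost:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_long tlbs_dirty;

/* ~kvm_flush_remote_tlbs(): read the dirty count with acquire
 * semantics, flush, then cmpxchg it back to 0. */
void flush_remote_tlbs(void)
{
    long dirty = atomic_load_explicit(&tlbs_dirty, memory_order_acquire);
    /* ... send KVM_REQ_TLB_FLUSH to all vcpus here ... */
    /* only clear what we have actually flushed */
    atomic_compare_exchange_strong(&tlbs_dirty, &dirty, 0);
}
```

If a sync_page() path increments tlbs_dirty between the load and the cmpxchg, the cmpxchg fails and the pending count survives, forcing a flush on the next call.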


Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-08 Thread Lan Tianyu
On 2016-03-08 23:27, Paolo Bonzini wrote:
> Unfortunately that patch added a bad memory barrier: 1) it lacks a
> comment; 2) it lacks obvious pairing; 3) it is an smp_mb() after a read,
> so it's not even obvious that this memory barrier has to do with the
> immediately preceding read of kvm->tlbs_dirty.  It also is not
> documented in Documentation/virtual/kvm/mmu.txt (Guangrong documented
> there most of his other work, back in 2013, but not this one :)).
> 
> The cmpxchg is ordered anyway against the read, because 1) x86 has
> implicit ordering between earlier loads and later stores; 2) even
> store-load barriers are unnecessary for accesses to the same variable
> (in this case kvm->tlbs_dirty).
> 
> So offhand, I cannot say what it orders.  There are two possibilities:
> 
> 1) it orders the read of tlbs_dirty with the read of mode.  In this
> case, a smp_rmb() would have been enough, and it's not clear where is
> the matching smp_wmb().
> 
> 2) it orders the read of tlbs_dirty with the KVM_REQ_TLB_FLUSH request.
>  In this case a smp_load_acquire would be better.
> 
> 3) it does the same as kvm_mmu_commit_zap_page's smp_mb() but for other
> callers of kvm_flush_remote_tlbs().  In this case, we know what's the
> matching memory barrier (walk_shadow_page_lockless_*).
> 
> 4) it is completely unnecessary.
> 
> My guess is either (2) or (3), but I am not sure.  We know that
> anticipating kvm->tlbs_dirty should be safe; worst case, it causes the
> cmpxchg to fail and an extra TLB flush on the next call to the MMU
> notifier.  But I'm not sure of what happens if the processor moves the
> read later.

I found the smp_mb() in the kvm_mmu_commit_zap_page() was removed by
commit 5befdc38 and the commit was reverted by commit a086f6a1e.

The remove/revert reason is whether kvm_flush_remote_tlbs() is under
mmu_lock or not.

The mode and request aren't always accessed under mmu_lock, so I think
the smp_mb() was not related to mode and request when it was introduced.

> 
>> > The smp_mb() in the kvm_mmu_commit_zap_page() was introduced by commit
>> > c142786c6 which was merged later than commit a4ee1ca4. It pairs with
>> > smp_mb() in the walk_shadow_page_lockless_begin/end() to keep order
>> > between modifications of the page tables and reading mode.
> Yes; it also pairs with the smp_mb__after_srcu_unlock() in vcpu_enter_guest.
> 
>> > The smp_mb() in the kvm_make_all_cpus_request() was introduced by commit
>> > 6b7e2d09. It keeps order between setting request bit and reading mode.
> Yes.
> 
>>> >>  So you can:
>>> >>
>>> >> 1) add a comment to kvm_flush_remote_tlbs like:
>>> >>
>>> >>  /*
>>> >>   * We want to publish modifications to the page tables before reading
>>> >>   * mode.  Pairs with a memory barrier in arch-specific code.
>>> >>   * - x86: smp_mb__after_srcu_read_unlock in vcpu_enter_guest.
>>> >>   * - powerpc: smp_mb in kvmppc_prepare_to_enter.
>>> >>   */
>>> >>
>>> >> 2) add a comment to vcpu_enter_guest and kvmppc_prepare_to_enter, saying
>>> >> that the memory barrier also orders the write to mode from any reads
>>> >> to the page tables done while the VCPU is running.  In other words, on
>>> >> entry a single memory barrier achieves two purposes (write ->mode before
>>> >> reading requests, write ->mode before reading page tables).
>> > 
>> > These sounds good.
>> > 
>>> >> The same should be true in kvm_flush_remote_tlbs().  So you may 
>>> >> investigate
>>> >> removing the barrier from kvm_flush_remote_tlbs, because
>>> >> kvm_make_all_cpus_request already includes a memory barrier.  Like
>>> >> Thomas suggested, leave a comment in kvm_flush_remote_tlbs(),
>>> >> saying which memory barrier you are relying on and for what.
>> > 
>> > If we remove the smp_mb() in the kvm_flush_remote_tlbs(), we need to
>> > leave comments both in the kvm_flush_remote_tlbs() and
>> > kvm_mmu_commit_zap_page(), right?
> Yes.  In fact, instead of removing it, I would change it to
> 
>   smp_mb__before_atomic();
> 
> with a comment that points to the addition of the barrier in commit
> a4ee1ca4.  Unless Guangrong can enlighten us. :)
> 

How about the following comments.

Log for kvm_mmu_commit_zap_page()
/*
 * We need to make sure everyone sees our modifications to
 * the page tables and see changes to vcpu->mode here. The
 * barrier in the kvm_flush_remote_tlbs() helps us to achieve
 * these. Otherwise, wait for all vcpus to exit guest mode
 * and/or lockless shadow page table walks.
 */
kvm_flush_remote_tlbs(kvm);


Log for kvm_flush_remote_tlbs()
/*
 * We want to publish modifications to the page tables before
 * reading mode. Pairs with a memory barrier in arch-specific
 * code.
 * - x86: smp_mb__after_srcu_read_unlock in vcpu_enter_guest.
 * - powerpc: smp_mb in kvmppc_prepare_to_enter.
 */
 smp_mb__before_atomic();


>>> >>
>>> >> And finally, the memory barrier in kvm_mak

Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-08 Thread Lan Tianyu
On 2016-03-04 16:49, Paolo Bonzini wrote:
> On 04/03/2016 08:12, Lan Tianyu wrote:
>>>>/*
>>>> -   * wmb: make sure everyone sees our modifications to the page tables
>>>> -   * rmb: make sure we see changes to vcpu->mode
>>>
>>> You want to leave the comment explaining the memory barriers and tell that
>>> kvm_flush_remote_tlbs() contains the smp_mb().
>>
>> That sounds more reasonable. Will update. Thanks.
> 
> In fact, the reason for kvm_flush_remote_tlbs()'s barrier is exactly
> what was in this comment.

Hi Paolo:
Thanks for your comments.

A summary of the smp_mb()s we met in this thread. If I have
misunderstood anything, please correct me. Thanks.

The smp_mb() in the kvm_flush_remote_tlbs() was introduced by commit
a4ee1ca4 and it seems to order the read and the cmpxchg of
kvm->tlbs_dirty.

commit a4ee1ca4a36e7857d90ae8c2b85f1bde9a042c10
Author: Xiao Guangrong 
Date:   Tue Nov 23 11:13:00 2010 +0800

KVM: MMU: delay flush all tlbs on sync_page path

Quote from Avi:
| I don't think we need to flush immediately; set a "tlb dirty" bit
somewhere
| that is cleareded when we flush the tlb.
kvm_mmu_notifier_invalidate_page()
| can consult the bit and force a flush if set.

Signed-off-by: Xiao Guangrong 
Signed-off-by: Marcelo Tosatti 

The smp_mb() in the kvm_mmu_commit_zap_page() was introduced by commit
c142786c6 which was merged later than commit a4ee1ca4. It pairs with
smp_mb() in the walk_shadow_page_lockless_begin/end() to keep order
between modifications of the page tables and reading mode.

commit c142786c6291189b5c85f53d91743e1eefbd8fe0
Author: Avi Kivity 
Date:   Mon May 14 15:44:06 2012 +0300

KVM: MMU: Don't use RCU for lockless shadow walking

Using RCU for lockless shadow walking can increase the amount of memory
in use by the system, since RCU grace periods are unpredictable.  We
also
have an unconditional write to a shared variable (reader_counter), which
isn't good for scaling.

Replace that with a scheme similar to x86's get_user_pages_fast():
disable
interrupts during lockless shadow walk to force the freer
(kvm_mmu_commit_zap_page()) to wait for the TLB flush IPI to find the
processor with interrupts enabled.

We also add a new vcpu->mode, READING_SHADOW_PAGE_TABLES, to prevent
kvm_flush_remote_tlbs() from avoiding the IPI.

Signed-off-by: Avi Kivity 
Signed-off-by: Marcelo Tosatti 


The smp_mb() in the kvm_make_all_cpus_request() was introduced by commit
6b7e2d09. It keeps order between setting request bit and reading mode.

commit 6b7e2d0991489559a1df4500d77f7b76c4607ed0
Author: Xiao Guangrong 
Date:   Wed Jan 12 15:40:31 2011 +0800

KVM: Add "exiting guest mode" state

Currently we keep track of only two states: guest mode and host
mode.  This patch adds an "exiting guest mode" state that tells
us that an IPI will happen soon, so unless we need to wait for the
IPI, we can avoid it completely.

Also
1: No need atomically to read/write ->mode in vcpu's thread

2: reorganize struct kvm_vcpu to make ->mode and ->requests
   in the same cache line explicitly

Signed-off-by: Xiao Guangrong 
Signed-off-by: Avi Kivity 



>  So you can:
> 
> 1) add a comment to kvm_flush_remote_tlbs like:
> 
>   /*
>* We want to publish modifications to the page tables before reading
>* mode.  Pairs with a memory barrier in arch-specific code.
>* - x86: smp_mb__after_srcu_read_unlock in vcpu_enter_guest.
>* - powerpc: smp_mb in kvmppc_prepare_to_enter.
>*/
> 
> 2) add a comment to vcpu_enter_guest and kvmppc_prepare_to_enter, saying
> that the memory barrier also orders the write to mode from any reads
> to the page tables done while the VCPU is running.  In other words, on
> entry a single memory barrier achieves two purposes (write ->mode before
> reading requests, write ->mode before reading page tables).

These sounds good.

> 
> The same should be true in kvm_flush_remote_tlbs().  So you may investigate
> removing the barrier from kvm_flush_remote_tlbs, because
> kvm_make_all_cpus_request already includes a memory barrier.  Like
> Thomas suggested, leave a comment in kvm_flush_remote_tlbs(),
> saying which memory barrier you are relying on and for what.

If we remove the smp_mb() in the kvm_flush_remote_tlbs(), we need to
leave comments both in the kvm_flush_remote_tlbs() and
kvm_mmu_commit_zap_page(), right?

> 
> And finally, the memory barrier in kvm_make_all_cpus_request can become
> smp_mb__after_atomic, which is free on x86.

I found you have done this in your tree :)

commit 5b06b60b72b73cbe3805d2a64d223e0264bd0479
Author: Paolo Bonzini 
Date:   Wed Feb 24 12:44:17 2016 +0100

KVM: mark memory barrier with smp_mb__after_atomic

Signed-off-by: Paolo Bonzini 


> 
> Of course, all this should be done in at least three separate patches.
> 
> Thanks!
> 
> Paolo
> 


-- 
Best regards
Tianyu Lan
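[Editorial sketch] The pairing summarized in this thread — vcpu writes mode then reads requests, requester sets the request bit then reads mode — is the classic "store-buffering" pattern. A userspace C11 demo (not kernel code; names invented for the demo) showing that with a full barrier on both sides at least one party must see the other's write, so a request can never be missed by both:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static _Atomic int mode;        /* 1 = IN_GUEST_MODE */
static _Atomic int requests;    /* 1 = a request is pending */
static int vcpu_saw_request, requester_saw_guest_mode;

static void *vcpu_entry(void *arg)
{
    (void)arg;
    atomic_store_explicit(&mode, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* ~smp_mb() on guest entry */
    vcpu_saw_request = atomic_load_explicit(&requests, memory_order_relaxed);
    return NULL;
}

static void *requester(void *arg)
{
    (void)arg;
    atomic_store_explicit(&requests, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* ~smp_mb__after_atomic() */
    requester_saw_guest_mode = atomic_load_explicit(&mode, memory_order_relaxed);
    return NULL;
}

int run_pairing_demo(void)
{
    pthread_t a, b;

    if (pthread_create(&a, NULL, vcpu_entry, NULL) ||
        pthread_create(&b, NULL, requester, NULL))
        return -1;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* it must be impossible for both sides to miss each other */
    return (vcpu_saw_request || requester_saw_guest_mode) ? 0 : -1;
}
```

Either the vcpu sees the request before entering the guest, or the requester sees IN_GUEST_MODE and sends the IPI; the forbidden outcome is both loads returning zero.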


Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-03 Thread Lan Tianyu
On 2016-03-04 15:21, Thomas Gleixner wrote:
> On Fri, 4 Mar 2016, Lan Tianyu wrote:
> 
>> The following kvm_flush_remote_tlbs() will call smp_mb() inside and so
>> remove smp_mb() here.
>>
>> Signed-off-by: Lan Tianyu 
>> ---
>>  arch/x86/kvm/mmu.c | 6 --
>>  1 file changed, 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index a54ecd9..6315416 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2383,12 +2383,6 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>>  return;
>>  
>>  /*
>> - * wmb: make sure everyone sees our modifications to the page tables
>> - * rmb: make sure we see changes to vcpu->mode
> 
> You want to leave the comment explaining the memory barriers and tell that
> kvm_flush_remote_tlbs() contains the smp_mb().

That sounds more reasonable. Will update. Thanks.

> 
>> - */
>> -smp_mb();
>> -
>> -/*
>>   * Wait for all vcpus to exit guest mode and/or lockless shadow
>>   * page table walks.
>>   */
>> -- 
>> 1.8.4.rc0.1.g8f6a3e5.dirty
>>
>>


-- 
Best regards
Tianyu Lan


[PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-03 Thread Lan Tianyu
The following kvm_flush_remote_tlbs() already calls smp_mb() internally,
so remove the smp_mb() here.

Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/mmu.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a54ecd9..6315416 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2383,12 +2383,6 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
return;
 
/*
-* wmb: make sure everyone sees our modifications to the page tables
-* rmb: make sure we see changes to vcpu->mode
-*/
-   smp_mb();
-
-   /*
 * Wait for all vcpus to exit guest mode and/or lockless shadow
 * page table walks.
 */
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



Re: [PATCH] KVM: Replace the number of page entries per page with PT64_ENT_PER_PAGE macro

2016-03-01 Thread Lan, Tianyu

Sorry. I missed some other modifications; please ignore this version.

On 3/1/2016 10:49 PM, Lan Tianyu wrote:

This patch is to make code more readable.

Signed-off-by: Lan Tianyu 
---
  arch/x86/include/asm/kvm_host.h | 2 +-
  arch/x86/kvm/mmu.c  | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 44adbb8..b91a08a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -265,7 +265,7 @@ struct kvm_mmu_page {
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
unsigned long mmu_valid_gen;

-   DECLARE_BITMAP(unsync_child_bitmap, 512);
+   DECLARE_BITMAP(unsync_child_bitmap, PT64_ENT_PER_PAGE);

  #ifdef CONFIG_X86_32
/*
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95a955d..d51441b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1806,7 +1806,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
  {
int i, ret, nr_unsync_leaf = 0;

-   for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
+   for_each_set_bit(i, sp->unsync_child_bitmap, PT64_ENT_PER_PAGE) {
struct kvm_mmu_page *child;
u64 ent = sp->spt[i];




[PATCH] KVM: Replace the number of page entries per page with PT64_ENT_PER_PAGE macro

2016-03-01 Thread Lan Tianyu
This patch is to make code more readable.

Signed-off-by: Lan Tianyu 
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/mmu.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 44adbb8..b91a08a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -265,7 +265,7 @@ struct kvm_mmu_page {
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
unsigned long mmu_valid_gen;
 
-   DECLARE_BITMAP(unsync_child_bitmap, 512);
+   DECLARE_BITMAP(unsync_child_bitmap, PT64_ENT_PER_PAGE);
 
 #ifdef CONFIG_X86_32
/*
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95a955d..d51441b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1806,7 +1806,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 {
int i, ret, nr_unsync_leaf = 0;
 
-   for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
+   for_each_set_bit(i, sp->unsync_child_bitmap, PT64_ENT_PER_PAGE) {
struct kvm_mmu_page *child;
u64 ent = sp->spt[i];
 
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
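[Editorial sketch] Where the magic number 512 replaced by PT64_ENT_PER_PAGE comes from: a 4 KiB page holds 4096 / sizeof(u64) = 512 64-bit page-table entries. The constants below mirror that arithmetic (the macro definition here is assumed for illustration, not the kernel's actual definition):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE         4096
/* one entry per 64-bit spte in the page */
#define PT64_ENT_PER_PAGE (PAGE_SIZE / (int)sizeof(uint64_t))
```

So the unsync_child_bitmap has one bit per possible child spte of the shadow page.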



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu

On 12/9/2015 7:28 PM, Michael S. Tsirkin wrote:

I remember reading that it's possible to implement a bus driver
on windows if required.  But basically I don't see how windows can be
relevant to discussing guest driver patches. That discussion
probably belongs on the qemu mailing list, not on lkml.


I am not sure whether we can write a bus driver for Windows to support
migration. But I think device vendors who want to support migration will
improve their drivers if we provide such a framework in the hypervisor,
which just needs them to change their drivers.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu

On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote:

On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.


It's nice that it's appreciated, but you then go on and ignore
all that I have written here:
https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html



No, I will reply to it separately and, according to your suggestion,
split it into 3 threads.



We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.


This is not a reasonable argument.  It makes no sense to duplicate code
on Linux because you must duplicate code on Windows.  Let's assume you
must do it in the driver on windows because windows has closed source
drivers.  What does it matter? Linux can still do it as part of DMA API
and have it apply to all drivers.



Sure. Duplicated code should be encapsulated and made reusable by other
drivers, just like you said about the dummy write part.

I meant the framework should not require changes to Windows kernel code
(such as the PM core or the PCI bus driver), since that would block the
implementation on Windows.

I think it's not a problem to duplicate code in the Windows drivers.


Following is my idea to do DMA tracking.

Inject event to VF driver after memory iterate stage
and before stop VCPU and then VF driver marks dirty all
using DMA memory. The new allocated pages also need to
be marked dirty before stopping VCPU. All dirty memory
in this time slot will be migrated until stop-and-copy
stage. We also need to make sure to disable VF via clearing the
bus master enable bit for VF before migrating these memory.

The dma page allocated by VF driver also needs to reserve space
to do dummy write.


I suggested ways to do it all in the hypervisor without driver hacks, or
hide it within DMA API without need to reserve extra space. Both
approaches seem much cleaner.



This sounds reasonable. We may discuss it detail in the separate thread.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu



On 12/8/2015 1:12 AM, Alexander Duyck wrote:

On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu  wrote:

On 12/5/2015 1:07 AM, Alexander Duyck wrote:



We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.



That is a poor argument.  I highly doubt Microsoft is interested in
having to modify all of the drivers that will support direct assignment
in order to support migration.  They would likely request something
similar to what I have in that they will want a way to do DMA tracking
with minimal modification required to the drivers.



This totally depends on the NIC or other device vendors, and they should
make the decision to support migration or not. If yes, they would modify
the driver.


Having to modify every driver that wants to support live migration is
a bit much.  In addition I don't see this being limited only to NIC
devices.  You can direct assign a number of different devices, your
solution cannot be specific to NICs.


We are also adding such migration support for the QAT device, so our
solution will not be limited to NICs. This is just the beginning.

We can't limit users to only use Linux guests, so the migration feature
should work for both Windows and Linux guests.




If the target is just to call suspend/resume during migration, the
feature will be meaningless. In most cases users don't want to be
affected much during migration, so the service downtime is vital. Our
target is to apply SRIOV NIC passthrough to cloud service and NFV
(network functions virtualization) projects, which are sensitive to
network performance and stability. In my opinion, we should give the
device driver a chance to implement its own migration job, and call the
suspend and resume callbacks in the driver if it doesn't care about
performance during migration.


The suspend/resume callback should be efficient in terms of time.
After all we don't want the system to stall for a long period of time
when it should be either running or asleep.  Having it burn cycles in
a power state limbo doesn't do anyone any good.  If nothing else maybe
it will help to push the vendors to speed up those functions which
then benefit migration and the system sleep states.


If we can benefit both migration and suspend, that would be wonderful.
But migration and system PM are still different. For example, the driver
doesn't need to put the device into a deep D-state during migration, and
the host can do this after migration, while it's essential for system
sleep. The PCI configuration space and interrupt configuration are
emulated by Qemu, and Qemu can migrate these configurations to the new
machine; the driver doesn't need to deal with such things. So I think
migration still needs a different callback or a different code path than
device suspend/resume.

Another concern is that we would have to rework the PM core or the PCI
bus driver to call suspend/resume for passthrough devices during
migration. This also blocks the feature work on Windows.



Also you keep assuming you can keep the device running while you do
the migration and you can't.  You are going to corrupt the memory if
you do, and you have yet to provide any means to explain how you are
going to solve that.



The main problem is the DMA tracking issue. I will repost my solution in
a new thread for discussion. If there is no way to mark a DMA page dirty
while DMA is enabled, we have to stop DMA for a short time to do that at
the last stage.








Following is my idea to do DMA tracking.

Inject event to VF driver after memory iterate stage
and before stop VCPU and then VF driver marks dirty all
using DMA memory. The new allocated pages also need to
be marked dirty before stopping VCPU. All dirty memory
in this time slot will be migrated until stop-and-copy
stage. We also need to make sure to disable VF via clearing the
bus master enable bit for VF before migrating these memory.



The ordering of your explanation here doesn't quite work.  What needs to
happen is that you have to disable DMA and then mark the pages as dirty.
   What the disabling of the BME does is signal to the hypervisor that
the device is now stopped.  The ixgbevf_suspend call already supported
by the driver is almost exactly what is needed to take care of something
like this.



This is why I hope to reserve a piece of space in the DMA page to do a
dummy write. This can help to mark the page dirty while not requiring
DMA to stop and not racing with DMA data.


You can't and it will still race.  What concerns me is that your
patches and the document you referenced earlier show a considerable
lack of understanding about how DMA and device drivers work.  There is
a reason why device drivers have so many memory barriers and the like
in them.  The fact is when you have CPU and a device both accessing
memory things have to be done in a very specific order and you cannot
violate that.

If you have a contiguous block of memory you expect the dev

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-07 Thread Lan, Tianyu

On 12/5/2015 1:07 AM, Alexander Duyck wrote:


We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.


That is a poor argument.  I highly doubt Microsoft is interested in
having to modify all of the drivers that will support direct assignment
in order to support migration.  They would likely request something
similar to what I have in that they will want a way to do DMA tracking
with minimal modification required to the drivers.


This totally depends on the NIC or other device vendors, and they should
make the decision to support migration or not. If yes, they would modify
the driver.

If the target is just to call suspend/resume during migration, the
feature will be meaningless. In most cases users don't want to be
affected much during migration, so the service downtime is vital. Our
target is to apply SRIOV NIC passthrough to cloud service and NFV
(network functions virtualization) projects, which are sensitive to
network performance and stability. In my opinion, we should give the
device driver a chance to implement its own migration job, and call the
suspend and resume callbacks in the driver if it doesn't care about
performance during migration.




Following is my idea to do DMA tracking.

Inject event to VF driver after memory iterate stage
and before stop VCPU and then VF driver marks dirty all
using DMA memory. The new allocated pages also need to
be marked dirty before stopping VCPU. All dirty memory
in this time slot will be migrated until stop-and-copy
stage. We also need to make sure to disable VF via clearing the
bus master enable bit for VF before migrating these memory.


The ordering of your explanation here doesn't quite work.  What needs to
happen is that you have to disable DMA and then mark the pages as dirty.
  What the disabling of the BME does is signal to the hypervisor that
the device is now stopped.  The ixgbevf_suspend call already supported
by the driver is almost exactly what is needed to take care of something
like this.


This is why I hope to reserve a piece of space in the DMA page to do a
dummy write. This can help to mark the page dirty while not requiring
DMA to stop and not racing with DMA data.


If we can't do that, we have to stop DMA for a short time to mark all
DMA pages dirty and then re-enable it. I am not sure how much we gain by
tracking all DMA memory this way with the device running during
migration. I need to do some tests and compare the results with stopping
DMA directly at the last stage during migration.




The question is how we would go about triggering it.  I really don't
think the PCI configuration space approach is the right idea.
 I wonder
if we couldn't get away with some sort of ACPI event instead.  We
already require ACPI support in order to shut down the system
gracefully, I wonder if we couldn't get away with something similar in
order to suspend/resume the direct assigned devices gracefully.



I don't think there are such events in the current spec.
Also, there are two kinds of suspend/resume callbacks:
1) System suspend/resume, called during S2RAM and S2DISK.
2) Runtime suspend/resume, called by the PM core when the device is idle.
If you want to do what you mentioned, you would have to change the PM
core and the ACPI spec.


The dma page allocated by VF driver also needs to reserve space
to do dummy write.


No, this will not work.  If for example you have a VF driver allocating
memory for a 9K receive how will that work?  It isn't as if you can poke
a hole in the contiguous memory.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-04 Thread Lan, Tianyu

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.

We still need to support Windows guests for migration, and this is why our
patches keep all changes in the driver, since it's impossible to change
the Windows kernel.


Following is my idea to do DMA tracking.

Inject an event to the VF driver after the memory-iteration stage
and before stopping the VCPU; the VF driver then marks dirty all
memory it is using for DMA. Newly allocated pages also need to
be marked dirty before stopping the VCPU. All memory dirtied
in this time slot will be migrated during the stop-and-copy
stage. We also need to make sure to disable the VF, by clearing its
bus master enable bit, before migrating this memory.
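For reference, disabling bus mastering comes down to clearing bit 2 of the PCI command register; the kernel does this with a config-space write (pci_clear_master()), and the helper below only sketches the bit manipulation:

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_MASTER 0x4	/* bit 2: bus master enable */

/* Return the PCI command register value with bus mastering disabled.
 * With BME clear the VF can no longer initiate DMA, so the memory it
 * was writing to can be migrated without being changed underneath. */
static uint16_t clear_bus_master(uint16_t pci_command)
{
	return pci_command & ~PCI_COMMAND_MASTER;
}
```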

The DMA pages allocated by the VF driver also need to reserve space
to do the dummy write.


On 12/2/2015 7:44 PM, Michael S. Tsirkin wrote:

On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote:

On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin  wrote:

On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:

On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin  wrote:



There are several components to this:
- dma_map_* needs to prevent page from
   being migrated while device is running.
   For example, expose some kind of bitmap from guest
   to host, set bit there while page is mapped.
   What happens if we stop the guest and some
   bits are still set? See dma_alloc_coherent below
   for some ideas.


Yeah, I could see something like this working.  Maybe we could do
something like what was done for the NX bit and make use of the upper
order bits beyond the limits of the memory range to mark pages as
non-migratable?

I'm curious.  What we have with a DMA mapped region is essentially
shared memory between the guest and the device.  How would we resolve
something like this with IVSHMEM, or are we blocked there as well in
terms of migration?


I have some ideas. Will post later.


I look forward to it.


- dma_unmap_* needs to mark page as dirty
   This can be done by writing into a page.

- dma_sync_* needs to mark page as dirty
   This is trickier as we can not change the data.
   One solution is using atomics.
   For example:
 int x = ACCESS_ONCE(*p);
 cmpxchg(p, x, x);
   Seems to do a write without changing page
   contents.
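The quoted cmpxchg trick can be sketched in plain C with GCC's atomic builtins (a user-space stand-in for the kernel primitive; the point is that the locked operation performs a store without changing the data):

```c
#include <assert.h>

/* Mark the page containing *p dirty without altering the data:
 * read the current value, then compare-and-swap it with itself.
 * The locked operation performs a write (so dirty tracking sees it),
 * and even if the value changed under us nothing is corrupted. */
static void mark_dirty_word(int *p)
{
	int x = *(volatile int *)p;		/* like ACCESS_ONCE(*p) */
	(void)__sync_val_compare_and_swap(p, x, x);
}
```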


Like I said we can probably kill 2 birds with one stone by just
implementing our own dma_mark_clean() for x86 virtualized
environments.

I'd say we could take your solution one step further and just use 0
instead of bothering to read the value.  After all it won't write the
area if the value at the offset is not 0.


Really almost any atomic that has no side effect will do.
atomic or with 0
atomic and with 

It's just that cmpxchg already happens to have a portable
wrapper.


I was originally thinking maybe an atomic_add with 0 would be the way
to go.


cmpxchg with any value too.


  Either way though we still are using a locked prefix and
having to dirty a cache line per page which is going to come at some
cost.


I agree. It's likely not necessary for everyone
to be doing this: only people that both
run within the VM and want migration to work
need to do this logging.

So set some module option to have driver tell hypervisor that it
supports logging.  If bus mastering is enabled before this, migration is
blocked.  Or even pass some flag from hypervisor so
driver can detect it needs to log writes.
I guess this could be put in device config somewhere,
though in practice it's a global thing, not a per device one, so
maybe we need some new channel to
pass this flag to guest. CPUID?
Or maybe we can put some kind of agent in the initrd
and use the existing guest agent channel after all.
agent in initrd could open up a lot of new possibilities.



- dma_alloc_coherent memory (e.g. device rings)
   must be migrated after device stopped modifying it.
   Just stopping the VCPU is not enough:
   you must make sure device is not changing it.

   Or maybe the device has some kind of ring flush operation,
   if there was a reasonably portable way to do this
   (e.g. a flush capability could maybe be added to SRIOV)
   then hypervisor could do this.


This is where things start to get messy. I was suggesting the
suspend/resume to resolve this bit, but it might be possible to also
deal with this via something like this via clearing the bus master
enable bit for the VF.  If I am not mistaken that should disable MSI-X
interrupts and halt any DMA.  That should work as long as you have
some mechanism that is tracking the pages in use for DMA.


A bigger issue is recovering afterwards.


Agreed.


   In case you need to resume on source, you
   really need to follow the same path
   as on destination, preferably detecting
   device reset and restoring the device
   state.


The problem with detecting the reset is that you would likely have to
be polling to do something like that.


We could send some event to the guest to notify it about this
through a new or existing channel.

Or we could make i

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Lan, Tianyu



On 12/1/2015 12:07 AM, Alexander Duyck wrote:

They can only be corrected if the underlying assumptions are correct
and they aren't.  Your solution would have never worked correctly.
The problem is you assume you can keep the device running when you are
migrating and you simply cannot.  At some point you will always have
to stop the device in order to complete the migration, and you cannot
stop it before you have stopped your page tracking mechanism.  So
unless the platform has an IOMMU that is somehow taking part in the
dirty page tracking you will not be able to stop the guest and then
the device, it will have to be the device and then the guest.


>Doing suspend and resume() may help to do migration easily, but some
>devices require low service downtime. Especially networking: I heard
>that some cloud companies promised less than 500ms network service downtime.

Honestly focusing on the downtime is getting the cart ahead of the
horse.  First you need to be able to do this without corrupting system
memory and regardless of the state of the device.  You haven't even
gotten to that state yet.  Last I knew the device had to be up in
order for your migration to even work.


I think the issue is that the content of an Rx packet delivered to the stack
may be changed during migration, because that piece of memory won't be
migrated to the new machine. This may confuse applications or the stack. The
current dummy-write solution can ensure the content of the packet won't change
after the dummy write is done, but the content may not be the received data if
migration happens before that point. We can recheck the content via the
checksum or CRC in the protocol after the dummy write to ensure the content
is what the VF received. I think the stack already does such checks, and the
packet will be dropped if it fails to pass them.
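For reference, the kind of integrity check the stack already performs is the ones'-complement Internet checksum of RFC 1071; a payload that changed under migration would fail verification and the packet would be dropped. A simplified sketch (not the driver's actual code):

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 ones'-complement checksum over a byte buffer. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	while (len > 1) {
		sum += (uint32_t)data[0] << 8 | data[1];
		data += 2;
		len -= 2;
	}
	if (len)			/* odd trailing byte */
		sum += (uint32_t)data[0] << 8;
	while (sum >> 16)		/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```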


Another way is to tell Qemu about all the memory the driver is using and let
Qemu migrate that memory after stopping the VCPU and the device. This seems
safe but the implementation may be complex.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-29 Thread Lan, Tianyu

On 11/26/2015 11:56 AM, Alexander Duyck wrote:

> I am not saying you cannot modify the drivers, however what you are
doing is far too invasive.  Do you seriously plan on modifying all of
the PCI device drivers out there in order to allow any device that
might be direct assigned to a port to support migration?  I certainly
hope not.  That is why I have said that this solution will not scale.


Current drivers are not migration friendly. If a driver wants to
support migration, it necessarily has to be changed.

RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and
DMA tracking during migration. These are common for most drivers; they
may be problematic in the previous version but can be corrected later.

Doing suspend and resume() may help to do migration easily, but some
devices require low service downtime. Especially networking: I heard
that some cloud companies promised less than 500ms network service downtime.

So I think the performance impact should also be taken into account when we
design the framework.





What I am counter proposing seems like a very simple proposition.  It
can be implemented in two steps.

1.  Look at modifying dma_mark_clean().  It is a function called in
the sync and unmap paths of the lib/swiotlb.c.  If you could somehow
modify it to take care of marking the pages you unmap for Rx as being
dirty it will get you a good way towards your goal as it will allow
you to continue to do DMA while you are migrating the VM.

2.  Look at making use of the existing PCI suspend/resume calls that
are there to support PCI power management.  They have everything
needed to allow you to pause and resume DMA for the device before and
after the migration while retaining the driver state.  If you can
implement something that allows you to trigger these calls from the
PCI subsystem such as hot-plug then you would have a generic solution
that can be easily reproduced for multiple drivers beyond those
supported by ixgbevf.


I glanced at the PCI hotplug code. The hotplug events are triggered by the
PCI hotplug controller, and these events are defined in the controller spec;
it's hard to add more events. Otherwise, we would also need to add some
specific code to the PCI hotplug core, since it only adds and removes
PCI devices when it gets events. It's also a challenge to modify the Windows
hotplug code. So we may need to find another way.






Thanks.

- Alex



Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-25 Thread Lan, Tianyu

On 11/25/2015 8:28 PM, Michael S. Tsirkin wrote:

Frankly, I don't really see what this short term hack buys us,
and if it goes in, we'll have to maintain it forever.



The framework of how to notify the VF about migration status won't be
changed regardless of whether we stop the VF before doing migration.
We hope to reach agreement on this first. Tracking dirty memory still
needs more discussion and we will continue working on it. Stopping the VF
may help to work around the issue and make tracking easier.



Also, assuming you just want to do ifdown/ifup for some reason, it's
easy enough to do using a guest agent, in a completely generic way.



Just ifdown/ifup is not enough for migration. It needs to restore some
PCI settings before doing ifup on the target machine.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Lan Tianyu
On 2015年11月25日 13:30, Alexander Duyck wrote:
> No, what I am getting at is that you can't go around and modify the
> configuration space for every possible device out there.  This
> solution won't scale.


PCI config space regs are emulated by Qemu, so we can find free
PCI config space regs for the faked PCI capability. Its position need not
be permanent.


>  If you instead moved the logic for notifying
> the device into a separate mechanism such as making it a part of the
> hot-plug logic then you only have to write the code once per OS in
> order to get the hot-plug capability to pause/resume the device.  What
> I am talking about is not full hot-plug, but rather to extend the
> existing hot-plug in Qemu and the Linux kernel to support a
> "pause/resume" functionality.  The PCI hot-plug specification calls
> out the option of implementing something like this, but we don't
> currently have support for it.
>

Could you elaborate on the part of the PCI hot-plug specification you mentioned?

My concern is whether it needs a change to the PCI spec or not.



> I just feel doing it through PCI hot-plug messages will scale much
> better as you could likely make use of the power management
> suspend/resume calls to take care of most of the needed implementation
> details.
> 
> - Alex
-- 
Best regards
Tianyu Lan


Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Lan Tianyu
On 2015年11月25日 05:20, Michael S. Tsirkin wrote:
> I have to say, I was much more interested in the idea
> of tracking dirty memory. I have some thoughts about
> that one - did you give up on it then?

No, our final target is to keep the VF active before doing
migration, and tracking dirty memory is essential for that. But this
seems not easy to do in the short term for upstream. As a
starter, stop the VF before migration.

After deeper thinking, the way of stopping the VF still needs tracking of
DMA-accessed dirty memory, to make sure the receive data buffers filled
before stopping the VF are migrated. It's easier to do that via
dummy-writing the data buffer when receiving a packet.


-- 
Best regards
Tianyu Lan


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu
On 2015年11月24日 22:20, Alexander Duyck wrote:
> I'm still not a fan of this approach.  I really feel like this is
> something that should be resolved by extending the existing PCI hot-plug
> rather than trying to instrument this per driver.  Then you will get the
> goodness for multiple drivers and multiple OSes instead of just one.  An
> added advantage to dealing with this in the PCI hot-plug environment
> would be that you could then still do a hot-plug even if the guest
> didn't load a driver for the VF since you would be working with the PCI
> slot instead of the device itself.
> 
> - Alex

Hi Alex:
What you mentioned seems to be the bonding driver solution.
The paper "Live Migration with Pass-through Device for Linux VM" describes
it. It does VF hotplug during migration. In order to maintain the network
connection while the VF is out, it takes advantage of the Linux bonding
driver to switch between the VF NIC and an emulated NIC. But it has side
effects: it requires the VM to do additional configuration, and the
performance while switching between the two NICs is not good.

-- 
Best regards
Tianyu Lan


[RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Lan Tianyu
This patch is to add migration support for the ixgbevf driver. A
faked PCI migration capability table is used to communicate with Qemu to
share migration status and the mailbox irq vector index.

Qemu will notify the VF via sending an MSI-X msg to trigger the mailbox
vector during migration, and will store the migration status in the
PCI_VF_MIGRATION_VMM_STATUS reg in the new capability table.
The mailbox irq will be triggered just before the stop-and-copy stage
and after migration on the target machine.

The VF driver will put down the net device when it detects migration and
tell Qemu it's ready for migration via writing the PCI_VF_MIGRATION_VF_STATUS
reg. After migration, it puts the net device up again.

Qemu will in charge of migrating PCI config space regs and MSIX config.

The patch is dedicated to the normal case where net traffic works
and the mailbox irq is enabled. For other cases (such as the driver
isn't loaded, or the adapter is suspended or closed), the mailbox irq won't
be triggered and the VF driver will disable it via the PCI_VF_MIGRATION_CAP
reg. These cases will be resolved later.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
 2 files changed, 107 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..4b8ba2f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -438,6 +438,11 @@ struct ixgbevf_adapter {
u64 bp_tx_missed;
 #endif
 
+   u8 migration_cap;
+   u8 last_migration_reg;
+   unsigned long migration_status;
+   struct work_struct migration_task;
+
u8 __iomem *io_addr; /* Mainly for iounmap use */
u32 link_speed;
bool link_up;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..95860c2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -96,6 +96,8 @@ static int debug = -1;
 module_param(debug, int, 0);
 MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 
+#define MIGRATION_IN_PROGRESS  0
+
 static void ixgbevf_service_event_schedule(struct ixgbevf_adapter *adapter)
 {
if (!test_bit(__IXGBEVF_DOWN, &adapter->state) &&
@@ -1262,6 +1264,22 @@ static void ixgbevf_set_itr(struct ixgbevf_q_vector 
*q_vector)
}
 }
 
+static void ixgbevf_migration_check(struct ixgbevf_adapter *adapter) 
+{
+   struct pci_dev *pdev = adapter->pdev;
+   u8 val;
+
+   pci_read_config_byte(pdev,
+adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
+&val);
+
+   if (val != adapter->last_migration_reg) {
+   schedule_work(&adapter->migration_task);
+   adapter->last_migration_reg = val;
+   }
+
+}
+
 static irqreturn_t ixgbevf_msix_other(int irq, void *data)
 {
struct ixgbevf_adapter *adapter = data;
@@ -1269,6 +1287,7 @@ static irqreturn_t ixgbevf_msix_other(int irq, void *data)
 
hw->mac.get_link_status = 1;
 
+   ixgbevf_migration_check(adapter);
ixgbevf_service_event_schedule(adapter);
 
IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, adapter->eims_other);
@@ -1383,6 +1402,7 @@ out:
 static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter)
 {
struct net_device *netdev = adapter->netdev;
+   struct pci_dev *pdev = adapter->pdev;
int q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
int vector, err;
int ri = 0, ti = 0;
@@ -1423,6 +1443,12 @@ static int ixgbevf_request_msix_irqs(struct 
ixgbevf_adapter *adapter)
goto free_queue_irqs;
}
 
+   if (adapter->migration_cap) {
+   pci_write_config_byte(pdev,
+   adapter->migration_cap + PCI_VF_MIGRATION_IRQ,
+   vector);
+   }
+
return 0;
 
 free_queue_irqs:
@@ -2891,6 +2917,59 @@ static void ixgbevf_watchdog_subtask(struct 
ixgbevf_adapter *adapter)
ixgbevf_update_stats(adapter);
 }
 
+static void ixgbevf_migration_task(struct work_struct *work)
+{
+   struct ixgbevf_adapter *adapter = container_of(work,
+   struct ixgbevf_adapter,
+   migration_task);
+   struct pci_dev *pdev = adapter->pdev;
+   struct net_device *netdev = adapter->netdev;
+   u8 val;
+
+   if (!test_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status)) {
+   pci_read_config_byte(pdev,
+adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
+&val);
+   if (val != VMM_MIGRATION_START)
+   return;
+
+   pr_info("migration start\n");
+   set_bit(MIGRATION_IN_PROGRESS, &a

[RFC PATCH V2 2/3] PCI: Add macros for faked PCI migration capability

2015-11-24 Thread Lan Tianyu
This patch is to extend the PCI CAP IDs with a migration cap and
add reg macros. The CAP ID is tentative and we may find a better one if the
solution is feasible.

*PCI_VF_MIGRATION_CAP
Lets the VF driver control whether the mailbox irq is triggered during migration.

*PCI_VF_MIGRATION_VMM_STATUS
Qemu stores the migration status in this reg.

*PCI_VF_MIGRATION_VF_STATUS
The VF driver tells Qemu it is ready for migration via this reg.

*PCI_VF_MIGRATION_IRQ
The VF driver stores the mailbox interrupt vector in this reg for Qemu to trigger during
migration.

Signed-off-by: Lan Tianyu 
---
 include/uapi/linux/pci_regs.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index efe3443..9defb6f 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -216,7 +216,8 @@
 #define  PCI_CAP_ID_MSIX   0x11/* MSI-X */
 #define  PCI_CAP_ID_SATA   0x12/* SATA Data/Index Conf. */
 #define  PCI_CAP_ID_AF 0x13/* PCI Advanced Features */
-#define  PCI_CAP_ID_MAXPCI_CAP_ID_AF
+#define  PCI_CAP_ID_MIGRATION  0X14   
+#define  PCI_CAP_ID_MAXPCI_CAP_ID_MIGRATION
 #define PCI_CAP_LIST_NEXT  1   /* Next capability in the list */
 #define PCI_CAP_FLAGS  2   /* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF 4
@@ -904,4 +905,19 @@
 #define PCI_TPH_CAP_ST_SHIFT   16  /* st table shift */
 #define PCI_TPH_BASE_SIZEOF12  /* size with no st table */
 
+/* Migration*/
+#define PCI_VF_MIGRATION_CAP   0x04
+#define PCI_VF_MIGRATION_VMM_STATUS0x05
+#define PCI_VF_MIGRATION_VF_STATUS 0x06
+#define PCI_VF_MIGRATION_IRQ   0x07
+
+#define PCI_VF_MIGRATION_DISABLE   0x00
+#define PCI_VF_MIGRATION_ENABLE0x01
+
+#define VMM_MIGRATION_END0x00
+#define VMM_MIGRATION_START  0x01
+
+#define PCI_VF_WAIT_FOR_MIGRATION   0x00
+#define PCI_VF_READY_FOR_MIGRATION  0x01
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC PATCH V2 1/3] VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO

2015-11-24 Thread Lan Tianyu
This patch is to add a new ioctl cmd, VFIO_GET_PCI_CAP_INFO, to get the
PCI cap table size and to get free PCI config space regs according to
pos and size.

Qemu will add a faked PCI capability for migration and needs such
info.

Signed-off-by: Lan Tianyu 
---
 drivers/vfio/pci/vfio_pci.c | 21 
 drivers/vfio/pci/vfio_pci_config.c  | 38 +++--
 drivers/vfio/pci/vfio_pci_private.h |  5 +
 include/uapi/linux/vfio.h   | 12 
 4 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 69fab0f..2e42de0 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -784,6 +784,27 @@ hot_reset_release:
 
kfree(groups);
return ret;
+   } else if (cmd == VFIO_GET_PCI_CAP_INFO) {
+   struct vfio_pci_cap_info info;
+   int offset;
+
+   if (copy_from_user(&info, (void __user *)arg, sizeof(info)))
+   return -EFAULT;
+
+   switch (info.index) {
+   case VFIO_PCI_CAP_GET_SIZE:
+   info.size = vfio_get_cap_size(vdev, info.cap, 
info.offset);
+   break;
+   case VFIO_PCI_CAP_GET_FREE_REGION:
+   offset = vfio_find_free_pci_config_reg(vdev,
+   info.offset, info.size);
+   info.offset = offset;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return copy_to_user((void __user *)arg, &info, sizeof(info));
}
 
return -ENOTTY;
diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index ff75ca3..8afbda4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -841,6 +841,21 @@ static int vfio_find_cap_start(struct vfio_pci_device 
*vdev, int pos)
return pos;
 }
 
+int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev,
+   int pos, int size)
+{
+   int i, offset = pos;
+
+   for (i = pos; i < PCI_CFG_SPACE_SIZE; i++) {
+   if (vdev->pci_config_map[i] != PCI_CAP_ID_INVALID)
+   offset = i + 1;
+   else if (i - offset + 1 == size)
+   return offset;
+   }
+
+   return 0;
+}
+
 static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
int count, struct perm_bits *perm,
int offset, __le32 *val)
@@ -1199,6 +1214,20 @@ static int vfio_fill_vconfig_bytes(struct 
vfio_pci_device *vdev,
return ret;
 }
 
+int vfio_get_cap_size(struct vfio_pci_device *vdev, u8 cap, int pos)
+{
+   int len;
+
+   len = pci_cap_length[cap];
+   if (len == 0xFF) { /* Variable length */
+   len = vfio_cap_len(vdev, cap, pos);
+   if (len < 0)
+   return len;
+   }
+
+   return len;
+}
+
 static int vfio_cap_init(struct vfio_pci_device *vdev)
 {
struct pci_dev *pdev = vdev->pdev;
@@ -1238,12 +1267,9 @@ static int vfio_cap_init(struct vfio_pci_device *vdev)
return ret;
 
if (cap <= PCI_CAP_ID_MAX) {
-   len = pci_cap_length[cap];
-   if (len == 0xFF) { /* Variable length */
-   len = vfio_cap_len(vdev, cap, pos);
-   if (len < 0)
-   return len;
-   }
+   len = vfio_get_cap_size(vdev, cap, pos);
+   if (len < 0)
+   return len;
}
 
if (!len) {
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index ae0e1b4..91b4f9b 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -89,4 +89,9 @@ extern void vfio_pci_uninit_perm_bits(void);
 
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);
+extern int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev,
+   int pos, int size);
+extern int vfio_get_cap_size(struct vfio_pci_device *vdev,
+   u8 cap, int pos);
+
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index b57b750..dfa7023 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -495,6 +495,18 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
 
+#define VFIO_GET_PCI_CAP_INFO  _IO(VFIO_TYPE, VFIO_BASE + 22)
+struct vfio_pci_cap_info {
+   __u32   argsz;
+   __u32   flags;
+#define VFIO_PCI_CAP_GET_SIZE  (1 << 0)
+#define VFI

[RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu
This patchset is to propose a solution of adding live migration
support for SRIOV NIC.

During migration, Qemu needs to let the VF driver in the VM know about
migration start and end. Qemu adds a faked PCI migration capability
to help sync status between the two sides during migration.

Qemu triggers the VF's mailbox irq via sending an MSI-X msg when the migration
status changes. The VF driver tells Qemu its mailbox vector index
via the new PCI capability. In some cases (NIC is suspended or closed),
the VF mailbox irq is freed and the VF driver can disable irq injection via
the new capability.

The VF driver will put down the nic before migration and put it up again on
the target machine.

Lan Tianyu (3):
  VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO
  PCI: Add macros for faked PCI migration capability
  Ixgbevf: Add migration support for ixgbevf driver

 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
 drivers/vfio/pci/vfio_pci.c   |  21 +
 drivers/vfio/pci/vfio_pci_config.c|  38 ++--
 drivers/vfio/pci/vfio_pci_private.h   |   5 ++
 include/uapi/linux/pci_regs.h |  18 +++-
 include/uapi/linux/vfio.h |  12 +++
 7 files changed, 194 insertions(+), 7 deletions(-)

-- 
1.8.4.rc0.1.g8f6a3e5.dirty



Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-29 Thread Lan Tianyu
On 2015年10月30日 00:17, Alexander Duyck wrote:
> On 10/29/2015 01:33 AM, Lan Tianyu wrote:
>> On 2015年10月29日 14:58, Alexander Duyck wrote:
>>> Your code was having to do a bunch of shuffling in order to get things
>>> set up so that you could bring the interface back up.  I would argue
>>> that it may actually be faster at least on the bring-up to just drop the
>>> old rings and start over since it greatly reduced the complexity and the
>>> amount of device related data that has to be moved.
>> If give up the old ring after migration and keep DMA running before
>> stopping VCPU, it seems we don't need to track Tx/Rx descriptor ring and
>> just make sure that all Rx buffers delivered to stack has been migrated.
>>
>> 1) Dummy write Rx buffer before checking Rx descriptor to ensure packet
>> migrated first.
> 
> Don't dummy write the Rx descriptor.  You should only really need to
> dummy write the Rx buffer and you would do so after checking the
> descriptor, not before.  Otherwise you risk corrupting the Rx buffer
> because it is possible for you to read the Rx buffer, DMA occurs, and
> then you write back the Rx buffer and now you have corrupted the memory.
> 
>> 2) Make a copy of Rx descriptor and then use the copied data to check
>> buffer status. Not use the original descriptor because it won't be
>> migrated and migration may happen between two access of the Rx
>> descriptor.
> 
> Do not just blindly copy the Rx descriptor ring.  That is a recipe for
> disaster.  The problem is DMA has to happen in a very specific order for
> things to function correctly.  The Rx buffer has to be written and then
> the Rx descriptor.  The problem is you will end up getting a read-ahead
> on the Rx descriptor ring regardless of which order you dirty things in.


Sorry, I didn't say it clearly.
I meant to copy one Rx descriptor when receiving the Rx irq and handling the Rx ring.

The current code in ixgbevf_clean_rx_irq() checks the status of the Rx
descriptor, i.e. whether its Rx buffer has been populated with data, and then
reads the packet length from the Rx descriptor to handle the Rx buffer.

My idea is to do the following three steps when receiving an Rx buffer in
ixgbevf_clean_rx_irq():

(1) dummy-write the Rx buffer first,
(2) make a copy of its Rx descriptor,
(3) check the buffer status and get the length from the copy.

Migration may happen at any time.
If it happens between (1) and (2): if the Rx buffer has been populated with
data, the VF driver will not know that on the new machine, because the Rx
descriptor isn't migrated. But it's still safe.

If it happens between (2) and (3): the copy will be migrated to the new
machine, and the Rx buffer is migrated first. If there is data in the Rx
buffer, the VF driver can still handle the buffer without migrating the Rx
descriptor.

The subsequent buffers will be ignored, since we don't migrate the Rx
descriptors for them. Their status will show as not completed on the new
machine.
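A simplified sketch of the three steps (the descriptor layout and done bit below are stand-ins, not the real ixgbevf structures):

```c
#include <stdint.h>

#define RX_DESC_DONE 0x1	/* stand-in for the HW "descriptor done" bit */

struct rx_desc {		/* simplified stand-in for the HW descriptor */
	uint32_t status;
	uint32_t length;
};

/* Returns the packet length if a buffer is ready, 0 otherwise.
 * Ordering matters across a migration point:
 *  (1) dummy-write the Rx buffer so it is dirtied/migrated first,
 *  (2) snapshot the descriptor (the live one is not migrated),
 *  (3) check status and read the length only from the snapshot. */
static uint32_t clean_rx_desc(const struct rx_desc *hw_desc, uint8_t *rx_buf)
{
	struct rx_desc copy;

	*(volatile uint8_t *)rx_buf = rx_buf[0];	/* (1) dummy write */
	copy = *hw_desc;				/* (2) snapshot */
	if (!(copy.status & RX_DESC_DONE))		/* (3) use the copy */
		return 0;
	return copy.length;
}
```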

-- 
Best regards
Tianyu Lan


Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-29 Thread Lan Tianyu
On 2015年10月29日 14:58, Alexander Duyck wrote:
> 
> Your code was having to do a bunch of shuffling in order to get things
> set up so that you could bring the interface back up.  I would argue
> that it may actually be faster at least on the bring-up to just drop the
> old rings and start over since it greatly reduced the complexity and the
> amount of device related data that has to be moved.

If we give up the old ring after migration and keep DMA running before
stopping the VCPU, it seems we don't need to track the Tx/Rx descriptor rings
and just need to make sure that all Rx buffers delivered to the stack have
been migrated.

1) Dummy-write the Rx buffer before checking the Rx descriptor, to ensure
the packet is migrated first.

2) Make a copy of the Rx descriptor and then use the copied data to check
the buffer status. Don't use the original descriptor, because it won't be
migrated and migration may happen between two accesses of the Rx descriptor.

-- 
Best regards
Tianyu Lan


Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-28 Thread Lan Tianyu
On 2015-10-26 23:03, Alexander Duyck wrote:
> No.  I think you are missing the fact that there are 256 descriptors per
> page.  As such if you dirty just 1 you will be pulling in 255 more, of
> which you may or may not have pulled in the receive buffer for.
> 
> So for example if you have the descriptor ring size set to 256 then that
> means you are going to get whatever the descriptor ring has since you
> will be marking the entire ring dirty with every packet processed,
> however you cannot guarantee that you are going to get all of the
> receive buffers unless you go through and flush the entire ring prior to
> migrating.


Yes, that will be a problem. How about adding a tag to each Rx buffer and
checking the tag when delivering the Rx buffer to the stack? If the tag has
been overwritten, it means the packet data has been migrated.
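The buffer-tag idea can be sketched as follows — a hedged illustration with hypothetical `demo_*` names and an arbitrary magic value, not the proposed implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative magic value stamped into each buffer before it is posted. */
#define DEMO_RX_TAG 0xDEADBEEFu

/* Stamp the tag before handing the buffer to hardware. */
static void demo_post_rx_buffer(uint8_t *buf)
{
	uint32_t tag = DEMO_RX_TAG;

	memcpy(buf, &tag, sizeof(tag));
}

/* On delivery to the stack: if the tag is gone, DMA wrote packet data
 * here, so the data was part of what got migrated.  Returns 1 if the
 * buffer was filled, 0 if it is still pristine. */
static int demo_rx_buffer_has_data(const uint8_t *buf)
{
	uint32_t tag;

	memcpy(&tag, buf, sizeof(tag));
	return tag != DEMO_RX_TAG;
}
```

A real version would need to cope with legitimate packet bytes that happen to equal the tag; this sketch only shows the check itself.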


> 
> This is why I have said you will need to do something to force the rings
> to be flushed such as initiating a PM suspend prior to migrating.  You
> need to do something to stop the DMA and flush the remaining Rx buffers
> if you want to have any hope of being able to migrate the Rx in a
> consistent state.  Beyond that the only other thing you have to worry
> about are the Rx buffers that have already been handed off to the
> stack.  However those should be handled if you do a suspend and somehow
> flag pages as dirty when they are unmapped from the DMA.
> 
> - Alex

This will be simple and may be our first version to enable migration. But
we still hope to find a way to avoid disabling DMA before stopping the VCPU,
in order to decrease service downtime.

-- 
Best regards
Tianyu Lan


Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-25 Thread Lan Tianyu
On 2015-10-24 02:36, Alexander Duyck wrote:
> I was thinking about it and I am pretty sure the dummy write approach is
> problematic at best.  Specifically the issue is that while you are
> performing a dummy write you risk pulling in descriptors for data that
> hasn't been dummy written to yet.  So when you resume and restore your
> descriptors you will have once that may contain Rx descriptors
> indicating they contain data when after the migration they don't.

How about changing the sequence: dummy write the Rx packet data first and
then its descriptor. This ensures that the Rx data is migrated before its
descriptor and prevents such a case.

-- 
Best regards
Tianyu Lan


Re: [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver

2015-10-25 Thread Lan, Tianyu



On 10/22/2015 4:45 AM, Alexander Duyck wrote:

+/* Record states hold by PF */
+memcpy(&state->vf_data, &adapter->vfinfo[vfn], sizeof(struct
vf_data_storage));
+
+vf_shift = vfn % 32;
+reg_offset = vfn / 32;
+
+reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+reg &= ~(1 << vf_shift);
+IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+reg &= ~(1 << vf_shift);
+IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+reg &= ~(1 << vf_shift);
+IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+return sizeof(struct state_in_pf);
+}
+


This is a read.  Why does it need to switch off the VF?  Also why turn
of the anti-spoof, it doesn't make much sense.


This is to prevent packets targeted at the VM from being delivered to the
original VF after migration. E.g., after migration, the VM pings the PF of
the original machine, and the ping reply packet would be forwarded to the
original VF if it weren't disabled.

BTW, the read is done when the VM has been stopped on the source machine.





+static ssize_t ixgbe_store_state_in_pf(struct device *dev,
+   struct device_attribute *attr,
+   const char *buf, size_t count)
+{
+struct ixgbe_adapter *adapter = to_adapter(dev);
+struct pci_dev *pdev = adapter->pdev, *vdev;
+struct pci_dev *vf_pdev = to_pci_dev(dev);
+struct state_in_pf *state = (struct state_in_pf *)buf;
+int vfn = vf_pdev->virtfn_index;
+
+/* Check struct size */
+if (count != sizeof(struct state_in_pf)) {
+printk(KERN_ERR "State in PF size does not fit.\n");
+goto out;
+}
+
+/* Restore PCI configurations */
+vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+if (vdev) {
+pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND,
state->command);
+pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC,
state->msix_message_control);
+}
+
+/* Restore states hold by PF */
+memcpy(&adapter->vfinfo[vfn], &state->vf_data, sizeof(struct
vf_data_storage));
+
+  out:
+return count;
+}


Just doing a memcpy to move the vfinfo over adds no value.  The fact is
there are a number of filters that have to be configured in hardware
after, and it isn't as simple as just migrating the values stored.


Restoring VF status in the PF is triggered by the VF driver via a new
mailbox msg, which calls ixgbe_restore_setting(). Here we just copy the data
into vfinfo. If we configured the hardware early, the state would be cleared
by the FLR that is triggered by the restore operation in the VF driver.



 As I
mentioned in the case of the 82598 there is also jumbo frames to take
into account.  If the first PF didn't have it enabled, but the second
one does that implies the state of the VF needs to change to account for
that.


Yes, that will be a problem, and the VF driver also needs to know about this
change after migration and reconfigure jumbo frames.




I really think you would be better off only migrating the data related
to what can be configured using the ip link command and leaving other
values such as clear_to_send at the reset value of 0. Then you can at
least restore state from the VF after just a couple of quick messages.


This sounds good. I will try it later.




+static struct device_attribute ixgbe_per_state_in_pf_attribute =
+__ATTR(state_in_pf, S_IRUGO | S_IWUSR,
+ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
+
+void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
+{
+struct pci_dev *pdev = adapter->pdev;
+struct pci_dev *vfdev;
+unsigned short vf_id;
+int pos, ret;
+
+pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+if (!pos)
+return;
+
+/* get the device ID for the VF */
+pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
+
+vfdev = pci_get_device(pdev->vendor, vf_id, NULL);
+
+while (vfdev) {
+if (vfdev->is_virtfn) {
+ret = device_create_file(&vfdev->dev,
+&ixgbe_per_state_in_pf_attribute);
+if (ret)
+pr_warn("Unable to add VF attribute for dev %s,\n",
+dev_name(&vfdev->dev));
+}
+
+vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);
+}
+}


Driver specific sysfs is a no-go.  Otherwise we will end up with a
different implementation of this for every driver.  You will need to
find a way to make this generic in order to have a hope of getting this
to be acceptable.


Yes, Alex Williamson proposed to get/put the data via the VFIO interface.
That would be more general. I will do more research on how to communicate
between the PF driver and the VFIO subsystem.


Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"

2015-10-24 Thread Lan, Tianyu



On 10/25/2015 2:03 PM, Alexander Duyck wrote:

On 10/24/2015 08:43 AM, Lan, Tianyu wrote:


On 10/22/2015 4:52 AM, Alexander Duyck wrote:

Also have you even considered the MSI-X configuration on the VF?  I
haven't seen anything anywhere that would have migrated the VF's MSI-X
configuration from BAR 3 on one system to the new system.


MSI-X migration is done by the hypervisor (Qemu).
The following link is my Qemu patch that does that.
http://marc.info/?l=kvm&m=144544706530484&w=2


I really don't like the idea of trying to migrate the MSI-X across from
host to host while it is still active.  I really think Qemu shouldn't be
moving this kind of data over in a migration.


Hi Alex:

The VF MSI-X regs in the VM are faked by Qemu, and Qemu maps the VF's host
vectors to the guest's vectors. The migrated MSI-X data is for the faked
regs rather than the ones on the host. After migration, Qemu will remap
guest vectors to host vectors on the new machine. Moreover, the VM is
stopped while the MSI-X data is migrated.




I think that having the VF do a suspend/resume is the best way to go.
Then it simplifies things as all you have to deal with is the dirty page
tracking for the Rx DMA and you should be able to do this without making
things too difficult.



Yes, that will be simple, and the main concern is service downtime. I will
test it later.



- Alex



Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package

2015-10-24 Thread Lan, Tianyu


On 10/22/2015 5:14 AM, Alexander Duyck wrote:

Where is i being initialized?  It was here but you removed it.  Are you
using i without initializing it?


Sorry, the initialization was put into patch 10 by mistake. "i" is
assigned from "tx_ring->next_to_clean".


Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package

2015-10-24 Thread Lan, Tianyu



On 10/22/2015 8:58 PM, Michael S. Tsirkin wrote:

Do you really need to play the shifting games?
Can't you just reset everything and re-initialize the rings?
It's slower but way less intrusive.
Also removes the need to track writes into rings.


Shifting the ring is to avoid losing the packets already in the ring.
This may cause some race conditions, so I introduced a lock to
prevent such cases in a later patch.
Yes, resetting everything after migration would make things easy.
But just like you said, it would affect performance and lose
more packets. I can do a test later to get data on these
two approaches.



Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"

2015-10-24 Thread Lan, Tianyu


On 10/22/2015 4:52 AM, Alexander Duyck wrote:

Also have you even considered the MSI-X configuration on the VF?  I
haven't seen anything anywhere that would have migrated the VF's MSI-X
configuration from BAR 3 on one system to the new system.


MSI-X migration is done by the hypervisor (Qemu).
The following link is my Qemu patch that does that.
http://marc.info/?l=kvm&m=144544706530484&w=2


Re: [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device

2015-10-24 Thread Lan, Tianyu



On 10/22/2015 2:07 AM, Alexander Duyck wrote:

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

Add a "virtfn_index" member to struct pci_dev to record the VF's sequence
number within the PF. This will be used by the VF sysfs node handler.

Signed-off-by: Lan Tianyu 
---
  drivers/pci/iov.c   | 1 +
  include/linux/pci.h | 1 +
  2 files changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..065b6bb 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id,
int reset)
  virtfn->physfn = pci_dev_get(dev);
  virtfn->is_virtfn = 1;
  virtfn->multifunction = 0;
+virtfn->virtfn_index = id;
  for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
  res = &dev->resource[i + PCI_IOV_RESOURCES];
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..85c5531 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -356,6 +356,7 @@ struct pci_dev {
  unsigned intio_window_1k:1;/* Intel P2P bridge 1K I/O
windows */
  unsigned intirq_managed:1;
  pci_dev_flags_t dev_flags;
+unsigned intvirtfn_index;
  atomic_tenable_cnt;/* pci_enable_device has been called */
  u32saved_config_space[16]; /* config space saved at
suspend time */



Can't you just calculate the VF index based on the VF BDF number
combined with the information in the PF BDF number and VF
offset/stride?  Seems kind of pointless to add a variable that is only
used by one driver and is in a slowpath when you can just calculate it
pretty quickly.


Good suggestion. Will try it.
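The calculation Alexander suggests follows from the SR-IOV spec: a VF's routing ID is the PF's routing ID plus First VF Offset plus VF Stride times the VF index, so the index can be recovered by inverting that. A hedged sketch — the constants below mirror the 0x180/0x2 values hard-coded in this patch's ixgbe_get_virtfn_dev(); real code would read them from the PF's SR-IOV capability, and `demo_vf_index` is an illustrative name:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_VF_OFFSET 0x180	/* First VF Offset for this device (assumed) */
#define DEMO_VF_STRIDE 0x2	/* VF Stride for this device (assumed) */

/* rid = (bus << 8) | devfn, the same encoding ixgbe_get_virtfn_dev()
 * uses.  Returns the VF index, or -1 if vf_rid is not a VF of pf_rid. */
static int demo_vf_index(uint16_t pf_rid, uint16_t vf_rid)
{
	uint16_t delta = vf_rid - pf_rid;

	if (delta < DEMO_VF_OFFSET ||
	    (delta - DEMO_VF_OFFSET) % DEMO_VF_STRIDE)
		return -1;
	return (delta - DEMO_VF_OFFSET) / DEMO_VF_STRIDE;
}
```

This makes the proposed struct pci_dev field unnecessary: the slow-path sysfs handler can recompute the index on demand from the two routing IDs.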



- Alex



[RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver

2015-10-21 Thread Lan Tianyu
This patch adds a sysfs interface, state_in_pf, under the sysfs directory
of the VF PCI device, for Qemu to get and put the VF status held in the PF
driver during migration.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 -
 1 file changed, 155 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index ab2a2e2..89671eb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter 
*adapter)
return -ENOMEM;
 }
 
+#define IXGBE_PCI_VFCOMMAND   0x4
+#define IXGBE_PCI_VFMSIXMC0x72
+#define IXGBE_SRIOV_VF_OFFSET 0x180
+#define IXGBE_SRIOV_VF_STRIDE 0x2
+
+#define to_adapter(dev) ((struct ixgbe_adapter 
*)(pci_get_drvdata(to_pci_dev(dev)->physfn)))
+
+struct state_in_pf {
+   u16 command;
+   u16 msix_message_control;
+   struct vf_data_storage vf_data;
+};
+
+static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn)
+{
+   u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * 
vfn;
+   return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff);
+}
+
+static ssize_t ixgbe_show_state_in_pf(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct pci_dev *pdev = adapter->pdev, *vdev;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   struct ixgbe_hw *hw = &adapter->hw;
+   struct state_in_pf *state = (struct state_in_pf *)buf;
+   int vfn = vf_pdev->virtfn_index;
+   u32 reg, reg_offset, vf_shift;
+
+   /* Clear VF mac and disable VF */
+   ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, 
vfn);
+
+   /* Record PCI configurations */
+   vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+   if (vdev) {
+   pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, 
&state->command);
+   pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, 
&state->msix_message_control);
+   }
+   else
+   printk(KERN_WARNING "Unable to find VF device.\n");
+
+   /* Record states hold by PF */
+   memcpy(&state->vf_data, &adapter->vfinfo[vfn], sizeof(struct 
vf_data_storage));
+
+   vf_shift = vfn % 32;
+   reg_offset = vfn / 32;
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+   return sizeof(struct state_in_pf);
+}
+
+static ssize_t ixgbe_store_state_in_pf(struct device *dev,
+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct pci_dev *pdev = adapter->pdev, *vdev;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   struct state_in_pf *state = (struct state_in_pf *)buf;
+   int vfn = vf_pdev->virtfn_index;
+
+   /* Check struct size */
+   if (count != sizeof(struct state_in_pf)) {
+   printk(KERN_ERR "State in PF size does not fit.\n");
+   goto out;
+   }
+
+   /* Restore PCI configurations */
+   vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+   if (vdev) {
+   pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, 
state->command);
+   pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, 
state->msix_message_control);
+   }
+
+   /* Restore states hold by PF */
+   memcpy(&adapter->vfinfo[vfn], &state->vf_data, sizeof(struct 
vf_data_storage));
+
+  out:
+   return count;
+}
+
+static struct device_attribute ixgbe_per_state_in_pf_attribute =
+   __ATTR(state_in_pf, S_IRUGO | S_IWUSR,
+   ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
+
+void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
+{
+   struct pci_dev *pdev = adapter->pdev;
+   struct pci_dev *vfdev;
+   unsigned short vf_id;
+   int pos, ret;
+
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+   if (!pos)
+   return;
+
+   /* get the device ID for the VF */
+   pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
+
+   vfdev = pci_get_device(pdev->vendor, vf_id, NULL);
+
+   while (vfdev) {
+   if (vf

[RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver

2015-10-21 Thread Lan Tianyu
To let the VF driver in the guest know the migration status, Qemu
fakes PCI config regs 0xF0 and 0xF1 to expose the migration status and
get an ack from the VF driver.

When migration starts, Qemu will set reg "0xF0" to 1, notify the
VF driver by triggering a mailbox msg, and wait for the VF driver to report
that it's ready for migration (by setting reg "0xF1" to 1). After migration,
Qemu will set reg "0xF0" to 0 and notify the VF driver by a mailbox irq. The
VF driver begins to restore the tx/rx function after detecting the status
change.

When the VF receives a mailbox irq, it checks reg "0xF0" in the service
task function to get the migration status and performs the related
operations according to its value.
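The 0xF0/0xF1 handshake described above can be modeled as a tiny state machine. This is a hypothetical sketch only — `demo_vf`, the field names, and the flat flags stand in for the faked config regs and the driver's ring state:

```c
#include <assert.h>
#include <stdint.h>

enum {
	DEMO_MIG_COMPLETED	= 0x00,	/* value of faked reg 0xF0 */
	DEMO_MIG_IN_PROGRESS	= 0x01,
};

struct demo_vf {
	uint8_t reg_f0;		/* written by Qemu: migration status */
	uint8_t reg_f1;		/* written by VF driver: ready ack */
	int	rings_running;	/* stands in for live tx/rx rings */
};

/* What the VF service task would do on the mailbox interrupt. */
static void demo_service_task(struct demo_vf *vf)
{
	if (vf->reg_f0 == DEMO_MIG_IN_PROGRESS) {
		vf->rings_running = 0;	/* quiesce before migration */
		vf->reg_f1 = 1;		/* tell Qemu we are ready */
	} else {
		vf->reg_f1 = 0;
		vf->rings_running = 1;	/* restore tx/rx afterwards */
	}
}
```

Qemu would poll the faked 0xF1 byte after setting 0xF0, so the driver's ack is the gate on actually starting the memory copy.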

Steps of restarting the receive and transmit function:
1) Restore VF status in the PF driver by sending a mailbox event to the PF driver
2) Write back the reg values recorded by the self emulation layer
3) Restart the rx/tx rings
4) Recover interrupts

The transmit/receive descriptor head regs are read-only, so they can't
be restored by writing back the recorded reg values directly, and they
are set to 0 during VF reset. To reuse the original tx/rx rings, shift
the descriptor rings so that the descriptor pointed to by the original
head reg moves to the first entry of the ring, and then enable the tx/rx
rings. The VF resumes receiving and transmitting from the original
head descriptor.
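The shift itself is a rotation of the ring by `head` entries via a temporary copy, which is what ixgbevf_tx_ring_shift()/ixgbevf_rx_ring_shift() do for real descriptors below. A minimal userspace sketch, with plain ints standing in for descriptors:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Rotate the ring so that entry `head` becomes entry 0.  Returns 0 on
 * success, -1 on allocation failure. */
static int demo_ring_shift(int *ring, unsigned int count, unsigned int head)
{
	int *tmp = malloc(count * sizeof(*ring));

	if (!tmp)
		return -1;

	memcpy(tmp, ring, count * sizeof(*ring));
	/* old entries [head..count-1] move to the front ... */
	memcpy(ring, &tmp[head], (count - head) * sizeof(*ring));
	/* ... and old entries [0..head-1] wrap around to the back */
	memcpy(&ring[count - head], tmp, head * sizeof(*ring));

	free(tmp);
	return 0;
}
```

After the rotation, restarting the hardware with head = 0 is equivalent to resuming from the pre-migration head position; next_to_use/next_to_clean must be shifted by the same amount, as the real patch does.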

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/defines.h   |   6 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h   |   7 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 115 -
 .../net/ethernet/intel/ixgbevf/self-emulation.c| 107 +++
 4 files changed, 232 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h 
b/drivers/net/ethernet/intel/ixgbevf/defines.h
index 770e21a..113efd2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/defines.h
+++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
@@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc {
__le32 mss_l4len_idx;
 };
 
+union ixgbevf_desc {
+   union ixgbe_adv_rx_desc rx_desc;
+   union ixgbe_adv_tx_desc tx_desc;
+   struct ixgbe_adv_tx_context_desc tx_context_desc;
+};
+
 /* Adv Transmit Descriptor Config Masks */
 #define IXGBE_ADVTXD_DTYP_MASK 0x00F0 /* DTYP mask */
 #define IXGBE_ADVTXD_DTYP_CTXT 0x0020 /* Advanced Context Desc */
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index c823616..6eab402e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -109,7 +109,7 @@ struct ixgbevf_ring {
struct ixgbevf_ring *next;
struct net_device *netdev;
struct device *dev;
-   void *desc; /* descriptor ring memory */
+   union ixgbevf_desc *desc;   /* descriptor ring memory */
dma_addr_t dma; /* phys. address of descriptor ring */
unsigned int size;  /* length in bytes */
u16 count;  /* amount of descriptors */
@@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector 
*q_vector);
 
 void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
 void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter);
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head);
+int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head);
+void ixgbevf_restore_state(struct ixgbevf_adapter *adapter);
+inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter);
+
 
 #ifdef DEBUG
 char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw);
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 056841c..15ec361 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function 
Network Driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_VERSION);
 
+
+#define MIGRATION_COMPLETED   0x00
+#define MIGRATION_IN_PROGRESS 0x01
+
 #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
 static int debug = -1;
 module_param(debug, int, 0);
@@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring 
*ring)
return ring->stats.packets;
 }
 
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
+{
+   struct ixgbevf_tx_buffer *tx_buffer = NULL;
+   static union ixgbevf_desc *tx_desc = NULL;
+
+   tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
+   if (!tx_buffer)
+   return -ENOMEM;
+
+   tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count);
+   if (!tx_desc) {
+   vfree(tx_buffer);
+   return -ENOMEM;
+   }
+
+   memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
+   memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count 
- head));
+   memcpy(&

[RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package

2015-10-21 Thread Lan Tianyu
When transmitting a packet, the last transmit descriptor of the packet
indicates whether the packet has already been sent. The current code records
a pointer to that last descriptor in the next_to_watch field of the tx
buffer struct. This breaks if the descriptor ring is shifted after
migration, since the pointer becomes invalid. This patch replaces recording
the pointer with recording the descriptor count of the packet, and finds
the last descriptor from the first descriptor and the descriptor count.
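The replacement scheme reduces to index arithmetic modulo the ring size: the end-of-packet index is the first index plus the stored count, wrapped around the ring, much like the watch_index computation added to ixgbevf_clean_tx_irq() below. A one-function sketch with an illustrative name:

```c
#include <assert.h>

/* End-of-packet descriptor index from the packet's first descriptor
 * index and its descriptor count, wrapping around a ring of `count`
 * entries.  Unlike a cached pointer, this stays valid after the ring
 * is rotated, because both indices shift together. */
static unsigned int demo_eop_index(unsigned int first, unsigned int desc_num,
				   unsigned int count)
{
	return (first + desc_num) % count;
}
```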

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  1 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 ---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..c823616 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -54,6 +54,7 @@
  */
 struct ixgbevf_tx_buffer {
union ixgbe_adv_tx_desc *next_to_watch;
+   u16 desc_num;
unsigned long time_stamp;
struct sk_buff *skb;
unsigned int bytecount;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 4446916..056841c 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct 
ixgbevf_ring *tx_ring,
   DMA_TO_DEVICE);
}
tx_buffer->next_to_watch = NULL;
+   tx_buffer->desc_num = 0;
tx_buffer->skb = NULL;
dma_unmap_len_set(tx_buffer, len, 0);
/* tx_buffer must be completely set up in the transmit path */
@@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
union ixgbe_adv_tx_desc *tx_desc;
unsigned int total_bytes = 0, total_packets = 0;
unsigned int budget = tx_ring->count / 2;
-   unsigned int i = tx_ring->next_to_clean;
+   int i, watch_index;
 
if (test_bit(__IXGBEVF_DOWN, &adapter->state))
return true;
@@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
i -= tx_ring->count;
 
do {
-   union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
+   union ixgbe_adv_tx_desc *eop_desc;
+
+   if (!tx_buffer->desc_num)
+   break;
+
+   if (i + tx_buffer->desc_num >= 0)
+   watch_index = i + tx_buffer->desc_num;
+   else
+   watch_index = i + tx_ring->count + tx_buffer->desc_num;
 
-   /* if next_to_watch is not set then there is no work pending */
+   eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index);
if (!eop_desc)
break;
 
@@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
 
/* clear next_to_watch to prevent false hangs */
tx_buffer->next_to_watch = NULL;
+   tx_buffer->desc_num = 0;
 
/* update the statistics for this packet */
total_bytes += tx_buffer->bytecount;
@@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
u32 tx_flags = first->tx_flags;
__le32 cmd_type;
u16 i = tx_ring->next_to_use;
+   u16 start;
 
tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
 
@@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 
/* set next_to_watch value indicating a packet is present */
first->next_to_watch = tx_desc;
+   start = first - tx_ring->tx_buffer_info;
+   first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - 
start;
 
i++;
if (i == tx_ring->count)
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Lan Tianyu
l=64 time=82.3 
ms
[983769928.256503] 64 bytes from 10.239.48.100: icmp_seq=4150 ttl=64 time=72.2 
ms
[983769928.256631] 64 bytes from 10.239.48.100: icmp_seq=4158 ttl=64 time=0.500 
ms
[983769928.257284] 64 bytes from 10.239.48.100: icmp_seq=4159 ttl=64 time=0.154 
ms
[983769928.258297] 64 bytes from 10.239.48.100: icmp_seq=4160 ttl=64 time=0.165 
ms

Todo
===
So far, the patchset isn't perfect. The VF net interface can't be opened,
closed, or brought down and up during migration. Preventing such operations
during migration is future work.

Your comments are very much appreciated.


Lan Tianyu (12):
  PCI: Add virtfn_index for struct pci_device
  IXGBE: Add new mail box event to restore VF status in the PF driver
  IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF
driver
  IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg.
  IXGBE: Add new sysfs interface of "notify_vf"
  IXGBEVF: Add self emulation layer
  IXGBEVF: Add new mail box event for migration
  IXGBEVF: Rework code of finding the end transmit desc of package
  IXGBEVF: Add live migration support for VF driver
  IXGBEVF: Add lock to protect tx/rx ring operation
  IXGBEVF: Migrate VF statistic data
  IXGBEVF: Track dma dirty pages

 drivers/net/ethernet/intel/ixgbe/ixgbe.h   |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h   |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 245 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |   4 +
 drivers/net/ethernet/intel/ixgbevf/Makefile|   3 +-
 drivers/net/ethernet/intel/ixgbevf/defines.h   |   6 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h   |  10 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 179 ++-
 drivers/net/ethernet/intel/ixgbevf/mbx.h   |   3 +
 .../net/ethernet/intel/ixgbevf/self-emulation.c| 133 +++
 drivers/net/ethernet/intel/ixgbevf/vf.c|  10 +
 drivers/net/ethernet/intel/ixgbevf/vf.h|   6 +-
 drivers/pci/iov.c  |   1 +
 include/linux/pci.h|   1 +
 15 files changed, 582 insertions(+), 22 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation

2015-10-21 Thread Lan Tianyu
Ring shifting while restoring the VF function may race with normal ring
operations (transmitting/receiving packets). This patch adds tx/rx
locks to protect ring-related data.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  2 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 ---
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 6eab402e..3a748c8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -448,6 +448,8 @@ struct ixgbevf_adapter {
 
spinlock_t mbx_lock;
unsigned long last_reset;
+   spinlock_t mg_rx_lock;
+   spinlock_t mg_tx_lock;
 };
 
 enum ixbgevf_state_t {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 15ec361..04b6ce7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring 
*ring)
 
 int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
 {
+   struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
struct ixgbevf_tx_buffer *tx_buffer = NULL;
static union ixgbevf_desc *tx_desc = NULL;
+   unsigned long flags;
 
tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
if (!tx_buffer)
@@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
if (!tx_desc)
return -ENOMEM;
 
+   spin_lock_irqsave(&adapter->mg_tx_lock, flags);
memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count 
- head));
memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * 
head);
@@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
else
r->next_to_use += (r->count - head);
 
+   spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
+
vfree(tx_buffer);
vfree(tx_desc);
return 0;
@@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
 
 int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
 {
+   struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
struct ixgbevf_rx_buffer *rx_buffer = NULL;
static union ixgbevf_desc *rx_desc = NULL;
+   unsigned long flags;
 
rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
if (!rx_buffer)
@@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
if (!rx_desc)
return -ENOMEM;
 
+   spin_lock_irqsave(&adapter->mg_rx_lock, flags);
memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
memcpy(r->desc, &rx_desc[head], sizeof(union ixgbevf_desc) * (r->count 
- head));
memcpy(&r->desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * 
head);
@@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
r->next_to_use -= head;
else
r->next_to_use += (r->count - head);
+   spin_unlock_irqrestore(&adapter->mg_rx_lock, flags);
 
vfree(rx_buffer);
vfree(rx_desc);
@@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
if (test_bit(__IXGBEVF_DOWN, &adapter->state))
return true;
 
+   spin_lock(&adapter->mg_tx_lock);
+   i = tx_ring->next_to_clean;
tx_buffer = &tx_ring->tx_buffer_info[i];
tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
i -= tx_ring->count;
@@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
q_vector->tx.total_bytes += total_bytes;
q_vector->tx.total_packets += total_packets;
 
+   spin_unlock(&adapter->mg_tx_lock);
+
if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) {
struct ixgbe_hw *hw = &adapter->hw;
union ixgbe_adv_tx_desc *eop_desc;
@@ -999,10 +1012,12 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
struct ixgbevf_ring *rx_ring,
int budget)
 {
+   struct ixgbevf_adapter *adapter = netdev_priv(rx_ring->netdev);
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
u16 cleaned_count = ixgbevf_desc_unused(rx_ring);
struct sk_buff *skb = rx_ring->skb;
 
+   spin_lock(&adapter->mg_rx_lock);
while (likely(total_rx_packets < budget)) {
un

[RFC Patch 12/12] IXGBEVF: Track dma dirty pages

2015-10-21 Thread Lan Tianyu
Migration relies on tracking dirty pages to migrate memory.
Hardware can't automatically mark a page as dirty after a DMA
memory access. VF descriptor rings and data buffers are modified
by hardware when receiving and transmitting data. To track such dirty
memory manually, do dummy writes (read a byte and write it back) in the
receive and transmit paths.
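
The dummy write can be sketched in plain C (a user-space illustration of the idea, not the driver code itself): reading a byte of the DMA buffer and writing the same value back leaves the data unchanged, but the CPU store is what lets software dirty-page tracking see the page as modified.

```c
#include <stdint.h>

/* Illustrative sketch only, not the driver code: touch one byte of a
 * DMA-written buffer so the backing page is seen as dirty by CPU-side
 * dirty-page tracking. The value is unchanged; the store is what
 * matters. 'volatile' keeps the compiler from eliding the read/write
 * pair as a no-op.
 */
static inline void mark_page_dirty(volatile uint8_t *buf)
{
	uint8_t b = *buf;	/* read a byte ... */

	*buf = b;		/* ... and write it back */
}
```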

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index d22160f..ce7bd7a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -414,6 +414,9 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
break;
 
+   /* write back status to mark page dirty */
+   eop_desc->wb.status = eop_desc->wb.status;
+
/* clear next_to_watch to prevent false hangs */
tx_buffer->next_to_watch = NULL;
tx_buffer->desc_num = 0;
@@ -946,15 +949,17 @@ static struct sk_buff *ixgbevf_fetch_rx_buffer(struct 
ixgbevf_ring *rx_ring,
 {
struct ixgbevf_rx_buffer *rx_buffer;
struct page *page;
+   u8 *page_addr;
 
rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
page = rx_buffer->page;
prefetchw(page);
 
-   if (likely(!skb)) {
-   void *page_addr = page_address(page) +
- rx_buffer->page_offset;
+   /* Mark page dirty */
+   page_addr = page_address(page) + rx_buffer->page_offset;
+   *page_addr = *page_addr;
 
+   if (likely(!skb)) {
/* prefetch first cache line of first page */
prefetch(page_addr);
 #if L1_CACHE_BYTES < 128
@@ -1032,6 +1037,9 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
if (!ixgbevf_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
break;
 
+   /* Write back status to mark page dirty */
+   rx_desc->wb.upper.status_error = rx_desc->wb.upper.status_error;
+
/* This memory barrier is needed to keep us from reading
 * any other fields out of the rx_desc until we know the
 * RXD_STAT_DD bit is set
-- 
1.8.4.rc0.1.g8f6a3e5.dirty

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC Patch 06/12] IXGBEVF: Add self emulation layer

2015-10-21 Thread Lan Tianyu
In order to restore VF function after migration, add a self emulation
layer that records register values as the registers are accessed.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/Makefile|  3 ++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
 .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++
 drivers/net/ethernet/intel/ixgbevf/vf.h|  5 -
 4 files changed, 33 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile 
b/drivers/net/ethernet/intel/ixgbevf/Makefile
index 4ce4c97..841c884 100644
--- a/drivers/net/ethernet/intel/ixgbevf/Makefile
+++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
@@ -31,7 +31,8 @@
 
 obj-$(CONFIG_IXGBEVF) += ixgbevf.o
 
-ixgbevf-objs := vf.o \
+ixgbevf-objs := self-emulation.o \
+   vf.o \
 mbx.o \
 ethtool.o \
 ixgbevf_main.o
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..4446916 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
 
if (IXGBE_REMOVED(reg_addr))
return IXGBE_FAILED_READ_REG;
-   value = readl(reg_addr + reg);
+   value = ixgbe_self_emul_readl(reg_addr, reg);
if (unlikely(value == IXGBE_FAILED_READ_REG))
ixgbevf_check_remove(hw, reg);
return value;
diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c 
b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
new file mode 100644
index 000..d74b2da
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
@@ -0,0 +1,26 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vf.h"
+#include "ixgbevf.h"
+
+static u32 hw_regs[0x4000];
+
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
+{
+   u32 tmp;
+
+   tmp = readl(base + addr);
+   hw_regs[(unsigned long)addr] = tmp;
+
+   return tmp;
+}
+
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
+{
+   hw_regs[(unsigned long)addr] = val;
+   writel(val, (volatile void __iomem *)(base + addr));
+}
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h 
b/drivers/net/ethernet/intel/ixgbevf/vf.h
index d40f036..6a3f4eb 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -39,6 +39,9 @@
 
 struct ixgbe_hw;
 
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr);
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr);
+
 /* iterator type for walking multicast address lists */
 typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr,
  u32 *vmdq);
@@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 
reg, u32 value)
 
if (IXGBE_REMOVED(reg_addr))
return;
-   writel(value, reg_addr + reg);
+   ixgbe_self_emul_writel(value, reg_addr, reg);
 }
 
 #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v)
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"

2015-10-21 Thread Lan Tianyu
This patch adds a new sysfs interface, "notify_vf", under the sysfs
directory of each VF PCI device, so that Qemu can notify the VF when
the migration status changes.
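
From user space the new attribute is driven by an ordinary sysfs write. A minimal sketch, assuming a path like /sys/bus/pci/devices/&lt;VF BDF&gt;/notify_vf (the exact BDF depends on VF enumeration, and the file only exists with the patched PF driver loaded):

```c
#include <stdio.h>

/* Hypothetical user-space trigger: write any value to the VF's
 * "notify_vf" sysfs attribute. The path passed in is illustrative;
 * on a real system it would be the VF's sysfs device directory.
 */
int notify_vf(const char *attr_path)
{
	FILE *f = fopen(attr_path, "w");

	if (!f)
		return -1;
	fputs("1", f);	/* any write reaches ixgbe_store_notify_vf() */
	return fclose(f);
}
```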

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 30 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |  4 
 2 files changed, 34 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index e247d67..5cc7817 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -217,10 +217,37 @@ static ssize_t ixgbe_store_state_in_pf(struct device *dev,
return count;
 }
 
+static ssize_t ixgbe_store_notify_vf(struct device *dev,
+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct ixgbe_hw *hw = &adapter->hw;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   int vfn = vf_pdev->virtfn_index;
+   u32 ivar;
+
+   /* Enable VF mailbox irq first */
+   IXGBE_WRITE_REG(hw, IXGBE_PVTEIMS(vfn), 0x4);
+   IXGBE_WRITE_REG(hw, IXGBE_PVTEIAM(vfn), 0x4);
+   IXGBE_WRITE_REG(hw, IXGBE_PVTEIAC(vfn), 0x4);
+
+   ivar = IXGBE_READ_REG(hw, IXGBE_PVTIVAR_MISC(vfn));
+   ivar &= ~0xFF;
+   ivar |= 0x2 | IXGBE_IVAR_ALLOC_VAL;
+   IXGBE_WRITE_REG(hw, IXGBE_PVTIVAR_MISC(vfn), ivar);
+
+   ixgbe_ping_vf(adapter, vfn);
+   return count;
+}
+
 static struct device_attribute ixgbe_per_state_in_pf_attribute =
__ATTR(state_in_pf, S_IRUGO | S_IWUSR,
ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
 
+static struct device_attribute ixgbe_per_notify_vf_attribute =
+   __ATTR(notify_vf, S_IWUSR, NULL, ixgbe_store_notify_vf);
+
 void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
 {
struct pci_dev *pdev = adapter->pdev;
@@ -241,6 +268,8 @@ void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
if (vfdev->is_virtfn) {
ret = device_create_file(&vfdev->dev,
&ixgbe_per_state_in_pf_attribute);
+   ret |= device_create_file(&vfdev->dev,
+   &ixgbe_per_notify_vf_attribute);
if (ret)
pr_warn("Unable to add VF attribute for dev 
%s,\n",
dev_name(&vfdev->dev));
@@ -269,6 +298,7 @@ void ixgbe_remove_vf_attrib(struct ixgbe_adapter *adapter)
while (vfdev) {
if (vfdev->is_virtfn) {
device_remove_file(&vfdev->dev, 
&ixgbe_per_state_in_pf_attribute);
+   device_remove_file(&vfdev->dev, 
&ixgbe_per_notify_vf_attribute);
}
 
vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index dd6ba59..c6ddb66 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -2302,6 +2302,10 @@ enum {
 #define IXGBE_PVFTDT(P)(0x06018 + (0x40 * (P)))
 #define IXGBE_PVFTDWBAL(P) (0x06038 + (0x40 * (P)))
 #define IXGBE_PVFTDWBAH(P) (0x0603C + (0x40 * (P)))
+#define IXGBE_PVTEIMS(P)   (0x00D00 + (4 * (P)))
+#define IXGBE_PVTIVAR_MISC(P)  (0x04E00 + (4 * (P)))
+#define IXGBE_PVTEIAC(P)   (0x00F00 + (4 * P))
+#define IXGBE_PVTEIAM(P)   (0x04D00 + (4 * P))
 
 #define IXGBE_PVFTDWBALn(q_per_pool, vf_number, vf_q_index) \
(IXGBE_PVFTDWBAL((q_per_pool)*(vf_number) + (vf_q_index)))
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 11/12] IXGBEVF: Migrate VF statistic data

2015-10-21 Thread Lan Tianyu
VF statistic regs are read-only and can't be restored by simply writing
them back.

Currently, the statistic data returned to user space by the driver is not
equal to the raw value of the statistic regs. The VF driver records the
value of the statistic regs as base data when the net interface is brought
up or opened, calculates the increase of the regs during the last period
of online service and adds it to the saved_reset data. When user space
collects statistic data, the VF driver returns the result of
"current - base + saved_reset", where "current" is the reg value at that
point.

Restoring net function after migration is just like bringing the net
interface up or opening it. Call the existing functions to update the base
and saved_reset data so that statistic data stays continuous across
migration.
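
The bookkeeping described above can be sketched as follows (hypothetical structure and function names, not the driver's actual fields):

```c
#include <stdint.h>

/* Hypothetical sketch of the statistics bookkeeping: 'base' is the
 * hardware counter sampled when the interface came up, 'saved_reset'
 * accumulates counts from earlier online periods.
 */
struct stat_counter {
	uint64_t base;		/* reg value sampled at interface up/open */
	uint64_t saved_reset;	/* counts accumulated in earlier periods  */
};

/* Value reported to user space: "current - base + saved_reset". */
static uint64_t stat_read(const struct stat_counter *c, uint64_t current_reg)
{
	return current_reg - c->base + c->saved_reset;
}

/* On migration (or interface up), fold the running delta into
 * saved_reset and re-base against the new hardware value so the
 * reported statistics stay continuous.
 */
static void stat_rebase(struct stat_counter *c, uint64_t old_reg,
			uint64_t new_reg)
{
	c->saved_reset += old_reg - c->base;
	c->base = new_reg;
}
```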

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 04b6ce7..d22160f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3005,6 +3005,7 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
return 0;
 
del_timer_sync(&adapter->service_timer);
+   ixgbevf_update_stats(adapter);
pr_info("migration start\n");
migration_status = MIGRATION_IN_PROGRESS; 
 
@@ -3017,6 +3018,8 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
return 1;
 
ixgbevf_restore_state(adapter);
+   ixgbevf_save_reset_stats(adapter);
+   ixgbevf_init_last_counter_stats(adapter);
migration_status = MIGRATION_COMPLETED;
pr_info("migration end\n");
return 0;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 07/12] IXGBEVF: Add new mail box event for migration

2015-10-21 Thread Lan Tianyu
VF status in the PF driver needs to be restored after migration resets
the VF hardware. This patch adds a new event with which the VF driver
notifies the PF driver to restore that status.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/mbx.h |  3 +++
 drivers/net/ethernet/intel/ixgbevf/vf.c  | 10 ++
 drivers/net/ethernet/intel/ixgbevf/vf.h  |  1 +
 3 files changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/mbx.h 
b/drivers/net/ethernet/intel/ixgbevf/mbx.h
index 82f44e0..22761d8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/mbx.h
+++ b/drivers/net/ethernet/intel/ixgbevf/mbx.h
@@ -112,6 +112,9 @@ enum ixgbe_pfvf_api_rev {
 #define IXGBE_VF_GET_RETA  0x0a/* VF request for RETA */
 #define IXGBE_VF_GET_RSS_KEY   0x0b/* get RSS hash key */
 
+/* mail box event for live migration  */
+#define IXGBE_VF_NOTIFY_RESUME  0x0c /* VF notify PF migration to restore 
status */
+
 /* length of permanent address message returned from PF */
 #define IXGBE_VF_PERMADDR_MSG_LEN  4
 /* word in permanent address message with the current multicast type */
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c 
b/drivers/net/ethernet/intel/ixgbevf/vf.c
index d1339b0..1e4e5e6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.c
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.c
@@ -717,6 +717,15 @@ int ixgbevf_get_queues(struct ixgbe_hw *hw, unsigned int 
*num_tcs,
return err;
 }
 
+static void ixgbevf_notify_resume_vf(struct ixgbe_hw *hw)
+{
+   struct ixgbe_mbx_info *mbx = &hw->mbx;
+   u32 msgbuf[1];
+
+   msgbuf[0] = IXGBE_VF_NOTIFY_RESUME;
+   mbx->ops.write_posted(hw, msgbuf, 1);
+}
+
 static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
.init_hw= ixgbevf_init_hw_vf,
.reset_hw   = ixgbevf_reset_hw_vf,
@@ -729,6 +738,7 @@ static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
.update_mc_addr_list= ixgbevf_update_mc_addr_list_vf,
.set_uc_addr= ixgbevf_set_uc_addr_vf,
.set_vfta   = ixgbevf_set_vfta_vf,
+   .notify_resume  = ixgbevf_notify_resume_vf,
 };
 
 const struct ixgbevf_info ixgbevf_82599_vf_info = {
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h 
b/drivers/net/ethernet/intel/ixgbevf/vf.h
index 6a3f4eb..a25fe81 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -70,6 +70,7 @@ struct ixgbe_mac_operations {
s32 (*disable_mc)(struct ixgbe_hw *);
s32 (*clear_vfta)(struct ixgbe_hw *);
s32 (*set_vfta)(struct ixgbe_hw *, u32, u32, bool);
+   void (*notify_resume)(struct ixgbe_hw *); 
 };
 
 enum ixgbe_mac_type {
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 01/12] PCI: Add virtfn_index for struct pci_dev

2015-10-21 Thread Lan Tianyu
Add a "virtfn_index" member to struct pci_dev to record the VF's sequence
number within its PF. This will be used by the VF sysfs node handling code.

Signed-off-by: Lan Tianyu 
---
 drivers/pci/iov.c   | 1 +
 include/linux/pci.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..065b6bb 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int 
reset)
virtfn->physfn = pci_dev_get(dev);
virtfn->is_virtfn = 1;
virtfn->multifunction = 0;
+   virtfn->virtfn_index = id;
 
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &dev->resource[i + PCI_IOV_RESOURCES];
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..85c5531 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -356,6 +356,7 @@ struct pci_dev {
unsigned intio_window_1k:1; /* Intel P2P bridge 1K I/O windows */
unsigned intirq_managed:1;
pci_dev_flags_t dev_flags;
+   unsigned intvirtfn_index;
atomic_tenable_cnt; /* pci_enable_device has been called */
 
u32 saved_config_space[16]; /* config space saved at 
suspend time */
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver

2015-10-21 Thread Lan Tianyu
This patch restores VF status in the PF driver when the PF gets the
event from the VF.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h   |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++
 3 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 636f9e3..9d5669a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -148,6 +148,7 @@ struct vf_data_storage {
bool pf_set_mac;
u16 pf_vlan; /* When set, guest VLAN config not allowed. */
u16 pf_qos;
+   u32 vf_lpe;
u16 tx_rate;
u16 vlan_count;
u8 spoofchk_enabled;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
index b1e4703..8fdb38d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
@@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev {
 
 /* mailbox API, version 1.1 VF requests */
 #define IXGBE_VF_GET_QUEUES0x09 /* get queue configuration */
+#define IXGBE_VF_NOTIFY_RESUME0x0c /* VF notify PF migration finishing */
 
 /* GET_QUEUES return data indices within the mailbox */
 #define IXGBE_VF_TX_QUEUES 1   /* number of Tx queues supported */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 1d17b58..ab2a2e2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter 
*adapter, u32 vf,
}
 }
 
+/**
+ *  Restore the settings by mailbox, after migration
+ **/
+void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf)
+{
+   struct ixgbe_hw *hw = &adapter->hw;
+   u32 reg, reg_offset, vf_shift;
+   int rar_entry = hw->mac.num_rar_entries - (vf + 1);
+
+   vf_shift = vf % 32;
+   reg_offset = vf / 32;
+
+   /* enable transmit and receive for vf */
+   reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+   ixgbe_vf_reset_event(adapter, vf);
+
+   hw->mac.ops.set_rar(hw, rar_entry,
+   adapter->vfinfo[vf].vf_mac_addresses,
+   vf, IXGBE_RAH_AV);
+
+
+   if (adapter->vfinfo[vf].vf_lpe)
+   ixgbe_set_vf_lpe(adapter, &adapter->vfinfo[vf].vf_lpe, vf);
+}
+
 static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf)
 {
struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
@@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter 
*adapter, u32 vf)
break;
case IXGBE_VF_SET_LPE:
retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf);
+   adapter->vfinfo[vf].vf_lpe = *msgbuf;
break;
case IXGBE_VF_SET_MACVLAN:
retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf);
@@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter 
*adapter, u32 vf)
case IXGBE_VF_GET_RSS_KEY:
retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf);
break;
+   case IXGBE_VF_NOTIFY_RESUME:
+   ixgbe_restore_setting(adapter, vf);
+   break;
default:
e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]);
retval = IXGBE_ERR_MBX;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 04/12] IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg.

2015-10-21 Thread Lan Tianyu
This patch adds ixgbe_ping_vf() to notify a specified VF. When the
migration status changes, it's necessary to notify the VF of the change.
The VF driver will check the migration status when it gets a mailbox msg.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 19 ---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h |  1 +
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 89671eb..e247d67 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -1318,18 +1318,23 @@ void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter)
IXGBE_WRITE_REG(hw, IXGBE_VFRE(1), 0);
 }
 
-void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
+void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn)
 {
struct ixgbe_hw *hw = &adapter->hw;
u32 ping;
+
+   ping = IXGBE_PF_CONTROL_MSG;
+   if (adapter->vfinfo[vfn].clear_to_send)
+   ping |= IXGBE_VT_MSGTYPE_CTS;
+   ixgbe_write_mbx(hw, &ping, 1, vfn);
+}
+
+void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
+{
int i;
 
-   for (i = 0 ; i < adapter->num_vfs; i++) {
-   ping = IXGBE_PF_CONTROL_MSG;
-   if (adapter->vfinfo[i].clear_to_send)
-   ping |= IXGBE_VT_MSGTYPE_CTS;
-   ixgbe_write_mbx(hw, &ping, 1, i);
-   }
+   for (i = 0 ; i < adapter->num_vfs; i++)
+   ixgbe_ping_vf(adapter, i);
 }
 
 int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
index 2c197e6..143e2fd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
@@ -41,6 +41,7 @@ void ixgbe_msg_task(struct ixgbe_adapter *adapter);
 int ixgbe_vf_configuration(struct pci_dev *pdev, unsigned int event_mask);
 void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter);
 void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter);
+void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn);
 int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int queue, u8 *mac);
 int ixgbe_ndo_set_vf_vlan(struct net_device *netdev, int queue, u16 vlan,
   u8 qos);
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[PATCH] KVM: Avoid warning "user requested TSC rate below hardware speed" when create VM.

2015-06-16 Thread Lan Tianyu
KVM populates max_tsc_khz with tsc_khz at the arch init stage on
constant-TSC machines and creates VMs with max_tsc_khz as the TSC rate.
However, tsc_khz may change while the tsc clocksource driver refines its
calibration. This causes VMs to be created with a slow TSC and produces
the following warning. To fix the issue, compare max_tsc_khz with tsc_khz
when creating a VM and update max_tsc_khz with the new value of tsc_khz
if it has changed.

[   94.916906] [ cut here ]
[   94.922127] WARNING: CPU: 0 PID: 824 at arch/x86/kvm/vmx.c:2272 
vmx_set_tsc_khz+0x3e/0x50()
[   94.931503] user requested TSC rate below hardware speed
[   94.937470] Modules linked in:
[   94.940923] CPU: 0 PID: 824 Comm: qemu-system-x86 Tainted: G  D W   
4.1.0-rc3+ #4
[   94.960350]  81f453f8 88027e9f3bc8 81b5eb8a 

[   94.968721]  88027e9f3c18 88027e9f3c08 810e6f8a 
8802
[   94.977103]  001d3300 88027e98 0001 
88027e98
[   94.985476] Call Trace:
[   94.988242]  [] dump_stack+0x45/0x57
[   94.994020]  [] warn_slowpath_common+0x8a/0xc0
[   95.000772]  [] warn_slowpath_fmt+0x46/0x50
[   95.007222]  [] vmx_set_tsc_khz+0x3e/0x50
[   95.013488]  [] kvm_set_tsc_khz.part.106+0xa7/0xe0
[   95.020618]  [] kvm_arch_vcpu_init+0x208/0x230
[   95.027373]  [] kvm_vcpu_init+0xc9/0x110
[   95.033540]  [] vmx_create_vcpu+0x70/0xc30
[   95.039911]  [] ? vmx_create_vcpu+0x20/0xc30
[   95.046476]  [] kvm_arch_vcpu_create+0x3e/0x60
[   95.053233]  [] kvm_vm_ioctl+0x1a0/0x770
[   95.059400]  [] ? __fget+0x5/0x200
[   95.064991]  [] ? rcu_irq_exit+0x7f/0xd0
[   95.071157]  [] do_vfs_ioctl+0x308/0x540
[   95.077323]  [] ? expand_files+0x1f1/0x280
[   95.083684]  [] ? selinux_file_ioctl+0x5b/0x100
[   95.090538]  [] SyS_ioctl+0x81/0xa0
[   95.096218]  [] system_call_fastpath+0x12/0x76
[   95.102974] ---[ end trace 08ade884081d9dd7 ]---
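
The core of the fix can be sketched in isolation (user-space stand-ins for the kernel globals, with illustrative values; the real code additionally re-notifies each online CPU via smp_call_function_single()):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the kernel globals, for illustration only:
 * tsc_khz is the refined calibration result, max_tsc_khz the
 * value captured at arch init time.
 */
static uint64_t tsc_khz = 2100000;
static uint64_t max_tsc_khz = 2000000;

/* Sketch of the check the patch adds: refresh the cached maximum rate
 * if the clocksource calibration has moved tsc_khz since arch init.
 * Returns true when an update (and hence per-CPU re-notification)
 * was needed.
 */
static bool refresh_max_tsc_khz(void)
{
	if (max_tsc_khz != tsc_khz) {
		max_tsc_khz = tsc_khz;
		return true;
	}
	return false;
}
```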

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99861
Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/x86.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43f0df7..6c7fefe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7814,6 +7814,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
struct page *page;
struct kvm *kvm;
+   int cpu;
int r;
 
BUG_ON(vcpu->kvm == NULL);
@@ -7833,6 +7834,21 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
}
vcpu->arch.pio_data = page_address(page);
 
+   /*
+* max_tsc_khz records tsc_khz at arch init stage on the constant tsc
+* machine. However, tsc_khz maybe changed during tsc clocksource driver
+* refines calibration. This will cause to create VM with slow tsc
+* and produce warning. To avoid such case, check whether tsc_khz
+* has been changed here and update max_tsc_khz with new value of
+* tsc_khz if changed.
+*/
+   if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && max_tsc_khz != tsc_khz) {
+   max_tsc_khz = tsc_khz;
+   pr_debug("kvm: max_tsc_khz is changed to %ld\n", max_tsc_khz);
+   for_each_online_cpu(cpu)
+   smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
+   }
+
kvm_set_tsc_khz(vcpu, max_tsc_khz);
 
r = kvm_mmu_create(vcpu);
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



Re: [PATCH] ACPI / osl: add acpi_os_down_wait to avoid a schedule BUG

2015-06-02 Thread Lan Tianyu
On 2015年05月28日 14:33, Pan, XinhuiX wrote:
> acpi_os_wait_semaphore can be called in local/hard irq disabled paths, like in 
> the cpu up/down callbacks.
> So when a driver tries to acquire the semaphore, the current code may call 
> down_timeout, which might sleep.
> Then we hit a panic as we can't schedule here. So introduce acpi_os_down_wait 
> to cover such cases.
> acpi_os_down_wait uses down_trylock, and uses cpu_relax to wait for the 
> semaphore to be signalled if preempt is disabled.
> 
> below is the panic.

Hi Xinhui:

Does this issue happen in the latest upstream kernel?
In the latest code, acpi_cpu_soft_notify() doesn't deal with the CPU_DYING
event and returns directly, so the issue should not take place.

> 
> [ 1148.230132, 1]smpboot: CPU 3 is now offline
> [ 1148.277288, 0]smpboot: CPU 2 is now offline
> [ 1148.322385, 1]BUG: scheduling while atomic: migration/1/13/0x0002
> [ 1148.329604, 1]Modules linked in: hid_sensor_hub sens_col_core hid_heci_ish 
> heci_ish heci vidt_driver atomisp_css2401a0_v21 lm3642 8723bs(O) cfg80211 
> gc2235 bt_lpm videobuf_vmalloc 6lowpan_iphc ip6table_raw iptable_raw 
> videobuf_core rfkill_gpio atmel_mxt_ts
> [ 1148.355276, 1]CPU: 1 PID: 13 Comm: migration/1 Tainted: GW  O 
> 3.14.37-x86_64-L1-R409-g73e8207 #25
> [ 1148.365983, 1]Hardware name: Intel Corporation CHERRYVIEW C0 
> PLATFORM/Cherry Trail CR, BIOS CH2TCR.X64.0004.R48.1504211851 04/21/2015
> [ 1148.379397, 1] 880077801140 880073233a58 819eec6c 
> 8800732303d0
> [ 1148.387914, 1] 880073233a70 819eb0e0 88007ac92240 
> 880073233ad0
> [ 1148.396430, 1] 819f790a 8800732303d0 880073233fd8 
> 00012240
> [ 1148.404948, 1]Call Trace:
> [ 1148.407912, 1] [] dump_stack+0x4e/0x7a
> [ 1148.413872, 1] [] __schedule_bug+0x58/0x67
> [ 1148.420219, 1] [] __schedule+0x67a/0x7b0
> [ 1148.426369, 1] [] schedule+0x29/0x70
> [ 1148.432123, 1] [] schedule_timeout+0x269/0x310
> [ 1148.438860, 1] [] ? update_group_power+0x16c/0x260
> [ 1148.445988, 1] [] __down_common+0x91/0xd6
> [ 1148.452236, 1] [] ? update_cfs_rq_blocked_load+0xc0/0x130
> [ 1148.460036, 1] [] __down_timeout+0x16/0x18
> [ 1148.466380, 1] [] down_timeout+0x4c/0x60
> [ 1148.472534, 1] [] acpi_os_wait_semaphore+0x43/0x57
> [ 1148.479658, 1] [] acpi_ut_acquire_mutex+0x48/0x88
> [ 1148.486683, 1] [] ? acpi_match_device+0x4d/0x4d
> [ 1148.493516, 1] [] acpi_get_data+0x35/0x77
> [ 1148.499761, 1] [] acpi_bus_get_device+0x21/0x3e
> [ 1148.506593, 1] [] acpi_cpu_soft_notify+0x3d/0xd3
> [ 1148.513522, 1] [] notifier_call_chain+0x53/0xa0
> [ 1148.520356, 1] [] ? cpu_stop_park+0x51/0x70
> [ 1148.526801, 1] [] __raw_notifier_call_chain+0xe/0x10
> [ 1148.534118, 1] [] cpu_notify+0x23/0x50
> [ 1148.540075, 1] [] take_cpu_down+0x27/0x40
> [ 1148.546322, 1] [] multi_cpu_stop+0xc1/0x110
> [ 1148.552763, 1] [] ? cpu_stop_should_run+0x50/0x50
> [ 1148.559776, 1] [] cpu_stopper_thread+0x78/0x150
> [ 1148.566608, 1] [] ? _raw_spin_unlock_irq+0x1e/0x40
> [ 1148.573730, 1] [] ? finish_task_switch+0x57/0xd0
> [ 1148.580646, 1] [] ? __schedule+0x37e/0x7b0
> [ 1148.586991, 1] [] smpboot_thread_fn+0x17d/0x2b0
> [ 1148.593819, 1] [] ? SyS_setgroups+0x160/0x160
> [ 1148.600455, 1] [] kthread+0xe4/0x100
> [ 1148.606208, 1] [] ? kthread_create_on_node+0x190/0x190
> [ 1148.613721, 1] [] ret_from_fork+0x58/0x90
> [ 1148.619967, 1] [] ? kthread_create_on_node+0x190/0x190
> 
> Signed-off-by: Pan Xinhui 
> ---
>   drivers/acpi/osl.c | 28 +++-
>   1 file changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index 7ccba39..57a1812 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -1195,6 +1195,32 @@ void acpi_os_wait_events_complete(void)
> flush_workqueue(kacpi_notify_wq);
>   }
> 
> +static int acpi_os_down_wait(struct semaphore *sem, long jiffies_timeout)
> +{
> +   unsigned long deadline_time;
> +   int ret = 0;
> +
> +   if (down_trylock(sem)) {
> +   if (unlikely(preempt_count())) {
> +   deadline_time = jiffies + jiffies_timeout;
> +   while (true) {
> +   cpu_relax();
> +
> +   if (!down_trylock(sem))
> +   break;
> +
> +   if (time_after(jiffies, deadline_time)) {
> +   ret = -ETIME;
> +   break;
> +   }
> +   }
> +   } else
> +   ret = down_timeout(sem, jiffies_timeout);
> +   }
> +
> +   return ret;
> +}
> +
>   struct acpi_hp_work {
> struct work_struct work;
> struct acpi_device *adev;
> @@ -1309,7 +1335,7 @@ acpi_status acpi_os_wait_semaphore(acpi_handle handle, 
> u32 units, u16 timeout)
> else
> jiffies = msecs_to_jiffies(timeout);
> 
> -   ret = down_timeout(sem

[PATCH] ACPI/EC: Call acpi_walk_dep_device_list() after installing EC opregion handler

2015-03-31 Thread Lan Tianyu
On some machines (e.g. the Microsoft Surface 3), the ACPI battery depends
on the EC operation region, and the battery device has a _DEP method that
lists the EC. The current code doesn't support such devices: their
dep_unmet count is not decreased after the EC opregion handler is
installed. This prevents the battery device from being attached to its
driver. This patch fixes the issue.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=90161
Tested-and-reported-by: Lompik 
Tested-by: Valentin Lab 
Signed-off-by: Lan Tianyu 
---
 drivers/acpi/ec.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index a362f20..220d640 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -1147,6 +1147,9 @@ static int acpi_ec_add(struct acpi_device *device)
 
ret = ec_install_handlers(ec);
 
+   /* Reprobe devices depending on the EC */
+   acpi_walk_dep_device_list(ec->handle);
+
/* EC is fully operational, allow queries */
clear_bit(EC_FLAGS_QUERY_PENDING, &ec->flags);
 
-- 
1.8.4.rc0.1.g8f6a3e5.dirty

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V3] ACPI: Add _DEP(Operation Region Dependencies) support to fix battery issue on the Asus T100TA

2014-11-23 Thread Lan Tianyu
ACPI 5.0 introduces _DEP to designate device objects that OSPM should
assign a higher priority in start ordering due to future operation region
accesses.

On the Asus T100TA, ACPI battery info is read from an I2C slave device via
an I2C operation region. Before the I2C operation region handler is
installed, battery _STA always returns 0. There is a _DEP method
designating start order under the battery device node.

This patch implements the _DEP feature to fix the battery issue on the
Asus T100TA. It introduces acpi_dep_list and adds a dep_unmet count to
struct acpi_device. During the ACPI namespace scan, a struct acpi_dep_data
is created for each valid pair of master (device pointed to by _DEP) and
slave (device with _DEP); the master's and slave's ACPI handles are
recorded in it and it is put on acpi_dep_list. The slave's dep_unmet count
is increased by one for each such dependency. A driver's probe() should
return EPROBE_DEFER when it finds dep_unmet larger than 0. When the I2C
operation region handler is installed, every struct acpi_dep_data on
acpi_dep_list whose master points to the I2C host controller is removed
and the slave's dep_unmet is decreased. When dep_unmet drops to 0, all
_DEP conditions are met, and acpi_bus_attach() is called for the device,
which resolves the battery _STA issue on the Asus T100TA.

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=69011
Tested-by: Jan-Michael Brummer 
Tested-by: Adam Williamson 
Tested-by: Michael Shigorin 
Acked-by: Wolfram Sang 
Acked-by: Mika Westerberg 
Signed-off-by: Lan Tianyu 
---
Change since V2:
Coding style change
Change since V1:
Remove a redundant blank line and some coding style fixes.


 drivers/acpi/battery.c  |  4 +++
 drivers/acpi/scan.c | 85 +
 drivers/i2c/i2c-core.c  |  1 +
 include/acpi/acpi_bus.h |  1 +
 include/linux/acpi.h|  1 +
 5 files changed, 92 insertions(+)

diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
index 8ec8a89..d98ba43 100644
--- a/drivers/acpi/battery.c
+++ b/drivers/acpi/battery.c
@@ -1180,6 +1180,10 @@ static int acpi_battery_add(struct acpi_device *device)
 
if (!device)
return -EINVAL;
+
+   if (device->dep_unmet)
+   return -EPROBE_DEFER;
+
battery = kzalloc(sizeof(struct acpi_battery), GFP_KERNEL);
if (!battery)
return -ENOMEM;
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 9cb5cca..1b1cf55 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -36,6 +36,8 @@ bool acpi_force_hot_remove;
 
 static const char *dummy_hid = "device";
 
+static LIST_HEAD(acpi_dep_list);
+static DEFINE_MUTEX(acpi_dep_list_lock);
 static LIST_HEAD(acpi_bus_id_list);
 static DEFINE_MUTEX(acpi_scan_lock);
 static LIST_HEAD(acpi_scan_handlers_list);
@@ -43,6 +45,12 @@ DEFINE_MUTEX(acpi_device_lock);
 LIST_HEAD(acpi_wakeup_device_list);
 static DEFINE_MUTEX(acpi_hp_context_lock);
 
+struct acpi_dep_data {
+   struct list_head node;
+   acpi_handle master;
+   acpi_handle slave;
+};
+
 struct acpi_device_bus_id{
char bus_id[15];
unsigned int instance_no;
@@ -2193,6 +2201,59 @@ static void acpi_scan_init_hotplug(struct acpi_device *adev)
}
 }
 
+static void acpi_device_dep_initialize(struct acpi_device *adev)
+{
+   struct acpi_dep_data *dep;
+   struct acpi_handle_list dep_devices;
+   acpi_status status;
+   int i;
+
+   if (!acpi_has_method(adev->handle, "_DEP"))
+   return;
+
+   status = acpi_evaluate_reference(adev->handle, "_DEP", NULL,
+   &dep_devices);
+   if (ACPI_FAILURE(status)) {
+   dev_err(&adev->dev, "Failed to evaluate _DEP.\n");
+   return;
+   }
+
+   for (i = 0; i < dep_devices.count; i++) {
+   struct acpi_device_info *info;
+   int skip;
+
+   status = acpi_get_object_info(dep_devices.handles[i], &info);
+   if (ACPI_FAILURE(status)) {
+   dev_err(&adev->dev, "Error reading device info\n");
+   continue;
+   }
+
+   /*
+* Skip the dependency of Windows System Power
+* Management Controller
+*/
+   skip = info->valid & ACPI_VALID_HID &&
+   !strcmp(info->hardware_id.string, "INT3396");
+
+   kfree(info);
+
+   if (skip)
+   continue;
+
+   dep = kzalloc(sizeof(struct acpi_dep_data), GFP_KERNEL);
+   if (!dep)
+   return;
+
+   dep->master = dep_devices.handles[i];
+   dep->slave  = adev->handle;
+   adev->dep_unmet++;
+
+   mutex_lock(&acpi_dep_list_lock);
+   

[PATCH V2] ACPI: Add _DEP(Operation Region Dependencies) support to fix battery issue on the Asus T100TA

2014-11-20 Thread Lan Tianyu
ACPI 5.0 introduces _DEP to designate device objects that OSPM should
assign a higher priority in start ordering due to future operation region
accesses.

On the Asus T100TA, ACPI battery info is read from an I2C slave device via
an I2C operation region. Before the I2C operation region handler is
installed, battery _STA always returns 0. There is a _DEP method
designating start order under the battery device node.

This patch implements the _DEP feature to fix the battery issue on the
Asus T100TA. It introduces acpi_dep_list and adds a dep_unmet count to
struct acpi_device. During the ACPI namespace scan, a struct acpi_dep_data
is created for each valid pair of master (device pointed to by _DEP) and
slave (device with _DEP); the master's and slave's ACPI handles are
recorded in it and it is put on acpi_dep_list. The slave's dep_unmet count
is increased by one for each such dependency. A driver's probe() should
return EPROBE_DEFER when it finds dep_unmet larger than 0. When the I2C
operation region handler is installed, every struct acpi_dep_data on
acpi_dep_list whose master points to the I2C host controller is removed
and the slave's dep_unmet is decreased. When dep_unmet drops to 0, all
_DEP conditions are met, and acpi_bus_attach() is called for the device,
which resolves the battery _STA issue on the Asus T100TA.

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=69011
Tested-by: Jan-Michael Brummer 
Tested-by: Adam Williamson 
Tested-by: Michael Shigorin 
Acked-by: Wolfram Sang 
Signed-off-by: Lan Tianyu 
---
Change since V1:
Remove a redundant blank line and some coding style fixes.
   
 drivers/acpi/battery.c  |  4 +++
 drivers/acpi/scan.c | 86 +
 drivers/i2c/i2c-core.c  |  1 +
 include/acpi/acpi_bus.h |  1 +
 include/linux/acpi.h|  1 +
 5 files changed, 93 insertions(+)

diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
index 8ec8a89..d98ba43 100644
--- a/drivers/acpi/battery.c
+++ b/drivers/acpi/battery.c
@@ -1180,6 +1180,10 @@ static int acpi_battery_add(struct acpi_device *device)
 
if (!device)
return -EINVAL;
+
+   if (device->dep_unmet)
+   return -EPROBE_DEFER;
+
battery = kzalloc(sizeof(struct acpi_battery), GFP_KERNEL);
if (!battery)
return -ENOMEM;
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 9cb5cca..54a4102 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -36,6 +36,8 @@ bool acpi_force_hot_remove;
 
 static const char *dummy_hid = "device";
 
+static LIST_HEAD(acpi_dep_list);
+static DEFINE_MUTEX(acpi_dep_list_lock);
 static LIST_HEAD(acpi_bus_id_list);
 static DEFINE_MUTEX(acpi_scan_lock);
 static LIST_HEAD(acpi_scan_handlers_list);
@@ -43,6 +45,12 @@ DEFINE_MUTEX(acpi_device_lock);
 LIST_HEAD(acpi_wakeup_device_list);
 static DEFINE_MUTEX(acpi_hp_context_lock);
 
+struct acpi_dep_data {
+   struct list_head node;
+   acpi_handle master;
+   acpi_handle slave;
+};
+
 struct acpi_device_bus_id{
char bus_id[15];
unsigned int instance_no;
@@ -2193,6 +2201,60 @@ static void acpi_scan_init_hotplug(struct acpi_device *adev)
}
 }
 
+static void acpi_device_dep_initialize(struct acpi_device *adev)
+{
+   struct acpi_dep_data *dep;
+   struct acpi_handle_list dep_devices;
+   struct acpi_device_info *info;
+   acpi_status status;
+   int i, skip;
+
+   if (!acpi_has_method(adev->handle, "_DEP"))
+   return;
+
+   status = acpi_evaluate_reference(adev->handle, "_DEP", NULL,
+   &dep_devices);
+   if (ACPI_FAILURE(status)) {
+   dev_err(&adev->dev, "Failed to evaluate _DEP.\n");
+   return;
+   }
+
+   for (i = 0; i < dep_devices.count; i++) {
+   status = acpi_get_object_info(dep_devices.handles[i], &info);
+   if (ACPI_FAILURE(status)) {
+   dev_err(&adev->dev, "Error reading device info\n");
+   continue;
+   }
+
+   /*
+* Skip the dependency of Windows System Power
+* Management Controller
+*/
+   if (info->valid & ACPI_VALID_HID
+   && !strcmp(info->hardware_id.string, "INT3396"))
+   skip = 1;
+   else
+   skip = 0;
+
+   kfree(info);
+
+   if (skip)
+   continue;
+
+   dep = kzalloc(sizeof(struct acpi_dep_data), GFP_KERNEL);
+   if (!dep)
+   return;
+
+   dep->master = dep_devices.handles[i];
+   dep->slave  = adev->handle;
+   adev->dep_unmet++;
+
+   mutex_lock(&acpi_dep_list_lock);
+   list_add_tail(&

Re: [PATCH] ACPI: Add _DEP(Operation Region Dependencies) support to fix battery issue on the Asus T100TA

2014-11-19 Thread Lan Tianyu
Hi Mika:
Thank you for review. Will resolve your comments in the next version.

-- 
Best regards
Tianyu Lan


Re: [PATCH] ACPI: Add _DEP(Operation Region Dependencies) support to fix battery issue on the Asus T100TA

2014-11-19 Thread Lan Tianyu
On 2014-11-20 00:05, Wolfram Sang wrote:
> As long as the I2C related changes are that minimal:
> 
> Acked-by: Wolfram Sang  for the I2C part
> 

Thank you!

-- 
Best regards
Tianyu Lan


[PATCH] ACPI: Add _DEP(Operation Region Dependencies) support to fix battery issue on the Asus T100TA

2014-11-19 Thread Lan Tianyu
ACPI 5.0 introduces _DEP to designate device objects that OSPM should
assign a higher priority in start ordering due to future operation region
accesses.

On the Asus T100TA, ACPI battery info is read from an I2C slave device via
an I2C operation region. Before the I2C operation region handler is
installed, battery _STA always returns 0. There is a _DEP method
designating start order under the battery device node.

This patch implements the _DEP feature to fix the battery issue on the
Asus T100TA. It introduces acpi_dep_list and adds a dep_unmet count to
struct acpi_device. During the ACPI namespace scan, a struct acpi_dep_data
is created for each valid pair of master (device pointed to by _DEP) and
slave (device with _DEP); the master's and slave's ACPI handles are
recorded in it and it is put on acpi_dep_list. The slave's dep_unmet count
is increased by one for each such dependency. A driver's probe() should
return EPROBE_DEFER when it finds dep_unmet larger than 0. When the I2C
operation region handler is installed, every struct acpi_dep_data on
acpi_dep_list whose master points to the I2C host controller is removed
and the slave's dep_unmet is decreased. When dep_unmet drops to 0, all
_DEP conditions are met, and acpi_bus_attach() is called for the device,
which resolves the battery _STA issue on the Asus T100TA.

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=69011 
Tested-by: Jan-Michael Brummer 
Tested-by: Adam Williamson 
Tested-by: Michael Shigorin 
Signed-off-by: Lan Tianyu 
---
 drivers/acpi/battery.c  |  4 +++
 drivers/acpi/scan.c | 89 +
 drivers/i2c/i2c-core.c  |  1 +
 include/acpi/acpi_bus.h |  1 +
 include/linux/acpi.h|  3 ++
 5 files changed, 98 insertions(+)

diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
index 8ec8a89..d98ba43 100644
--- a/drivers/acpi/battery.c
+++ b/drivers/acpi/battery.c
@@ -1180,6 +1180,10 @@ static int acpi_battery_add(struct acpi_device *device)
 
if (!device)
return -EINVAL;
+
+   if (device->dep_unmet)
+   return -EPROBE_DEFER;
+
battery = kzalloc(sizeof(struct acpi_battery), GFP_KERNEL);
if (!battery)
return -ENOMEM;
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 9cb5cca..0e58666 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -36,6 +36,8 @@ bool acpi_force_hot_remove;
 
 static const char *dummy_hid = "device";
 
+static LIST_HEAD(acpi_dep_list);
+static DEFINE_MUTEX(acpi_dep_list_lock);
 static LIST_HEAD(acpi_bus_id_list);
 static DEFINE_MUTEX(acpi_scan_lock);
 static LIST_HEAD(acpi_scan_handlers_list);
@@ -43,6 +45,12 @@ DEFINE_MUTEX(acpi_device_lock);
 LIST_HEAD(acpi_wakeup_device_list);
 static DEFINE_MUTEX(acpi_hp_context_lock);
 
+struct acpi_dep_data {
+   struct list_head node;
+   acpi_handle master;
+   acpi_handle slave;
+};
+
 struct acpi_device_bus_id{
char bus_id[15];
unsigned int instance_no;
@@ -2193,6 +2201,61 @@ static void acpi_scan_init_hotplug(struct acpi_device *adev)
}
 }
 
+static void acpi_device_dep_initialize(struct acpi_device *adev)
+{
+   struct acpi_dep_data *dep;
+   struct acpi_handle_list dep_devices;
+   struct acpi_device_info *info;
+   acpi_status status;
+   int i, skip;
+
+   if (!acpi_has_method(adev->handle, "_DEP"))
+   return;
+
+   status = acpi_evaluate_reference(adev->handle, "_DEP", NULL,
+   &dep_devices);
+   if (ACPI_FAILURE(status)) {
+   dev_err(&adev->dev, "Failed to evaluate _DEP.\n");
+   return;
+   }
+
+   for (i = 0; i < dep_devices.count; i++) {
+
+   status = acpi_get_object_info(dep_devices.handles[i], &info);
+   if (ACPI_FAILURE(status)) {
+   dev_err(&adev->dev, "Error reading device info\n");
+   continue;
+   }
+
+   /*
+* Skip the dependency of Windows System Power
+* Management Controller
+*/
+   if (info->valid & ACPI_VALID_HID
+   && !strcmp(info->hardware_id.string, "INT3396"))
+   skip = 1;
+   else
+   skip = 0;
+
+   kfree(info);
+
+   if (skip)
+   continue;
+
+   dep = kzalloc(sizeof(struct acpi_dep_data), GFP_KERNEL);
+   if (!dep)
+   return;
+
+   dep->master = dep_devices.handles[i];
+   dep->slave  = adev->handle;
+   adev->dep_unmet++;
+
+   mutex_lock(&acpi_dep_list_lock);
+   list_add_tail(&dep->node , &acpi_dep_list);
+   mutex_unlock(&acpi_dep_list_lock);
+   }
+}
+
 s

[RFC PATCH V2] ACPI: Add _DEP(Operation Region Dependencies) support to fix battery issue on the Asus T100TA

2014-10-27 Thread Lan Tianyu
ACPI 5.0 introduces _DEP to designate device objects that OSPM should
assign a higher priority in start ordering due to future operation region
accesses.

On the Asus T100TA, ACPI battery info is read from an I2C slave device via
an I2C operation region. Before the I2C operation region handler is
installed, battery _STA always returns 0. There is a _DEP method
designating start order under the battery device node.

This patch implements the _DEP feature to fix the battery issue on the
Asus T100TA. It introduces acpi_dep_list and adds a dep_unmet count to
struct acpi_device. During the ACPI namespace scan, a struct acpi_dep_data
is created for each valid pair of master (device pointed to by _DEP) and
slave (device with _DEP); the master's and slave's ACPI handles are
recorded in it and it is put on acpi_dep_list. The slave's dep_unmet count
is increased by one for each such dependency. A driver's probe() should
return EPROBE_DEFER when it finds dep_unmet larger than 0. When the I2C
operation region handler is installed, every struct acpi_dep_data on
acpi_dep_list whose master points to the I2C host controller is removed
and the slave's dep_unmet is decreased. When dep_unmet drops to 0, all
_DEP conditions are met, and acpi_bus_attach() is called for the device,
which resolves the battery _STA issue on the Asus T100TA.

Signed-off-by: Lan Tianyu 
---
Change since V1:
Rework struct acpi_dep_data to record the master's and slave's ACPI handles.
Rename dep_present to dep_unmet and make it a count. Increase dep_unmet
during bootup and decrease it when the associated operation region handler
is installed.

 drivers/acpi/scan.c | 91 +
 drivers/i2c/i2c-core.c  |  1 +
 include/acpi/acpi_bus.h |  1 +
 include/linux/acpi.h|  3 ++
 4 files changed, 96 insertions(+)

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 4da55d8..bfa9c77 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -36,6 +36,8 @@ bool acpi_force_hot_remove;
 
 static const char *dummy_hid = "device";
 
+static LIST_HEAD(acpi_dep_list);
+static DEFINE_MUTEX(acpi_dep_list_lock);
 static LIST_HEAD(acpi_bus_id_list);
 static DEFINE_MUTEX(acpi_scan_lock);
 static LIST_HEAD(acpi_scan_handlers_list);
@@ -43,6 +45,12 @@ DEFINE_MUTEX(acpi_device_lock);
 LIST_HEAD(acpi_wakeup_device_list);
 static DEFINE_MUTEX(acpi_hp_context_lock);
 
+struct acpi_dep_data {
+   struct list_head node;
+   acpi_handle master;
+   acpi_handle slave;
+};
+
 struct acpi_device_bus_id{
char bus_id[15];
unsigned int instance_no;
@@ -2121,6 +2129,63 @@ static void acpi_scan_init_hotplug(struct acpi_device *adev)
}
 }
 
+static void acpi_device_dep_initialize(struct acpi_device *adev)
+{
+   struct acpi_dep_data *dep;
+   struct acpi_handle_list dep_devices;
+   struct acpi_device_info *info;
+   acpi_status status;
+   int i, skip;
+
+   if (!acpi_has_method(adev->handle, "_DEP"))
+   return;
+
+   status = acpi_evaluate_reference(adev->handle, "_DEP", NULL,
+   &dep_devices);
+   if (ACPI_FAILURE(status)) {
+   dev_err(&adev->dev, "Failed to evaluate _DEP.\n");
+   return;
+   }
+
+   for (i = 0; i < dep_devices.count; i++) {
+
+   status = acpi_get_object_info(dep_devices.handles[i], &info);
+   if (ACPI_FAILURE(status)) {
+   dev_err(&adev->dev, "Error reading device info\n");
+   continue;
+   }
+
+   /*
+* Skip the dependency of Windows System Power
+* Management Controller
+*/
+   if (info->valid & ACPI_VALID_HID
+   && !strcmp(info->hardware_id.string, "INT3396"))
+   skip = 1;
+   else
+   skip = 0;
+
+   kfree(info);
+
+   if (skip)
+   continue;
+
+   dep = kzalloc(sizeof(struct acpi_dep_data), GFP_KERNEL);
+   if (!dep) {
+   dev_err(&adev->dev, "Not enough memory for _DEP list entry.\n");
+   return;
+   }
+
+   dep->master = dep_devices.handles[i];
+   dep->slave  = adev->handle;
+   adev->dep_unmet++;
+
+   mutex_lock(&acpi_dep_list_lock);
+   list_add_tail(&dep->node , &acpi_dep_list);
+   mutex_unlock(&acpi_dep_list_lock);
+   }
+}
+
 static acpi_status acpi_bus_check_add(acpi_handle handle, u32 lvl_not_used,
  void *not_used, void **return_value)
 {
@@ -2147,6 +2212,7 @@ static acpi_status acpi_bus_check_add(acpi_handle handle, u32 lvl_not_used,
return AE_CTRL_DEPTH

Re: [RFC PATCH V3 0/3] PM/CPU: Parallel enabling nonboot cpus with resume devices

2014-10-21 Thread Lan Tianyu
On 2014-10-09 09:18, Lan Tianyu wrote:
> On 2014-10-09 04:54, Peter Zijlstra wrote:
>> On Thu, Sep 25, 2014 at 04:32:02PM +0800, Lan Tianyu wrote:
>>> This patchset is to parallel enabling nonboot cpus with resuming devices
>>> during system resume in order to accelerate S2RAM. From test result on
>>> a 8 logical core Haswell machine, system resume time reduces from 347ms
>>> to 217ms with this patchset.
>>>
>>> In the current world, all nonboot cpus are enabled serially during system
>>> resume. System resume sequence is that boot cpu enables nonboot cpu one by
>>> one and then resume devices. Before resuming devices, there are few tasks
>>> assigned to nonboot cpus after they are brought up. This wastes cpu usage.
>>>
>>> This patchset is to allow boot cpu to go forward to resume devices after
>>> bringing up one nonboot cpu and starting a thread. The thread will be in
>>> charge of bringing up other frozen cpus. The thread will be scheduled to
>>> the first online cpu to run . This makes enabling cpu2~x parallel with
>>> resuming devices.
>>
>> So I feel we should really clean up hotplug before doing anything like
>> this. Its a trainwreck and just piling more and more hacks ontop isn't
>> going to help any.
>>
> 
> Hi Peter:
>   Sorry, I don't know the gap of hotplug clearly. Could you elaborate it?
> 

Hi Peter:
Could you elaborate on the gaps in hotplug? Thanks.

-- 
Best regards
Tianyu Lan


Re: [RFC PATCH V3 0/3] PM/CPU: Parallel enabling nonboot cpus with resume devices

2014-10-08 Thread Lan Tianyu
On 2014-10-09 04:54, Peter Zijlstra wrote:
> On Thu, Sep 25, 2014 at 04:32:02PM +0800, Lan Tianyu wrote:
>> This patchset is to parallel enabling nonboot cpus with resuming devices
>> during system resume in order to accelerate S2RAM. From test result on
>> a 8 logical core Haswell machine, system resume time reduces from 347ms
>> to 217ms with this patchset.
>>
>> In the current world, all nonboot cpus are enabled serially during system
>> resume. System resume sequence is that boot cpu enables nonboot cpu one by
>> one and then resume devices. Before resuming devices, there are few tasks
>> assigned to nonboot cpus after they are brought up. This wastes cpu usage.
>>
>> This patchset is to allow boot cpu to go forward to resume devices after
>> bringing up one nonboot cpu and starting a thread. The thread will be in
>> charge of bringing up other frozen cpus. The thread will be scheduled to
>> the first online cpu to run . This makes enabling cpu2~x parallel with
>> resuming devices.
> 
> So I feel we should really clean up hotplug before doing anything like
> this. Its a trainwreck and just piling more and more hacks ontop isn't
> going to help any.
> 

Hi Peter:
Sorry, I don't clearly understand the gaps in hotplug. Could you elaborate?
-- 
Best regards
Tianyu Lan


Re: [PATCH 1/1] Cpufreq: Keep cpufreq sysfs nodes across S2RAM/S2DISK when using intel_pstate driver.

2014-10-07 Thread Lan Tianyu
Sorry for the late response; I'm just back from vacation.

On 2014-09-29 16:20, Viresh Kumar wrote:
> But this change is buggy.. Because you are updating 'cpufreq_suspended'
> before actually stopping the governor, any calls to __cpufreq_governor()
> will be converted to NO-operations because of this in __cpufreq_governor():
> 
> /* Don't start any governor operations if we are entering suspend */
> if (cpufreq_suspended)
> return 0;
> 
> And so the governor's will never stop :(
> 
> So you need to keep the above line where it was :)
> 

Yes, you are right. Thanks for fixing it.
-- 
Best regards
Tianyu Lan


Re: [RFC PATCH] ACPI: Add _DEP(Operation Region Dependencies) support to fix battery issue on the Asus T100TA

2014-09-25 Thread Lan Tianyu
On 2014-09-26 03:27, Rafael J. Wysocki wrote:
> I'm not sure what you mean.  "Dependent" means "depending on something", so 
> the
> question reads "This requires the devices with _DEP to have a list of devices
> that depend on them" which is probably not what you meant.
> 

Sorry, I didn't put that clearly. By "dependent device" I meant the device
pointed to by _DEP (the master you mentioned at the bottom). I thought the
master also needed a list to find its slaves (devices with _DEP).

> For each device with _DEP we have dep_devices, so if you pass a pointer
> (opregion_adev) to the device that has just installed an operation region
> handler to acpi_walk_dep_device_list() as an argument, then you can do
> 
>   for (i = 0; i < adev->dep_devices.count; i++)
>   if (opregion_adev->handle == adev->dep_devices.handles[i]) {
>   adev->dep_unmet = false;
>   acpi_bus_attach(adev);
>   list_del(&dep->node);
>   kfree(dep);
>   }
> 
> and of course appropriate locking needs to be there in case this races with
> enumeration during hotplug after loading a new ACPI table on demand).
> 

Yes, we can scan all devices on the list and match the opregion_adev
with adev->dep_devices. This is a comparatively simple solution.

> I think you can even define
>   
> struct acpi_dep_data {
>   struct list_head node;
>   struct acpi_device *master;
>   struct acpi_device *slave;
> };
> 
> and create that for every valid pair of master (device pointed to by 
> _DEP)/slave
> (device with _DEP) and create a list of these.  Then, you won't need 
> dep_devices
> in struct acpi_device any more and your acpi_walk_dep_device_list() will only
> need to walk the list until it finds the matching master/slave pair.

One question is when to create struct acpi_dep_data for the dependency
relationship between master and slave. If this is done when the slave's
ACPI device is created during the ACPI namespace scan, the master's ACPI
device may not have been created yet. So acpi_handle may be more suitable
than struct acpi_device here.

> 
> That will handle the case when one device depends on multiple other devices 
> too
> I think.
> 
 > >> + dep_adev = acpi_bus_get_acpi_device(
 > >> + adev->dep_devices.handles[i]);
>> > 


-- 
Best regards
Tianyu Lan


Re: [RFC PATCH] ACPI: Add _DEP(Operation Region Dependencies) support to fix battery issue on the Asus T100TA

2014-09-25 Thread Lan Tianyu
On 2014-09-25 06:27, Rafael J. Wysocki wrote:
> On Tuesday, September 23, 2014 03:06:43 PM Lan Tianyu wrote:
>> ACPI 5.0 introduces _DEP to designate device objects that OSPM should
>> assign a higher priority in start ordering due to future operation region
>> accesses.
>>
>> On Asus T100TA, ACPI battery info are read from a I2C slave device via
>> I2C operation region. Before I2C operation region handler is installed,
>> battery _STA always returns 0. There is a _DEP method of designating
>> start order under battery device node.
>>
>> This patch is to implement _DEP feature to fix battery issue on the Asus 
>> T100TA.
>> Introducing acpi_bus_dep_device_list and adding dep_present flags in the 
>> struct
>> acpi_device. During ACPI namespace scan, all devices with _DEP support will 
>> be put
>> into the new list and those devices' dep_present flag will be set. Driver's 
>> probe()
>> should return EPROBE_DEFER when find dep_present is set. When I2C operation
>> region handler is installed, check all devices on the new list. Remove the 
>> one from
>> list if _DEP condition is met and clear its dep_present flag and do 
>> acpi_bus_attach()
>> for the device in order to resolve battery _STA issue on the Asus T100TA.
>>
>> Signed-off-by: Lan Tianyu 
> 
> This is going in the right direction in my view, but isn't there just yet.
> 
> Details below.

Thanks for review.

> 
>> ---
>>  drivers/acpi/battery.c  |  4 +++
>>  drivers/acpi/scan.c | 84 +
>>  drivers/i2c/i2c-acpi.c  |  1 +
>>  include/acpi/acpi_bus.h |  2 ++
>>  include/linux/acpi.h|  3 ++
>>  5 files changed, 94 insertions(+)
>>
>> diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
>> index 1c162e7..c0a68ce 100644
>> --- a/drivers/acpi/battery.c
>> +++ b/drivers/acpi/battery.c
>> @@ -1194,6 +1194,10 @@ static int acpi_battery_add(struct acpi_device *device)
>>  
>>  if (!device)
>>  return -EINVAL;
>> +
>> +if (device->dep_present)
> 
> device->flags.dep_present would be better.  Or even call the flag dep_unmet.

Ok. Will update.

> 
>> +return -EPROBE_DEFER;
>> +
>>  battery = kzalloc(sizeof(struct acpi_battery), GFP_KERNEL);
>>  if (!battery)
>>  return -ENOMEM;
>> diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
>> index 3bf7764..a26dbb3 100644
>> --- a/drivers/acpi/scan.c
>> +++ b/drivers/acpi/scan.c
>> @@ -36,6 +36,7 @@ bool acpi_force_hot_remove;
>>  
>>  static const char *dummy_hid = "device";
>>  
>> +static LIST_HEAD(acpi_bus_dep_device_list);
>>  static LIST_HEAD(acpi_bus_id_list);
>>  static DEFINE_MUTEX(acpi_scan_lock);
>>  static LIST_HEAD(acpi_scan_handlers_list);
>> @@ -43,6 +44,11 @@ DEFINE_MUTEX(acpi_device_lock);
>>  LIST_HEAD(acpi_wakeup_device_list);
>>  static DEFINE_MUTEX(acpi_hp_context_lock);
>>  
>> +struct acpi_dep_data {
>> +struct list_head node;
>> +struct acpi_device *adev;
>> +};
>> +
>>  struct acpi_device_bus_id{
>>  char bus_id[15];
>>  unsigned int instance_no;
>> @@ -2048,6 +2054,32 @@ static void acpi_scan_init_hotplug(struct acpi_device *adev)
>>  }
>>  }
>>  
>> +static void acpi_device_dep_initialize(struct acpi_device * adev)
>> +{
>> +struct acpi_dep_data *dep;
>> +acpi_status status;
>> +
>> +if (!acpi_has_method(adev->handle, "_DEP"))
>> +return;
>> +
>> +status = acpi_evaluate_reference(adev->handle, "_DEP", NULL,
>> +&adev->dep_devices);
>> +if (ACPI_FAILURE(status)) {
>> +dev_err(&adev->dev, "Fail to evaluate _DEP.\n");
> 
> "Failed"
> 
>> +return;
>> +}
>> +
>> +dep = kzalloc(sizeof(struct acpi_dep_data), GFP_KERNEL);
>> +if (!dep) {
>> +dev_err(&adev->dev, "Memory allocation error.\n");
> 
> "Not enough memory for _DEP list entry\n"
> 
>> +return;
>> +}
>> +
>> +dep->adev = adev;
>> +adev->dep_present = true;
>> +list_add_tail(&dep->node , &acpi_bus_dep_device_list);
>> +}
>> +
>>  static acpi_status acpi_bus_check_add(acpi_handle handle, u32 lvl_not_used,
>>v

[RFC PATCH V2 2/3] X86/CPU: Initialize MTRR/PAT when each cpu is online during system resume.

2014-09-25 Thread Lan Tianyu
The SDM Vol 3a section titled "MTRR considerations in MP systems"
specifies the need to synchronize the logical CPUs while
initializing/updating MTRRs.

Commit d0af9eed5a ("x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT
init") delays the MTRR/PAT init for APs until all the logical CPUs have
come online; the rendezvous process at the end of AP bringup initializes
the MTRR/PAT for all APs during boot and resume.

With this patchset, AP enabling is parallelized with device resume during
system resume, and an AP may be assigned tasks before all APs have been
brought up. MTRR/PAT must be initialized before running tasks, so this
patch removes set_mtrr_aps_delayed_init() for system resume and does the
MTRR/PAT init when each AP comes online, just as for a dynamic single-CPU
online.
Signed-off-by: Lan Tianyu 
---
 kernel/cpu.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 24c4889..30ffdd9 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -592,13 +592,10 @@ static int async_enable_nonboot_cpus(void *data)
 {
int cpu;
 
-   arch_enable_nonboot_cpus_begin();
-
for_each_cpu(cpu, frozen_cpus) {
_cpu_up_for_pm(cpu);
}
 
-   arch_enable_nonboot_cpus_end();
cpumask_clear(frozen_cpus);
cpu_maps_update_done();
return 0;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
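The init-on-online policy above can be modeled in plain pthreads. This is a hedged userspace sketch, not kernel code: `mtrr_initialized[]`, `run_task()`, `bring_up_aps()` and `NR_CPUS` are illustrative stand-ins. Each "AP" programs its local state first and only then accepts work, so no task ever observes uninitialized MTRR/PAT:

```c
#include <pthread.h>

#define NR_CPUS 4

/* Per-"CPU" flag standing in for MTRR/PAT having been programmed. */
int mtrr_initialized[NR_CPUS];
int tasks_run_before_init;          /* non-zero would mean the bug */

static void run_task(int cpu)
{
	if (!mtrr_initialized[cpu])
		tasks_run_before_init++;    /* task saw stale MTRR/PAT state */
}

/* AP startup path: program MTRR/PAT first, then accept work.  This is
 * the per-CPU-online policy, instead of a deferred all-APs rendezvous. */
static void *ap_startup(void *arg)
{
	int cpu = (int)(long)arg;

	mtrr_initialized[cpu] = 1;      /* stands in for the real MTRR/PAT init */
	run_task(cpu);                  /* tasks may arrive immediately */
	return NULL;
}

void bring_up_aps(void)
{
	pthread_t t[NR_CPUS];

	for (long cpu = 1; cpu < NR_CPUS; cpu++)
		pthread_create(&t[cpu], NULL, ap_startup, (void *)cpu);
	for (long cpu = 1; cpu < NR_CPUS; cpu++)
		pthread_join(t[cpu], NULL);
}
```

With the deferred-rendezvous policy, `run_task()` could fire before the init line runs; ordering the init inside each AP's own startup path removes that window.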

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH V3 3/3] Cpufreq: Hold cpu_add_remove_lock before change cpufreq_suspended flag

2014-09-25 Thread Lan Tianyu
Enabling non-boot cpus now runs in parallel with resuming devices, so a
cpu may come online after cpufreq resume. But all cpus should be up
before the cpufreq_suspended flag is cleared in cpufreq_resume(): the
cpufreq core uses this flag to choose between light-weight init and full
init, and light-weight init/teardown is dedicated to cpu hotplug during
system PM. To keep this rule, hold cpu_add_remove_lock while changing
the cpufreq_suspended flag.

Signed-off-by: Lan Tianyu 
---
 drivers/cpufreq/cpufreq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 37951ec..b03f7dd 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1689,7 +1689,9 @@ void cpufreq_resume(void)
if (!cpufreq_driver)
return;
 
+   cpu_maps_update_begin();
cpufreq_suspended = false;
+   cpu_maps_update_done();
 
if (!has_target())
return;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
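The invariant the patch enforces can be sketched in userspace with a mutex standing in for cpu_add_remove_lock. This is a hedged model, not kernel code: `cpu_online_event()`, `cpufreq_resume_clear_suspended()` and the counters are illustrative. Both the flag writer and the hotplug reader take the same lock, so a cpu coming online always sees a consistent cpufreq_suspended value:

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-ins for cpu_add_remove_lock and cpufreq_suspended. */
pthread_mutex_t cpu_add_remove_lock = PTHREAD_MUTEX_INITIALIZER;
bool cpufreq_suspended = true;
int light_weight_inits, full_inits;

/* Hotplug path: reads the flag under the lock to pick the init flavor. */
void cpu_online_event(void)
{
	pthread_mutex_lock(&cpu_add_remove_lock);
	if (cpufreq_suspended)
		light_weight_inits++;   /* system-PM hotplug: light-weight init */
	else
		full_inits++;           /* dynamic hotplug: full init */
	pthread_mutex_unlock(&cpu_add_remove_lock);
}

/* cpufreq_resume() side: flip the flag only while holding the same
 * lock, so no cpu can come online in the middle of the transition. */
void cpufreq_resume_clear_suspended(void)
{
	pthread_mutex_lock(&cpu_add_remove_lock);
	cpufreq_suspended = false;
	pthread_mutex_unlock(&cpu_add_remove_lock);
}
```

Without the lock around the flag flip, a cpu brought up concurrently could race the transition and do a full init while the PM path still expects the light-weight one.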



[RFC PATCH V3 0/3] PM/CPU: Parallel enabling nonboot cpus with resume devices

2014-09-25 Thread Lan Tianyu
This patchset parallels the enabling of nonboot cpus with resuming
devices during system resume in order to accelerate S2RAM. In a test on
an 8-logical-core Haswell machine, system resume time drops from 347ms
to 217ms with this patchset.

Currently, all nonboot cpus are enabled serially during system resume:
the boot cpu enables the nonboot cpus one by one and only then resumes
devices. Few tasks are assigned to the nonboot cpus right after they are
brought up, which wastes cpu time.

This patchset allows the boot cpu to go forward and resume devices after
bringing up one nonboot cpu and starting a thread. The thread takes
charge of bringing up the remaining frozen cpus and is scheduled on the
first online cpu. This makes enabling cpu2~x run in parallel with
resuming devices.

Patch 2 changes the MTRR/PAT init policy for nonboot cpus. The original
code initializes MTRR/PAT for all nonboot cpus only after they have all
come online during system resume. Since nonboot-cpu enabling now runs in
parallel with device resume and nonboot cpus can be assigned tasks
before all cpus are online, MTRR/PAT must be initialized as soon as each
nonboot cpu comes online, just like dynamic single-cpu online.

Patch 3 guarantees that all cpus are online before the cpufreq_suspended
flag is changed in cpufreq_resume(), to avoid breaking the cpufreq
subsystem.
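The bring-up handoff described in this cover letter can be modeled with pthreads. This is a hedged userspace sketch, not the kernel implementation: `cpu_up()`, `cpu_online[]` and the final join are illustrative. The boot thread brings up one "CPU" synchronously, spawns a helper for the rest, and immediately moves on to resuming devices:

```c
#include <pthread.h>
#include <stdio.h>

#define NR_CPUS 8

int cpu_online[NR_CPUS];

/* Stand-in for _cpu_up_for_pm(): mark one "CPU" online. */
static void cpu_up(int cpu)
{
	cpu_online[cpu] = 1;
}

/* Helper thread: brings up CPUs 2..NR_CPUS-1, like
 * async_enable_nonboot_cpus() in the patchset. */
static void *async_enable_rest(void *arg)
{
	(void)arg;
	for (int cpu = 2; cpu < NR_CPUS; cpu++)
		cpu_up(cpu);
	return NULL;
}

/* Boot CPU: bring up CPU1 synchronously, hand the rest to a thread,
 * then proceed to "resume devices" while the thread keeps working. */
int enable_nonboot_cpus_parallel(void)
{
	pthread_t tsk;

	cpu_up(1);
	if (pthread_create(&tsk, NULL, async_enable_rest, NULL) != 0)
		return -1;          /* real code falls back to serial bring-up */

	printf("resuming devices while CPUs come up...\n");

	pthread_join(tsk, NULL);    /* demo only; the kernel thread is not joined */
	return 0;
}
```

The saving comes from overlapping the per-CPU bring-up cost with device resume instead of serializing the two phases.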

Lan Tianyu (3):
  PM/CPU: Parallel enabling nonboot cpus with resume devices
  X86/CPU: Initialize MTRR/PAT when each cpu is online during system
resume.
  Cpufreq: Hold cpu_add_remove_lock before change cpufreq_suspended flag

 drivers/cpufreq/cpufreq.c |  2 ++
 kernel/cpu.c  | 64 +++
 2 files changed, 55 insertions(+), 11 deletions(-)

-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC PATCH V3 1/3] PM/CPU: Parallel enabling nonboot cpus with resume devices

2014-09-25 Thread Lan Tianyu
Currently, all nonboot cpus are enabled serially during system resume:
the boot cpu enables the nonboot cpus one by one and then resumes
devices. Few tasks are assigned to the nonboot cpus right after they are
brought up, which wastes cpu time.

To accelerate S3, this patchset allows the boot cpu to go forward and
resume devices after bringing up one nonboot cpu and starting a thread.
The thread takes charge of bringing up the remaining frozen cpus and is
scheduled on the first online cpu. This makes enabling cpu2~x run in
parallel with resuming devices.

Signed-off-by: Lan Tianyu 
---
 kernel/cpu.c | 67 ++--
 1 file changed, 56 insertions(+), 11 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 356450f..24c4889 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -572,8 +572,41 @@ void __weak arch_enable_nonboot_cpus_end(void)
 {
 }
 
+static int _cpu_up_for_pm(int cpu)
+{
+   int error;
+
+   trace_suspend_resume(TPS("CPU_ON"), cpu, true);
+   error = _cpu_up(cpu, 1);
+   trace_suspend_resume(TPS("CPU_ON"), cpu, false);
+   if (error) {
+   pr_warn("Error taking CPU%d up: %d\n", cpu, error);
+   return error;
+   }
+
+   pr_info("CPU%d is up\n", cpu);
+   return 0;
+}
+
+static int async_enable_nonboot_cpus(void *data)
+{
+   int cpu;
+
+   arch_enable_nonboot_cpus_begin();
+
+   for_each_cpu(cpu, frozen_cpus) {
+   _cpu_up_for_pm(cpu);
+   }
+
+   arch_enable_nonboot_cpus_end();
+   cpumask_clear(frozen_cpus);
+   cpu_maps_update_done();
+   return 0;
+}
+
 void __ref enable_nonboot_cpus(void)
 {
+   struct task_struct *tsk;
int cpu, error;
 
/* Allow everyone to use the CPU hotplug again */
@@ -584,21 +617,33 @@ void __ref enable_nonboot_cpus(void)
 
pr_info("Enabling non-boot CPUs ...\n");
 
-   arch_enable_nonboot_cpus_begin();
+   cpu = cpumask_first(frozen_cpus);
+   cpumask_clear_cpu(cpu, frozen_cpus);
+   error = _cpu_up_for_pm(cpu);
+   if (error) {
+   pr_err("Failed to bring up first non-boot cpu.\n");
+   goto fail;
+   }
 
-   for_each_cpu(cpu, frozen_cpus) {
-   trace_suspend_resume(TPS("CPU_ON"), cpu, true);
-   error = _cpu_up(cpu, 1);
-   trace_suspend_resume(TPS("CPU_ON"), cpu, false);
-   if (!error) {
-   pr_info("CPU%d is up\n", cpu);
-   continue;
-   }
-   pr_warn("Error taking CPU%d up: %d\n", cpu, error);
+   tsk = kthread_run(async_enable_nonboot_cpus,
+   NULL, "async-enable-nonboot-cpus");
+   if (IS_ERR(tsk)) {
+   pr_err("Failed to create async enable nonboot cpus thread.\n");
+   goto fail;
}
+   return;
 
+fail:
+   /*
+* If we fail to bring up the first frozen cpu or to
+* start the async thread, enable the rest of the frozen
+* cpus on the boot cpu.
+*/
+   arch_enable_nonboot_cpus_begin();
+   for_each_cpu(cpu, frozen_cpus) {
+   _cpu_up_for_pm(cpu);
+   }
arch_enable_nonboot_cpus_end();
-
cpumask_clear(frozen_cpus);
 out:
cpu_maps_update_done();
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



Re: [PATCH v2] i2c: move acpi code back into the core

2014-09-25 Thread Lan Tianyu
On 2014-09-25 05:36, Wolfram Sang wrote:
> Commit 5d98e61d337c ("I2C/ACPI: Add i2c ACPI operation region support")
> renamed the i2c-core module. This may cause regressions for
> distributions, so put the ACPI code back into the core.
> 
> Reported-by: Jean Delvare 
> Signed-off-by: Wolfram Sang 
> Cc: Mika Westerberg 
> Cc: Lan Tianyu 
> Cc: Jean Delvare 
> ---
> 
> v2: - make all acpi functions static
> - annotate #endif
> - remove declarations in i2c.h
> - add dummy functions to i2c-core.c
> - make two seperate #ifdef blocks instead of two nested ones
>   (otherwise build errors due to no external decl. anymore)
> 
> Mika, Lan, Jean: please test/review. I am still waiting for some testbot
> results, yet your audit is very wanted.

Hi Wolfram:
I have tested battery function and it worked normally with this patch.

Tested-by: Lan Tianyu 

> 
> 
>  MAINTAINERS|   1 -
>  drivers/i2c/Makefile   |   5 +-
>  drivers/i2c/i2c-acpi.c | 364 -
>  drivers/i2c/i2c-core.c | 354 +++
>  include/linux/i2c.h|  16 ---
>  5 files changed, 355 insertions(+), 385 deletions(-)
>  delete mode 100644 drivers/i2c/i2c-acpi.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 809ecd680d88..e3682d0dea1e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4477,7 +4477,6 @@ M:  Mika Westerberg 
> 
>  L:   linux-...@vger.kernel.org
>  L:   linux-a...@vger.kernel.org
>  S:   Maintained
> -F:   drivers/i2c/i2c-acpi.c
>  
>  I2C-TAOS-EVM DRIVER
>  M:   Jean Delvare 
> diff --git a/drivers/i2c/Makefile b/drivers/i2c/Makefile
> index e0228b228256..1722f50f2473 100644
> --- a/drivers/i2c/Makefile
> +++ b/drivers/i2c/Makefile
> @@ -2,11 +2,8 @@
>  # Makefile for the i2c core.
>  #
>  
> -i2ccore-y := i2c-core.o
> -i2ccore-$(CONFIG_ACPI)   += i2c-acpi.o
> -
>  obj-$(CONFIG_I2C_BOARDINFO)  += i2c-boardinfo.o
> -obj-$(CONFIG_I2C)+= i2ccore.o
> +obj-$(CONFIG_I2C)+= i2c-core.o
>  obj-$(CONFIG_I2C_SMBUS)  += i2c-smbus.o
>  obj-$(CONFIG_I2C_CHARDEV)+= i2c-dev.o
>  obj-$(CONFIG_I2C_MUX)+= i2c-mux.o
> diff --git a/drivers/i2c/i2c-acpi.c b/drivers/i2c/i2c-acpi.c
> deleted file mode 100644
> index 0dbc18c15c43..
> --- a/drivers/i2c/i2c-acpi.c
> +++ /dev/null
> @@ -1,364 +0,0 @@
> -/*
> - * I2C ACPI code
> - *
> - * Copyright (C) 2014 Intel Corp
> - *
> - * Author: Lan Tianyu 
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> - * or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> - * for more details.
> - */
> -#define pr_fmt(fmt) "I2C/ACPI : " fmt
> -
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -
> -struct acpi_i2c_handler_data {
> - struct acpi_connection_info info;
> - struct i2c_adapter *adapter;
> -};
> -
> -struct gsb_buffer {
> - u8  status;
> - u8  len;
> - union {
> - u16 wdata;
> - u8  bdata;
> - u8  data[0];
> - };
> -} __packed;
> -
> -static int acpi_i2c_add_resource(struct acpi_resource *ares, void *data)
> -{
> - struct i2c_board_info *info = data;
> -
> - if (ares->type == ACPI_RESOURCE_TYPE_SERIAL_BUS) {
> - struct acpi_resource_i2c_serialbus *sb;
> -
> - sb = &ares->data.i2c_serial_bus;
> - if (sb->type == ACPI_RESOURCE_SERIAL_TYPE_I2C) {
> - info->addr = sb->slave_address;
> - if (sb->access_mode == ACPI_I2C_10BIT_MODE)
> - info->flags |= I2C_CLIENT_TEN;
> - }
> - } else if (info->irq < 0) {
> - struct resource r;
> -
> - if (acpi_dev_resource_interrupt(ares, 0, &r))
> - info->irq = r.start;
> - }
> -
> - /* Tell the ACPI core to skip this resource */
> - return 1;
> -}
> -
> -static acpi_status acpi_i2c_add_device(acpi_handle handle, u32 level,
> -void *data, void **return_value)
> -{
> - struct i2c_adapter *adapter = data;
> - struct list_head resource_list;
> -   

[tip:x86/cpu] x86/smpboot: Speed up suspend/resume by avoiding 100ms sleep for CPU offline during S3

2014-09-24 Thread tip-bot for Lan Tianyu
Commit-ID:  2ed53c0d6cc99fc712f7c037e41d9ec4eb8d6b08
Gitweb: http://git.kernel.org/tip/2ed53c0d6cc99fc712f7c037e41d9ec4eb8d6b08
Author: Lan Tianyu 
AuthorDate: Tue, 26 Aug 2014 15:43:45 +0800
Committer:  Ingo Molnar 
CommitDate: Wed, 24 Sep 2014 15:02:06 +0200

x86/smpboot: Speed up suspend/resume by avoiding 100ms sleep for CPU offline during S3

With certain kernel configurations, CPU offline consumes more than
100ms during S3.

It's a timing related issue: native_cpu_die() would occasionally fall
into a 100ms sleep when the CPU idle loop thread marked the CPU state
to DEAD too slowly.

What native_cpu_die() does is that it polls the CPU state and waits
for 100ms if CPU state hasn't been marked to DEAD. The 100ms sleep
doesn't make sense and is purely historic.

To avoid such long sleeping, this patch adds a 'struct completion'
to each CPU, waits for the completion in native_cpu_die() and wakes
up the completion when the CPU state is marked to DEAD.

Tested on an Intel Xeon server with 48 cores, Ivybridge and on
Haswell laptops. The CPU offlining cost on these machines is
reduced from more than 100ms to less than 5ms. The system
suspend time is reduced by 2.3s on the servers.

Borislav and Prarit also helped to test the patch on an AMD
machine and a few systems of various sizes and configurations
(multi-socket, single-socket, no hyper threading, etc.). No
issues were seen.

Tested-by: Prarit Bhargava 
Signed-off-by: Lan Tianyu 
Acked-by: Borislav Petkov 
Cc: Peter Zijlstra 
Cc: srost...@redhat.com
Cc: toshi.k...@hp.com
Cc: imamm...@redhat.com
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1409039025-32310-1-git-send-email-tianyu@intel.com
[ Improved a few minor details in the code, cleaned up the changelog. ]
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/smpboot.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 2d872e0..fdbc5fc 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -102,6 +102,8 @@ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, 
cpu_llc_shared_map);
 DEFINE_PER_CPU_SHARED_ALIGNED(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
 
+static DEFINE_PER_CPU(struct completion, die_complete);
+
 atomic_t init_deasserted;
 
 /*
@@ -1323,26 +1325,24 @@ int native_cpu_disable(void)
return ret;
 
clear_local_APIC();
-
+   init_completion(&per_cpu(die_complete, smp_processor_id()));
cpu_disable_common();
+
return 0;
 }
 
 void native_cpu_die(unsigned int cpu)
 {
/* We don't do anything here: idle task is faking death itself. */
-   unsigned int i;
+   wait_for_completion_timeout(&per_cpu(die_complete, cpu), HZ);
 
-   for (i = 0; i < 10; i++) {
-   /* They ack this in play_dead by setting CPU_DEAD */
-   if (per_cpu(cpu_state, cpu) == CPU_DEAD) {
-   if (system_state == SYSTEM_RUNNING)
-   pr_info("CPU %u is now offline\n", cpu);
-   return;
-   }
-   msleep(100);
+   /* They ack this in play_dead() by setting CPU_DEAD */
+   if (per_cpu(cpu_state, cpu) == CPU_DEAD) {
+   if (system_state == SYSTEM_RUNNING)
+   pr_info("CPU %u is now offline\n", cpu);
+   } else {
+   pr_err("CPU %u didn't die...\n", cpu);
}
-   pr_err("CPU %u didn't die...\n", cpu);
 }
 
 void play_dead_common(void)
@@ -1354,6 +1354,7 @@ void play_dead_common(void)
mb();
/* Ack it */
__this_cpu_write(cpu_state, CPU_DEAD);
+   complete(&per_cpu(die_complete, smp_processor_id()));
 
/*
 * With physical CPU hotplug, we should halt the cpu


Re: [PATCH] i2c: move acpi code back into the core

2014-09-23 Thread Lan Tianyu
On 2014-09-23 15:09, Lan Tianyu wrote:
> On 2014-09-23 13:39, Wolfram Sang wrote:
>>
>>> Sorry for later response due to sickness. I can't write this patch in
>>> time. Sorry again. I will test it soon.
>>
>> Oh, get well soon! Please say so next time, so I know.
>>
> 
> Thanks. OK.
> I have put the patch into test system and need to wait some time.
> 

Hi Wolfram:
The patch passed the following 110 configuration build tests in the
0-day build test system.


configs tested: 110

mips allmodconfig
mips   jz4740
mips  allnoconfig
mips  fuloong2e_defconfig
mips txx9
x86_64allnoconfig
shtitan_defconfig
sh  rsk7269_defconfig
sh  sh7785lcr_32bit_defconfig
shallnoconfig
x86_64 randconfig-c3-0923
x86_64 randconfig-c1-0923
x86_64 randconfig-c2-0923
x86_64 randconfig-c0-0923
arm   omap2plus_defconfig
armsa1100
arm  allmodconfig
arm  prima2_defconfig
arm at91_dt_defconfig
arm   allnoconfig
arm   samsung
arm   spear13xx_defconfig
arm s3c2410_defconfig
arm  iop-adma
arm   imx_v6_v7_defconfig
arm   mmp
arm   tegra_defconfig
arm  arm5
arm  marzen_defconfig
arm  at_hdmac
armep93xx
armsh
arm arm67
pariscc3000_defconfig
parisc b180_defconfig
parisc  defconfig
alpha   defconfig
pariscallnoconfig
powerpc  iss476-smp_defconfig
powerpc mpc8315_rdb_defconfig
mips  ath79_defconfig
arm bcm_defconfig
powerpc   gef_ppc9a_defconfig
i386  randconfig-ha0-0923
i386  randconfig-ha1-0923
i386  randconfig-ha2-0923
i386  randconfig-ha3-0923
ia64 allmodconfig
ia64  allnoconfig
ia64defconfig
ia64 alldefconfig
microblaze  mmu_defconfig
microblazenommu_defconfig
microblaze   allyesconfig
s390 allmodconfig
s390  allnoconfig
s390defconfig
x86_64randconfig-ha1-0923
x86_64randconfig-ha3-0923
x86_64randconfig-ha0-0923
x86_64randconfig-ha2-0923
xtensa   common_defconfig
m32r   m32104ut_defconfig
xtensa  iss_defconfig
m32r opsput_defconfig
m32r   usrv_defconfig
m32r mappi3.smp_defconfig
sparc   defconfig
sparc64   allnoconfig
sparc64 defconfig
i386 allyesconfig
cris etrax-100lx_v2_defconfig
blackfin  TCM-BF537_defconfig
blackfinBF561-EZKIT-SMP_defconfig
blackfinBF533-EZKIT_defconfig
blackfinBF526-EZBRD_defconfig
i386   randconfig-r0-0923
i386   randconfig-r2-0923
i386   randconfig-r3-0923
i386   randconfig-r1-0923
x86_64 randconfig-s0-09231616
x86_64 randconfig-s1-09231616
x86_64   allmodconfig
i386 alldefconfig
i386 allmodconfig
i386  allnoconfig
i386defconfig
i386 randconfig-x009-0922
x86_64   randconfig-x003-0922
x86_64   randconfig-x004-0922
i386 randconfig-x006-0922
x86_64   randconfig-x001-0922
i386 randconfig-x005-0922
x86_64   randconfi
