Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-18 Thread Michael S. Tsirkin
On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:
> On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> > so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> > 4TB, the optimization can still offer 2TB free memory (better than no
> > optimization).
> > 
> > Maybe it's better, maybe it isn't. It certainly muddies the waters even 
> > more.
> > I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> > suggested for this which is to steal the necessary # of entries off the 
> > list.
> 
> Actually what Matthew suggested doesn't make a difference here. That method 
> always steals the first free page blocks, and sure, it can be changed to take 
> more. But all of this can be achieved via kmalloc

I'd do get_user_pages really. You don't want pages split, etc.

> by the caller which is more prudent and makes the code more straightforward. 
> I think we don't need to take that risk unless the MM folks strongly endorse 
> that approach.
> 
> The max size of the kmalloc-ed memory is 4MB, which gives us the limitation 
> that the max free memory to report is 2TB. Back to the motivation of this 
> work, the cloud guys want to use this optimization to accelerate their guest 
> live migration. 2TB guests are not common in today's clouds. When huge guests 
> become common in the future, we can easily tweak this API to fill hints into 
> scattered buffers (e.g. several 4MB arrays passed to this API) instead of one, 
> as in this version.
> 
> This limitation doesn't cause any issue from a functionality perspective. For 
> an extreme case like a 100TB guest live migration, which is theoretically 
> possible today, this optimization helps skip 2TB of its free memory. The 
> result is that it may reduce live migration time by only 2%, but that is still 
> better than not skipping the 2TB at all (which would be the case without the feature).

Not clearly better, no, since you are slowing the guest.


> So, for the first release of this feature, I think it is better to have the 
> simpler and more straightforward solution we have now, and clearly 
> document why it can report up to 2TB free memory.

No one has the time to read documentation about how an internal flag
within a device works. Come on, getting two pages isn't much harder
than a single one.

> 
>  
> > If that doesn't fly, we can allocate out of the loop and just retry with 
> > more
> > pages.
> > 
> > > On the other hand, large guests are large mostly because they need
> > to use a lot of memory. In that case, they usually won't have that much free
> > memory to report.
> > 
> > And following this logic, small guests don't have a lot of memory to report
> > at all.
> > Could you remind me why we are considering this optimization then?
> 
> If there is a 3TB guest, it is 3TB, not 2TB, mostly because it would need to 
> use e.g. 2.5TB of memory from time to time. In the worst case, it only has 0.5TB 
> of free memory to report, but reporting 0.5TB with this optimization is better 
> than no optimization. (And the current 2TB limitation isn't a limitation for 
> the 3TB guest in this case.)

I'd rather not spend time writing up random limitations.


> Best,
> Wei


Re: [PATCH 0/3] Use sbitmap instead of percpu_ida

2018-06-18 Thread Martin K. Petersen


Matthew,

>> Since most of the changes are in scsi or target, should I take this
>> series through my tree?
>
> I'd welcome that.  Nick seems to be inactive as target maintainer;
> his tree on kernel.org hasn't seen any updates in five months.

Applied to 4.19/scsi-queue, thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-18 Thread Wang, Wei W
On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> 4TB, the optimization can still offer 2TB free memory (better than no
> optimization).
> 
> Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> suggested for this which is to steal the necessary # of entries off the list.

Actually what Matthew suggested doesn't make a difference here. That method 
always steals the first free page blocks, and sure, it can be changed to take more. 
But all of this can be achieved via kmalloc by the caller, which is more prudent 
and makes the code more straightforward. I think we don't need to take that 
risk unless the MM folks strongly endorse that approach.

The max size of the kmalloc-ed memory is 4MB, which gives us the limitation 
that the max free memory to report is 2TB. Back to the motivation of this work, 
the cloud guys want to use this optimization to accelerate their guest live 
migration. 2TB guests are not common in today's clouds. When huge guests become 
common in the future, we can easily tweak this API to fill hints into scattered 
buffers (e.g. several 4MB arrays passed to this API) instead of one, as in this 
version.
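
For readers who want to check the arithmetic behind the 2TB figure, here is a
minimal sketch (my own illustration, not code from the patch). It assumes my
reading of min(4m_page_blocks / 512, 1024) above: the hint buffer is capped at
1024 4KB pages, each 4KB page holds 512 eight-byte hint entries, and every hint
covers one 4MB MAX_ORDER free page block:

#include <stdio.h>

int main(void)
{
	unsigned long long buf_pages = 1024;            /* capped by the min()       */
	unsigned long long hints     = buf_pages * 512; /* 8-byte hints per 4KB page */
	unsigned long long block     = 4ULL << 20;      /* 4MB free page block       */

	printf("hint buffer: %llu MB\n", (buf_pages * 4096) >> 20);   /* 4 MB */
	printf("max reportable free memory: %llu TB\n",
	       (hints * block) >> 40);                                /* 2 TB */
	return 0;
}

Under those assumptions, each additional 4MB scatter buffer would raise the cap
by another 2TB.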

This limitation doesn't cause any issue from a functionality perspective. For an 
extreme case like a 100TB guest live migration, which is theoretically possible 
today, this optimization helps skip 2TB of its free memory. The result is that 
it may reduce live migration time by only 2%, but that is still better than not 
skipping the 2TB at all (which would be the case without the feature).

So, for the first release of this feature, I think it is better to have the 
simpler and more straightforward solution we have now, and clearly document 
why it can report up to 2TB free memory.


 
> If that doesn't fly, we can allocate out of the loop and just retry with more
> pages.
> 
> > On the other hand, large guests are large mostly because they need
> to use a lot of memory. In that case, they usually won't have that much free
> memory to report.
> 
> And following this logic, small guests don't have a lot of memory to report at
> all.
> Could you remind me why we are considering this optimization then?

If there is a 3TB guest, it is 3TB, not 2TB, mostly because it would need to use 
e.g. 2.5TB of memory from time to time. In the worst case, it only has 0.5TB of 
free memory to report, but reporting 0.5TB with this optimization is better than no 
optimization. (And the current 2TB limitation isn't a limitation for the 3TB 
guest in this case.)

Best,
Wei


Re: Design Decision for KVM based anti rootkit

2018-06-18 Thread David Hildenbrand
On 18.06.2018 18:35, Ahmed Soliman wrote:
> Shortly after I sent the first email, we found that there is another
> way to achieve this kind of communication, via KVM hypercalls. I think
> they are underutilised in KVM, but they exist.
> 
> We also found that they are architecture dependent, but the advantage
> is that one doesn't need to create a QEMU <-> KVM interface.
> 
> So from our point of view, it is either having things easily compatible
> with many architectures out of the box (virtio) vs. compatibility with
> almost every front end, including QEMU and any other one, without
> modification (hypercalls)?

My gut feeling (I might of course be wrong) is that hypercalls will not
be accepted easily in KVM (I assume they are only accepted if something is
really highly specialized for e.g. x86, and/or required very early during
boot, and/or has very specific performance requirements - e.g. pvspinlocks
or kvmclock).

> 
> If that is the case, we might stick to hypercalls at the beginning,
> because they can be easily tested without modifying QEMU, and then
> later we can move to virtio if there turns out to be a clearer
> advantage, especially performance-wise.

Hypercalls might be good for prototyping, but I assume that the
challenging part will rather be a clean KVM MMU interface. And once you
have that, a kernel interface might not be too hard (I remember some
work being done by Malwarebytes).

> 
> Does that sound like a good idea?
> I wanted to make sure because maybe hypercalls aren't that much
> used in KVM for a reason, so I wanted to verify that.

I assume the same.

Another thing to note is performance: having to go via QEMU might not be
that good performance-wise (usually chunking is the answer to reduce the
overhead). But if it is really a "protect once, forget until reboot" thing,
that should not be relevant.

-- 

Thanks,

David / dhildenb


Re: Design Decision for KVM based anti rootkit

2018-06-18 Thread Ahmed Soliman
Shortly after I sent the first email, we found that there is another
way to achieve this kind of communication, via KVM hypercalls. I think
they are underutilised in KVM, but they exist.

We also found that they are architecture dependent, but the advantage
is that one doesn't need to create a QEMU <-> KVM interface.

So from our point of view, it is either having things easily compatible
with many architectures out of the box (virtio) vs. compatibility with
almost every front end, including QEMU and any other one, without
modification (hypercalls)?
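
To make the hypercall option a bit more concrete, here is a rough guest-side
sketch (x86 only, which also illustrates the architecture dependence). The
hypercall number KVM_HC_PROTECT_RO and its two-argument layout are made up for
this illustration; only kvm_hypercall2() from <asm/kvm_para.h> is the real
guest-side helper:

/*
 * Hypothetical guest module asking the host to write-protect one of its
 * pages until reboot. KVM_HC_PROTECT_RO is not an allocated hypercall
 * number; it is a placeholder for whatever the final interface would use.
 */
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/gfp.h>
#include <asm/kvm_para.h>

#define KVM_HC_PROTECT_RO	100	/* placeholder hypercall number */

static unsigned long page;

static int __init protect_ro_init(void)
{
	long ret;

	page = __get_free_page(GFP_KERNEL);
	if (!page)
		return -ENOMEM;

	/* pass the guest-physical address and the number of pages */
	ret = kvm_hypercall2(KVM_HC_PROTECT_RO, __pa(page), 1);
	pr_info("protect-ro hypercall returned %ld\n", ret);
	return 0;
}

static void __exit protect_ro_exit(void)
{
	/* intentionally not freed: the host keeps it read-only until reboot */
}

module_init(protect_ro_init);
module_exit(protect_ro_exit);
MODULE_LICENSE("GPL");

The host (KVM) side would then have to handle the new hypercall number and
write-protect the corresponding pages, which is the part that would not need
any QEMU changes.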

If that is the case, we might stick to hypercalls at the beginning,
because they can be easily tested without modifying QEMU, and then
later we can move to virtio if there turns out to be a clearer
advantage, especially performance-wise.

Does that sound like a good idea?
I wanted to make sure because maybe hypercalls aren't that much
used in KVM for a reason, so I wanted to verify that.

On 18 June 2018 at 16:34, David Hildenbrand  wrote:
> On 16.06.2018 13:49, Ahmed Soliman wrote:
>> Following up on these threads:
>> - https://marc.info/?l=kvm&m=151929803301378&w=2
>> - http://www.openwall.com/lists/kernel-hardening/2018/02/22/18
>>
>> I lost the original emails so I couldn't reply to them, and also sorry
>> for being late, it was the end of semester exams.
>>
>> I was advised on the #qemu and #kernelnewbies IRCs to ask here as it will
>> help with getting better insights.
>>
>> To wrap things up, the basic design will be a method for communication
>> between host and guest: the guest can request certain pages to be read-only,
>> and the host will then force them to be read-only for the guest until the
>> next guest reboot, so it will be impossible for the guest OS to have them
>> as RW again. The choice of which pages to set as read-only is the
>> guest's. This way, pages can still be mixed, containing R/W content
>> even if they also hold kernel code.
>>
>> I was planning to use KVM as my hypervisor, until I found out that KVM
>> can't do that on its own, so one will need a custom virtio driver to do
>> this kind of guest-host communication/coordination. I am still
>> sticking to KVM, and have no plans to do this for Xen, at least for
>> now. This means that in order to get things to work properly, QEMU must
>> support the specific driver we are planning to write.
>>
>> The question is: is this the right approach, or is there a simpler way
>> to achieve this goal?
>>
>
> Especially if you want to support multiple architectures in the long
> term, virtio is the way to go.
>
> Design an architecture independent and extensible (+configurable)
> interface and be happy :) This might of course require some thought.
>
> (and don't worry, implementing a virtio driver is a lot simpler than you
> might think)
>
> But be aware that the virtio "hypervisor" side will be handled in QEMU,
> so you'll need a proper QEMU->KVM interface to get things running.
>
> --
>
> Thanks,
>
> David / dhildenb


Re: Design Decision for KVM based anti rootkit

2018-06-18 Thread David Hildenbrand
On 16.06.2018 13:49, Ahmed Soliman wrote:
> Following up on these threads:
> - https://marc.info/?l=kvm&m=151929803301378&w=2
> - http://www.openwall.com/lists/kernel-hardening/2018/02/22/18
> 
> I lost the original emails so I couldn't reply to them, and also sorry
> for being late, it was the end of semester exams.
> 
> I was advised on the #qemu and #kernelnewbies IRCs to ask here as it will
> help with getting better insights.
> 
> To wrap things up, the basic design will be a method for communication
> between host and guest: the guest can request certain pages to be read-only,
> and the host will then force them to be read-only for the guest until the
> next guest reboot, so it will be impossible for the guest OS to have them
> as RW again. The choice of which pages to set as read-only is the
> guest's. This way, pages can still be mixed, containing R/W content
> even if they also hold kernel code.
> 
> I was planning to use KVM as my hypervisor, until I found out that KVM
> can't do that on its own, so one will need a custom virtio driver to do
> this kind of guest-host communication/coordination. I am still
> sticking to KVM, and have no plans to do this for Xen, at least for
> now. This means that in order to get things to work properly, QEMU must
> support the specific driver we are planning to write.
> 
> The question is: is this the right approach, or is there a simpler way
> to achieve this goal?
> 

Especially if you want to support multiple architectures in the long
term, virtio is the way to go.

Design an architecture independent and extensible (+configurable)
interface and be happy :) This might of course require some thought.

(and don't worry, implementing a virtio driver is a lot simpler than you
might think)
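
To illustrate how small the guest side can be, here is a skeleton for a
hypothetical "page-protect" virtio device; the device ID, driver name and
queue name are placeholders I made up (no such ID is assigned), and the rest
is the standard registration boilerplate:

/*
 * Skeleton virtio driver for a hypothetical page-protect device.
 * VIRTIO_ID_PGPROTECT is a placeholder, not an ID assigned by the spec.
 */
#include <linux/err.h>
#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

#define VIRTIO_ID_PGPROTECT	42	/* placeholder device ID */

static struct virtqueue *req_vq;

static int pgprotect_probe(struct virtio_device *vdev)
{
	/* a single request queue; no interrupt callback needed for this sketch */
	req_vq = virtio_find_single_vq(vdev, NULL, "requests");
	if (IS_ERR(req_vq))
		return PTR_ERR(req_vq);

	virtio_device_ready(vdev);
	return 0;
}

static void pgprotect_remove(struct virtio_device *vdev)
{
	vdev->config->reset(vdev);
	vdev->config->del_vqs(vdev);
}

static const struct virtio_device_id id_table[] = {
	{ VIRTIO_ID_PGPROTECT, VIRTIO_DEV_ANY_ID },
	{ 0 },
};

static struct virtio_driver pgprotect_driver = {
	.driver.name	= "virtio-pgprotect",
	.driver.owner	= THIS_MODULE,
	.id_table	= id_table,
	.probe		= pgprotect_probe,
	.remove		= pgprotect_remove,
};

module_virtio_driver(pgprotect_driver);
MODULE_DEVICE_TABLE(virtio, id_table);
MODULE_LICENSE("GPL");

The actual protect requests would then just be buffers queued on that
virtqueue and handled by the device side, which is the QEMU part mentioned
below.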

But be aware that the virtio "hypervisor" side will be handled in QEMU,
so you'll need a proper QEMU->KVM interface to get things running.

-- 

Thanks,

David / dhildenb


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-18 Thread Cornelia Huck
On Fri, 15 Jun 2018 15:31:43 +0300
"Michael S. Tsirkin"  wrote:

> On Fri, Jun 15, 2018 at 11:32:42AM +0200, Cornelia Huck wrote:
> > On Fri, 15 Jun 2018 05:34:24 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:  
> >   
> > > > > > I am not all that familiar with how Qemu manages network devices.
> > > > > > If we can do all the required management of the primary/standby
> > > > > > devices within Qemu, that is definitely a better approach without
> > > > > > upper layer involvement.
> > > > > 
> > > > > Right. I would imagine in the extreme case the upper layer doesn't
> > > > > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > > > > The management tool can supply the passthrough device and virtio with the
> > > > > same group UUID; QEMU auto-manages the presence of the primary and
> > > > > hot plugs the device as needed before or after the migration.
> > > > 
> > > > I do not really see how you can manage that kind of stuff in QEMU only. 
> > > >
> > > 
> > > So right now failover is limited to PCI passthrough devices only.
> > > The idea is to realize the vfio device but not expose it
> > > to the guest, and have a separate command to expose it to the guest.
> > > Hot-unplug would also hide it from the guest but not unrealize it.
> > 
> > So, this would not be real hot(un)plug, but 'hide it from the guest',
> > right? The concept of "we have it realized in QEMU, but the guest can't
> > discover and use it" should be translatable to non-pci as well (at
> > least for ccw).
> >   
> > > 
> > > This will help ensure that e.g. on migration failure we can
> > > re-expose the device without risk of running out of resources.  
> > 
> > Makes sense.
> > 
> > Should that 'hidden' state be visible/settable from outside as well
> > (e.g. via a property)? I guess yes, so that management software has a
> > chance to see whether a device is visible.  
> 
Might be handy for debugging, but note that since QEMU manages this
state it's transient: it can change at any time, so it's kind
of hard for management to rely on it.

Might be another reason to have this controlled by management software;
being able to find out easily why a device is not visible to the guest
seems to be a useful thing.

Anyway, let's defer this discussion until it is clear how we actually
want to handle the whole setup.

> 
> > Settable may be useful if we
> > find another use case for hiding realized devices.  
