Re: [libvirt] Yet another RFC for CAT

2017-09-12 Thread Eli Qiao
>
>
>
> We didn't want to exec external python programs because that certainly
> *does* have bad scalability, terrible error reporting facilities and
> need to parse ill defined data formats from stdout, etc. It doesn't
> magically solve the complexity, just moves it elsewhere where we have
> less ability to tailor it to fit into libvirt's model.
>
>
>
BTW, to clarify, RMD is not written in Python, it's Go, and it's not just
a tool; it's a running service (agent) on the host that provides a RESTful
API over a Unix/TCP socket, and it is much smarter (policy based) than
static allocation: it supports re-enforcement based on monitoring data
(cache usage).

It aims to be the single interface for everyone who wants to operate on
/sys/fs/resctrl, and it even supports older kernels (pre-4.10) which have
no /sys/fs/resctrl (through MSRs).

It supports not only VMs but all kinds of workloads/CPUs/containers.
BR - Eli


> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>

Re: [libvirt] Yet another RFC for CAT

2017-09-12 Thread Daniel P. Berrange
On Tue, Sep 12, 2017 at 11:33:53AM +0200, Martin Kletzander wrote:
> On Thu, Sep 07, 2017 at 11:02:21AM +0800, 乔立勇(Eli Qiao) wrote:
> > > I'm concerned about the idea of not checking 'from' for collisions,
> > > if there's allowed a mix of guests with & without 'from'.
> > > 
> > > eg consider
> > > 
> > >  * Initially 24 MB of cache is free, starting at 8MB
> > >  * run guest A   from=8M, size=8M
> > >  * run guest B   size=8M
> > >  => libvirt sets from=16M, so doesn't clash with A
> > >  * stop guest A
> > >  * run guest C   size=8M
> > >  => libvirt sets from=8M, so doesn't clash with B
> > >  * restart guest A
> > >  => now clashes with guest C, whereas if you had
> > > left guest A running, then C would have
> > > got from=24MB and avoided clash
> > > 
> > > IOW, if we're to allow users to set 'from', I think we need to
> > > have an explicit flag to indicate whether this is an exclusive
> > > or shared allocation. That way guest A would set 'exclusive',
> > > and so at least see an error when it got a clash with guest
> > > C in the example.
> > > 
> > 
> > +1
> > 
> 
> OK, I didn't like the exclusive/shared allocation at first when I
> thought about it, but getting back to it it looks like it could save us
> from unwanted behaviour.  I think that if you are setting up stuff like
> this you should already know what you will be running and where.  I
> thought that specifying 'from' also means that it can be shared because
> you have an idea about the machines running on the host.  But that's not
> nice and user friendly.
> 
> What I'm concerned about is the difference between the following
> scenarios:
> 
> * run guest A  from=0M size=8M allocation=exclusive
> * run guest B  from=0M size=8M allocation=shared
> 
> and
> 
> * run guest A  from=0M size=8M allocation=shared
> * run guest B  from=0M size=8M allocation=shared
> 
> When starting guest B, how do you know whether to error out or not?  I'm
> not considering collecting information on all domains as that "does not
> scale" (as the cool kids would say it these days).  The only idea I have
> is naming the group accordingly, e.g.:
> 
>  libvirt-qemu-domain-3-testvm-vcpu0-3+emu+io2-7+shared/
> 
> gross name, but users immediately know what it is for.

I think we have to track usage state across all VMs globally. Think of
this in the same way as we think of VNC port allocation, or NWFilter
creation, or PCI/USB device allocation. These are all global resources
and we have to track them as such. I don't see any way to avoid that
in the cache mgmt either.  I don't really see a big problem with scalability
here as the logic we'd have to run to acquire/release allocations is not
going to be computationally expensive, so contention on the mutexes during
startup should be light.
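
To make that concrete, here is a minimal, purely illustrative sketch (in
Python, not libvirt code; the MB granularity, names and the exclusive/shared
flag handling are my own assumptions) of the kind of global tracking
described above: one shared table of allocations guarded by a mutex,
rejecting overlaps whenever either side asked for exclusivity:

  import threading

  # Global table of active allocations, keyed by domain name.  Each entry
  # is (from_mb, size_mb, exclusive).  Conceptually the same pattern as for
  # other global resources such as VNC ports or PCI devices.
  _allocations = {}
  _lock = threading.Lock()

  def _overlaps(a_from, a_size, b_from, b_size):
      return a_from < b_from + b_size and b_from < a_from + a_size

  def acquire(domain, from_mb, size_mb, exclusive):
      """Register an allocation, failing if it clashes with an existing one."""
      with _lock:
          for other, (o_from, o_size, o_excl) in _allocations.items():
              if _overlaps(from_mb, size_mb, o_from, o_size) and (exclusive or o_excl):
                  raise ValueError("%s clashes with %s" % (domain, other))
          _allocations[domain] = (from_mb, size_mb, exclusive)

  def release(domain):
      """Forget the allocation when the domain stops."""
      with _lock:
          _allocations.pop(domain, None)

With something like this, the A/B/C scenario above fails fast: restarting
guest A with exclusive=True raises an error instead of silently clashing
with guest C.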

> > > > - After starting a domain, fill in any missing information about the
> > > >   allocation (I'm generalizing here, but for now it would only be the
> > > >   optional "from" attribute)
> > > >
> > > > - Add settings not only for vCPUs, but also for other threads as we do
> > > >   with pinning, schedulers, etc.
> > > 
> > > 
> > Thanks Martin for proposing this again.
> > 
> > I started this RFC at the beginning of the year and made several
> > initial patches, but they failed to get merged.
> > 
> > Meanwhile, I (together with my team) have recently started a piece of
> > software, a "Resource Management Daemon" (RMD), to manage resources like
> > the last level cache: it does cache allocation and cache usage
> > monitoring, accepts REST API requests over TCP/Unix sockets, and talks
> > to the /sys/fs/resctrl interface to manage all the CAT details.
> > 
> > RMD hides the complexity of using CAT, and it supports not only VMs
> > but also other applications and containers.
> > 
> > RMD will be open sourced within weeks and could be leveraged by libvirt
> > or other management software that wants fine-grained control of these
> > resources.
> > 
> > We have done an integration POC with OpenStack Nova, and would like to
> > get that integrated too.
> > 
> > Would like to see if libvirt can integrate with RMD too.
> > 
> 
> I'm afraid there was such an effort from Marcelo, called resctrltool,
> but it was rejected for some reason.  Daniel could elaborate if you'd
> like to know more; I can't really recall the reasoning behind it.

We didn't want to exec external python programs because that certainly
*does* have bad scalability, terrible error reporting facilities and
need to parse ill defined data formats from stdout, etc. It doesn't
magically solve the complexity, just moves it elsewhere where we have
less ability to tailor it to fit into libvirt's model.


Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|


Re: [libvirt] Yet another RFC for CAT

2017-09-12 Thread Martin Kletzander

On Thu, Sep 07, 2017 at 11:02:21AM +0800, 乔立勇(Eli Qiao) wrote:

2017-09-04 23:57 GMT+08:00 Daniel P. Berrange :


On Mon, Sep 04, 2017 at 04:14:00PM +0200, Martin Kletzander wrote:
> * The current design (finally something libvirt-related, right?)
>
> The discussion ended with a conclusion of the following (with my best
> knowledge, there were so many discussions about so many things that I
> would spend too much time looking up all of them):
>
> - Users should not need to specify bit masks, such complexity should be
>   abstracted.  We'll use sizes (e.g. 4MB)
>
> - Multiple vCPUs might need to share the same allocation.
>
> - Exclusivity of allocations is to be assumed, that is only unoccupied
>   cache should be used for new allocations.
>
> The last point seems trivial but it's actually a very specific condition
> that, if removed, can cause several problems.  If it's hard to grasp the
> last point together with the second one, you're on the right track.  If
> not, then I'll try to make a point for why the last point should be
> removed in 3... 2... 1...
>
> * Design flaws


> 1) Users cannot specify any allocation that would share only part with
>some other allocation of the domain or the default group.
>



Yep, there's no support for shared cache ways.

I was thinking of creating a cache resource group in libvirt into which
users can add VMs; this is good for those who would like to share cache
resources, maybe the NFV case.

but for a case like:

VM1: fff00
VM2: 00fff
which share an `f` (4 cache ways), it does not seem really meaningful;
at least, I haven't heard that we have that case.  This was mentioned by
Marcelo Tosatti before too.



That's kind of what I was aiming at.  I'm not that sure about it, but we
don't need to go out of our way to support such use cases, just make
sure we don't bite our future selves in the back in case there is a
similar case.  I believe this concern is solved below.


2) It was not specified what to do with the default resource group.
>There might be several ways to approach this, with varying pros and
>cons:
>
> a) Treat it as any other group.  That is any bit set for this group
>will be excluded from usable bits when creating new allocation
>for a domain.
>
> - Very predictable behaviour
>
> - You will not be able to allocate any amount of cache without
>   previous setting for the default group as that will have all
>   the bits set which will make all the cache unusable
>
> b) Automatically remove the appropriate amount of bits that are
>needed for new domains.
>
> - No need to do any change to the system settings in order to
>   use this new feature
>
> - We would have to change system settings, which is generally
>   frowned upon when done "automatically" as a side effect of
>   starting a domain, especially for such scarce resource as
>   cache
>
> - The change to system settings would not be entirely
>   predictable
>
> c) Act like it doesn't exist and don't remove its allocations from
>consideration
>
> - Doesn't really make sense as system processes might be
>   trashing the cache as any VM, moreover when all VM processes
>   without allocations will be based in the default group as
>   well
>
> 3) There is no way for users to know what the particular settings are
>for any running domain.



I think you are going to expose what the current CBM looks like for
a given VM? That's fair enough.



Just updating the 'from' in case it wasn't specified should serve the
same purpose.  So yes, basically.




>
> The first point was deemed a corner case.  Fair enough on its own, but
> considering point 2 and its solutions, it is rather difficult for me to
> justify it.  Also, let's say you have a domain with 4 vCPUs out of which
> you know 1 might be trashing the cache, but you don't want to restrict
> it completely, but others will utilize it very nicely.  Sensible
> allocations for such domain's vCPUs might be:
>
>  vCPU  0:   000f
>  vCPUs 1-3: ffff
>
> as you want vCPUs 1-3 to utilize even the part of cache that might get
> trashed by vCPU 0.  Or they might share some data (especially
> guest-memory-related).
>
> The case above is not possible to set up with only per-vcpu(s) scalar
> setting.  And there are more as you might imagine now.  For example how
> do we behave with iothreads and emulator threads?

This is kind of hard to implement, but possible.


Is there a 1:1 mapping of resource groups to VMs?



No, one resource allocation is one group; no unnecessary ones will be
created (due to the low number of CLOSids).


If you want iothreads and emulator threads to have separate cache
allocations, you may need to create resource groups associated with the
VM's vCPUs, iothreads, and emulator thread.



If the user specifies two allocations, one for vcpu0, emulator and
iothreads 1-4, 

Re: [libvirt] Yet another RFC for CAT

2017-09-06 Thread Eli Qiao
2017-09-04 23:57 GMT+08:00 Daniel P. Berrange :

> On Mon, Sep 04, 2017 at 04:14:00PM +0200, Martin Kletzander wrote:
> > * The current design (finally something libvirt-related, right?)
> >
> > The discussion ended with a conclusion of the following (with my best
> > knowledge, there were so many discussions about so many things that I
> > would spend too much time looking up all of them):
> >
> > - Users should not need to specify bit masks, such complexity should be
> >   abstracted.  We'll use sizes (e.g. 4MB)
> >
> > - Multiple vCPUs might need to share the same allocation.
> >
> > - Exclusivity of allocations is to be assumed, that is only unoccupied
> >   cache should be used for new allocations.
> >
> > The last point seems trivial but it's actually a very specific condition
> > that, if removed, can cause several problems.  If it's hard to grasp the
> > last point together with the second one, you're on the right track.  If
> > not, then I'll try to make a point for why the last point should be
> > removed in 3... 2... 1...
> >
> > * Design flaws
>
>
> > 1) Users cannot specify any allocation that would share only part with
> >some other allocation of the domain or the default group.
> >
>

Yep, there's no support for shared cache ways.

I was thinking of creating a cache resource group in libvirt into which
users can add VMs; this is good for those who would like to share cache
resources, maybe the NFV case.

but for a case like:

VM1: fff00
VM2: 00fff
which share an `f` (4 cache ways), it does not seem really meaningful;
at least, I haven't heard that we have that case.  This was mentioned by
Marcelo Tosatti before too.

> 2) It was not specified what to do with the default resource group.
> >There might be several ways to approach this, with varying pros and
> >cons:
> >
> > a) Treat it as any other group.  That is any bit set for this group
> >will be excluded from usable bits when creating new allocation
> >for a domain.
> >
> > - Very predictable behaviour
> >
> > - You will not be able to allocate any amount of cache without
> >   previous setting for the default group as that will have all
> >   the bits set which will make all the cache unusable
> >
> > b) Automatically remove the appropriate amount of bits that are
> >needed for new domains.
> >
> > - No need to do any change to the system settings in order to
> >   use this new feature
> >
> > - We would have to change system settings, which is generally
> >   frowned upon when done "automatically" as a side effect of
> >   starting a domain, especially for such scarce resource as
> >   cache
> >
> > - The change to system settings would not be entirely
> >   predictable
> >
> > c) Act like it doesn't exist and don't remove its allocations from
> >consideration
> >
> > - Doesn't really make sense as system processes might be
> >   trashing the cache as any VM, moreover when all VM processes
> >   without allocations will be based in the default group as
> >   well
> >
> > 3) There is no way for users to know what the particular settings are
> >for any running domain.
>

I think you are going to expose what the current CBM looks like for
a given VM? That's fair enough.


> >
> > The first point was deemed a corner case.  Fair enough on its own, but
> > considering point 2 and its solutions, it is rather difficult for me to
> > justify it.  Also, let's say you have a domain with 4 vCPUs out of which
> > you know 1 might be trashing the cache, but you don't want to restrict
> > it completely, but others will utilize it very nicely.  Sensible
> > allocations for such domain's vCPUs might be:
> >
> >  vCPU  0:   000f
> >  vCPUs 1-3: ffff
> >
> > as you want vCPUs 1-3 to utilize even the part of cache that might get
> > trashed by vCPU 0.  Or they might share some data (especially
> > guest-memory-related).
> >
> > The case above is not possible to set up with only per-vcpu(s) scalar
> > setting.  And there are more as you might imagine now.  For example how
> > do we behave with iothreads and emulator threads?
>
> This is kind of hard to implement, but possible.

Is there a 1:1 mapping of resource groups to VMs?

If you want iothreads and emulator threads to have separate cache
allocations, you may need to create resource groups associated with the
VM's vCPUs, iothreads, and emulator thread.

But the number of COSes is limited; is it worth having such fine-grained
control?


> Ok, I see what you're getting at.  I've actually forgotten what
> our current design looks like though :-)
>
> What level of granularity were we allowing within a guest ?
> All vCPUs use separate cache regions from each other, or all
> vCPUs use a shared cache region, but separate from other guests,
> or a mix ?
>
> > * My suggestion:
> >
> > - Provide an API for querying and changing the 

Re: [libvirt] Yet another RFC for CAT

2017-09-04 Thread Daniel P. Berrange
On Mon, Sep 04, 2017 at 04:14:00PM +0200, Martin Kletzander wrote:
> * The current design (finally something libvirt-related, right?)
> 
> The discussion ended with a conclusion of the following (with my best
> knowledge, there were so many discussions about so many things that I
> would spend too much time looking up all of them):
> 
> - Users should not need to specify bit masks, such complexity should be
>   abstracted.  We'll use sizes (e.g. 4MB)
> 
> - Multiple vCPUs might need to share the same allocation.
> 
> - Exclusivity of allocations is to be assumed, that is only unoccupied
>   cache should be used for new allocations.
> 
> The last point seems trivial but it's actually a very specific condition
> that, if removed, can cause several problems.  If it's hard to grasp the
> last point together with the second one, you're on the right track.  If
> not, then I'll try to make a point for why the last point should be
> removed in 3... 2... 1...
> 
> * Design flaws
> 
> 1) Users cannot specify any allocation that would share only part with
>some other allocation of the domain or the default group.
> 
> 2) It was not specified what to do with the default resource group.
>There might be several ways to approach this, with varying pros and
>cons:
> 
> a) Treat it as any other group.  That is any bit set for this group
>will be excluded from usable bits when creating new allocation
>for a domain.
> 
> - Very predictable behaviour
> 
> - You will not be able to allocate any amount of cache without
>   previous setting for the default group as that will have all
>   the bits set which will make all the cache unusable
> 
> b) Automatically remove the appropriate amount of bits that are
>needed for new domains.
> 
> - No need to do any change to the system settings in order to
>   use this new feature
> 
> - We would have to change system settings, which is generally
>   frowned upon when done "automatically" as a side effect of
>   starting a domain, especially for such scarce resource as
>   cache
> 
> - The change to system settings would not be entirely
>   predictable
> 
> c) Act like it doesn't exist and don't remove its allocations from
>consideration
> 
> - Doesn't really make sense as system processes might be
>   trashing the cache as any VM, moreover when all VM processes
>   without allocations will be based in the default group as
>   well
> 
> 3) There is no way for users to know what the particular settings are
>for any running domain.
> 
> The first point was deemed a corner case.  Fair enough on its own, but
> considering point 2 and its solutions, it is rather difficult for me to
> justify it.  Also, let's say you have a domain with 4 vCPUs out of which
> you know 1 might be trashing the cache, but you don't want to restrict
> it completely, but others will utilize it very nicely.  Sensible
> allocations for such domain's vCPUs might be:
> 
>  vCPU  0:   000f
>  vCPUs 1-3: ffff
> 
> as you want vCPUs 1-3 to utilize even the part of cache that might get
> trashed by vCPU 0.  Or they might share some data (especially
> guest-memory-related).
> 
> The case above is not possible to set up with only per-vcpu(s) scalar
> setting.  And there are more as you might imagine now.  For example how
> do we behave with iothreads and emulator threads?

Ok, I see what you're getting at.  I've actually forgotten what
our current design looks like though :-)

What level of granularity were we allowing within a guest ?
All vCPUs use separate cache regions from each other, or all
vCPUs use a shared cache region, but separate from other guests,
or a mix ?

> * My suggestion:
> 
> - Provide an API for querying and changing the allocation of the
>   default resource group.  This would be similar to setting and
>   querying hugepage allocations (see virsh's freepages/allocpages
>   commands).

Reasonable
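
For reference, querying/changing the default group's allocation boils down
to reading and writing /sys/fs/resctrl/schemata, as described in the
original mail.  A rough sketch (Python, purely illustrative; the function
names are made up, error handling and multiple cache ids are ignored):

  RESCTRL = "/sys/fs/resctrl"

  def read_default_l3_schemata():
      # The default group lives directly in /sys/fs/resctrl; its "schemata"
      # file contains one line per resource, e.g. "L3:0=ffff".
      with open(RESCTRL + "/schemata") as f:
          for line in f:
              line = line.strip()
              if line.startswith("L3:"):
                  return line
      return None

  def write_default_l3_schemata(line):
      # Writing e.g. "L3:0=00ff" shrinks what the default group may use.
      with open(RESCTRL + "/schemata", "w") as f:
          f.write(line + "\n")

The libvirt API on top of this is of course still to be designed.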

> - Let users specify the starting position in addition to the size, i.e.
>   not only specifying "size", but also "from".  If "from" is not
>   specified, the whole allocation must be exclusive.  If "from" is
>   specified it will be set without checking for collisions.  The latter
>   needs them to query the system or know what settings are applied
>   (this should be the case all the time), but is better than adding
>   non-specific and/or meaningless exclusivity settings (how do you
>   specify part-exclusivity of the cache as in the example above)

I'm concerned about the idea of not checking 'from' for collisions,
if there's allowed a mix of guests with & without 'from'.

eg consider

 * Initially 24 MB of cache is free, starting at 8MB
 * run guest A   from=8M, size=8M
 * run guest B   size=8M
 => libvirt sets from=16M, so doesn't clash with A
 * stop guest A
 * run guest C   size=8M
 => libvirt sets from=8M, so doesn't

[libvirt] Yet another RFC for CAT

2017-09-04 Thread Martin Kletzander

Hello everyone.

For the last couple of weeks [1] I have been working on CAT for libvirt.  Only
clean-ups and minor things were pushed upstream, but as I'm getting
closer and closer to actual functionality I'm seeing a problem with our
current (already discussed and approved) design.  And I would like to
know your thoughts about this, even if you are not familiar with CAT,
feel free to keep the questions coming.

* Little bit of background about CAT

[I wanted to say "Long story short...", but after reading the mail in
its entirety before sending it I see it would end up like in "I Should
Have Never Gone Ziplining", so I'll rather let you brace for quite
elongated or, dare I say, endless stream of words]

Since the interface for CAT in the Linux kernel is quite hairy (together
with cache information reporting, don't even get me started on that) and
might feel pretty inconsistent if you are used to any Linux kernel
interface, I would like to summarize how it is used [2].  Feel free to
skip this part if you are familiar with it.

You can tune how much cache which processes can utilize.  Let's talk
only about L3 for now, also let's assume only unified caches (no
code/data prioritization).  For simplicity.

The cache is split into parts and when describing the allocation we use
hexadecimal representation of bit masks where each bit is the smallest
addressable (or rather allocable) part of the cache.  Let's say you have
a 16MB L3 cache which the CPU is able to allocate in chunks of 1MB, so the
allocation is represented by 16 bits => 4 hexadecimal characters.  Yes,
there can be a minimum number of contiguous bits that need to be specified, you
can have multiple L3 caches, etc., but that's yet another thing that's
not important to what I need to discuss.  The whole cache is then
referred to as "ffff" in this particular case.  Again, for simplicity
sake, let's assume the above hardware is constant in future examples.
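
As a throwaway illustration of that mapping (Python, assuming exactly the
16MB / 1MB-per-bit hardware above; this is not necessarily how libvirt
would compute it):

  CACHE_MB = 16        # total L3 size in the example
  GRANULARITY_MB = 1   # smallest allocable chunk
  NBITS = CACHE_MB // GRANULARITY_MB   # 16 bits => 4 hex characters

  def mask(from_mb, size_mb):
      """Contiguous bit mask for size_mb of cache starting at from_mb."""
      first_bit = from_mb // GRANULARITY_MB
      nbits = size_mb // GRANULARITY_MB
      value = ((1 << nbits) - 1) << first_bit
      return format(value, "0%dx" % (NBITS // 4))

  print(mask(0, 16))   # 'ffff' - the whole cache
  print(mask(0, 8))    # '00ff' - the lower 8MB, the default-group example below
  print(mask(4, 8))    # '0ff0' - 8MB from the middle, the second example below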

Now, when you want to work with the allocations, it behaves similarly
(not the same way, though) as cgroups.  The default group, which
contains all processes, is in /sys/fs/resctrl, and you can create
additional groups (directories under /sys/fs/resctrl).  These are flat,
not hierarchical, meaning they cannot have subdirectories.  Each
resource group represents a group of processes (PIDs are written in
"tasks" file) that share the same resource settings.  One of the
settings is the allocation of caches.  By default there are no
additional resource groups (subdirectories of /sys/fs/resctrl) and the
default one occupies all the cache.

(IIRC, all bit masks must have only consecutive bits, but I cannot find
this in the documentation; let's assume this as well, but feel free to
correct me)

* Example time (we're almost there)

Let's say you have the default group with this setting:

 L3:0=00ff

That is the allocation setting for the L3 cache, both code and data, cache id
0, and the occupancy rate is 50% (the lower 8MB of the only L3 cache in our
example, to be precise).

If you now create an additional resource group, let's say
"libvirt-qemu-3-alpine-vcpu3" (truly random name, right?) and set the
following allocation:

 L3:0=0ff0

That specifies it will also be allowed to use 8MB of the cache, but this
time from the middle.  Half of that will be shared between this group
and the default one; the rest is exclusive to this group only.
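
To tie that back to the interface described above, creating such a group is
just a few filesystem operations.  A minimal sketch (Python, no error
handling; the PID is obviously made up):

  import os

  group = "/sys/fs/resctrl/libvirt-qemu-3-alpine-vcpu3"

  os.makedirs(group, exist_ok=True)      # a new resource group is a new directory

  with open(group + "/schemata", "w") as f:
      f.write("L3:0=0ff0\n")             # the allocation from the example

  with open(group + "/tasks", "w") as f:
      f.write("12345\n")                 # move a (made-up) vCPU thread PID into the group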

* The current design (finally something libvirt-related, right?)

The discussion ended with a conclusion of the following (with my best
knowledge, there were so many discussions about so many things that I
would spend too much time looking up all of them):

- Users should not need to specify bit masks, such complexity should be
  abstracted.  We'll use sizes (e.g. 4MB)

- Multiple vCPUs might need to share the same allocation.

- Exclusivity of allocations is to be assumed, that is only unoccupied
  cache should be used for new allocations.

The last point seems trivial but it's actually a very specific condition
that, if removed, can cause several problems.  If it's hard to grasp the
last point together with the second one, you're on the right track.  If
not, then I'll try to make a point for why the last point should be
removed in 3... 2... 1...

* Design flaws

1) Users cannot specify any allocation that would share only part with
   some other allocation of the domain or the default group.

2) It was not specified what to do with the default resource group.
   There might be several ways to approach this, with varying pros and
   cons:

a) Treat it as any other group.  That is any bit set for this group
   will be excluded from usable bits when creating new allocation
   for a domain.

- Very predictable behaviour

- You will not be able to allocate any amount of cache without
  previous setting for the default group as that will have all
  the bits set which will make all the cache unusable

b) Automatically remove the appropriate amount of bits that are