On 11/18/2020 8:42 PM, Martin Kletzander wrote:
On Wed, Nov 18, 2020 at 05:40:04PM +0800, Zhong, Luyao wrote:


On 11/12/2020 5:27 PM, Martin Kletzander wrote:
On Thu, Nov 12, 2020 at 02:19:06PM +0800, Zhong, Luyao wrote:


On 11/9/2020 7:21 PM, Martin Kletzander wrote:
On Sat, Nov 07, 2020 at 10:41:52AM +0800, Zhong, Luyao wrote:


On 11/4/2020 9:02 PM, Martin Kletzander wrote:
On Fri, Oct 16, 2020 at 10:38:51PM +0800, Zhong, Luyao wrote:
On 10/16/2020 9:32 PM, Zang, Rui wrote:

How about if “migratable” is set, “mode” should be ignored/omitted?
So any setting of “mode” will be rejected with an error
indicating an
invalid configuration.
We can say in the doc that “migratable” and “mode” shall not be set
together. So even the default value of “mode” is not taken.

If "mode" is not set, it's the same as setting it to "strict" ('strict'
is the default value). There is a code detail involved: the value is
translated to an enumerated type, and it is 0 both when mode is not set
and when it is set to 'strict'. The code follows a fixed skeleton, so
it's not easy to modify.


Well, I see it as "strict". It does not mean "strict cgroup setting",
because cgroups are just one of the ways to enforce this.  Look at it
this way:

mode can be:
  - strict: only these nodes can be used for the memory
  - preferred: these nodes should be preferred, but allocation should
not fail
  - interleave: interleave the memory between these nodes

Due to the naming this maps to cgroup settings 1:1.


Sorry, I misspoke, this does not map to cgroup settings at all, in
cgroups you
can only set "strict" (by using cpuset.mems) and that's it.  There is no
way to
set preferred or interleaved mapping, sorry.

memory policy is independent of cpuset.mems


Yes.

I quote here "Memory policies should not be confused with cpusets
(Documentation/admin-guide/cgroup-v1/cpusets.rst) which is an
administrative mechanism for restricting the nodes from which memory may
be allocated by a set of processes. Memory policies are a programming
interface that a NUMA-aware application can take advantage of.

Pay attention to this part:

When both cpusets and policies are applied to a task, the restrictions
of the
cpuset takes priority.

See Memory Policies and cpusets below for more details."[1]

So using cpuset.mems does not mean setting a "strict" memory policy, if I
understand it correctly; we can combine cpuset.mems with any memory policy.


That's not how I understand that.  Sure, it's independent of memory
policy, but
if you do not specify memory policy (which keeps it as "default") and set
cpuset.mems, then the process will only be permitted to allocate memory
on NUMA
nodes specified in the file.

Yes, that doesn't conflict with what I was saying; it's one of the possible combinations.

For instance, we can also set cpuset.mems to "1,2" and use mbind() to set
the memory policy to MPOL_PREFERRED with preferred node "1"; that means we
will allocate pages from node 1 first, and then fall back to other
nodes (only node 2 in this case, since cpuset.mems restricts the memory
nodes) if the preferred node is low on free memory. If the preferred
node is "0", we will not allocate pages from node 0, since cpuset.mems
takes priority.

Do you mean cpuset.mems + MPOL_DEFAULT policy == MPOL_BIND policy? They
might be functionally similar when no specific policy is implemented in
the kernel, but I don't think they are exactly the same.


It's not the same, but has similar outcome.  If you do numa_set_membind() or set_mempolicy(MPOL_BIND) on nodes 2-3, then the memory allocations will only be done on those nodes.  If there is no space on them the allocation will fail.  If you write 2-3 into cpuset.mems, then similar thing happens: allocations will only be done on the mentioned nodes and will fail if there is not enough space.

If you are talking about mbind() in qemu, that is done *after* the allocation,
but *before* any write to that.  If that is anonymous memory or a private
file-backed memory, then writes that allocate will be allocated on the specified nodes.  If you write to cpuset.mems *after* the allocations were done, then they might be moved to those nodes based on cpuset.memory_migrate and couple of other
things.

The default policy indicates that we are using the policy implemented in
the kernel (transparent to the process), just like the memory
tiering (https://lwn.net/Articles/802544/) case I stated before. So
cpuset.mems + MPOL_DEFAULT policy is not MPOL_BIND policy in this case.


Yes, there are some other minor details between mpol_default and mpol_bind.  Are you referring to the fact that mpol_default prefers same node as the thread is running on whereas mpol_bind allocates on the node with the lowest ID from the set, no matter what node the thread requesting the allocation is running on,
etc.?

If yes, then sure, I completely agree that someone might want to use cpuset.mems with mpol_default.  That's what we're both trying to achieve here, right?  I
think the misunderstanding here comes from how each of us reads what the
libvirt XML is trying to describe.

yes

In libvirt XML we have the setting for "mode" which accepts possible values of
"strict", "interleave" and "preferred".  It *does not* say *how* that is
achieved (in fact, in the past we were using different mechanisms for that).  If
you say "default" in the terms of libvirt XML, it means the value used if no
setting was specified.  But since we do not expose "memory policy" per se, it is
very misleading to add such a value.  In my opinion it would also be very
difficult to describe its outcome with regard to the "mode".

Given what I described earlier, the main difference between cgroups and the libnuma
APIs is (at least from what I remember) that cgroup settings can be changed
(even if the API settings could be changed, we'd need cooperation from QEMU to change them at runtime).  And if cpuset.memory_migrate is set, the memory can be migrated to another node.  That was also requested a couple of times in the past, and that's
what I based my suggestion on.

I now understand better what you mean when you are looking at it from the other side needing to keep MPOL_DEFAULT, but with restricted set of NUMA nodes allowed for allocation.  You don't really care whether it is going to be migratable or not so that naming does not make sense for your use case (which also means it would not make sense for all of our users).  So my naming is bad too, probably
worse than your suggestion.

So let's finish this sooner rather than later.  Let's remove the
"migratable/movable" attribute and add another mode for the way memory
allocations are going to be treated.  But let's name it something else than
"default" and let's explain the name properly in man pages and documentation. Since naming is hard and names I come up with are usually bad I can only suggest a lowest bottom fallback if we can't come up with anything else =) Let's say
something like "restrictive".
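
For concreteness, a sketch of how that might end up looking in the domain XML (the attribute value "restrictive" is only the working name suggested here, and the exact semantics are still open):

```xml
<numatune>
  <memory mode="restrictive" nodeset="1-4,^3"/>
  <memnode cellid="0" mode="restrictive" nodeset="1"/>
</numatune>
```

With this mode libvirt would only set cpuset.mems and would not pass `host-nodes`/`policy` to qemu, so no mbind() is done and the kernel's default policy stays in effect.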

I have no better name than yours, so "restrictive" is good I think. I could use it and send out the patch first; other reviewers might then come up with a new name, otherwise we keep it.

Sure, I don't like that the only difference between "strict" and "the_new_name" would be whether we pass the options to qemu or not, but saying that would not
really help users very much.

Does that sound OK to you?


Yes, it's good. Thank you so much for your comments.

Regards,
Luyao

[1]https://www.infradead.org/~mchehab/kernel_docs/admin-guide/mm/numa_memory_policy.html

But now we have another way of enforcing this, using qemu cmdline
option.  The
names actually map 1:1 to those as well:


https://gitlab.com/qemu-project/qemu/-/blob/master/qapi/machine.json#L901



So my idea was that we would add a movable/migratable/whatever
attribute
that
would tell us which way for enforcing we use because there does not
seem
to be
"one size fits all" solution.  Am I misunderstanding this discussion?
Please
correct me if I am.  Thank you.

Actually I need support for the default memory policy (the memory policy
that is 'hard coded' into the kernel); I thought "migratable" was enough to indicate

So I am getting on your track, yes.  What you mean is basically
MPOL_DEFAULT and
that's where the naming probably comes from, right?  Anyway, what we're
trying
to do is not restrict us from other options, even if they are only
possible in
the future.  So instead of adding "default" which would actually mean
"strict"
(because you still use cpuset.mems) which would restrict us from
potentially
being able to migrate with a different policy than "strict" (even
though it
might not make sense for "preferred", for example) and it's also a bit
confusing
as I mentioned above, using "cpuset.mems" does not mean "strict" memory
policy.
for users, I suggested we add "migratable" which restricts just the qemu
options.  Of course, "migratable" only makes sense with "strict" now,
but that's
fine.  The XML provides a possibility for something we don't support,
but we can
forbid that combination for the sake of clarity of the other option that
_is_
supported.

I'll try to propose my idea based on your patch from Nov 3rd; that might
improve my communication.  It feels difficult for me to explain myself
without the code.  I just need to deal with a lot of other emails first.

Thank you in advance. Let's discuss later based on the patch.

Regards,
Luyao
that we rely on the operating system to apply the memory policy. So when
"migratable" is set, "mode" should not be set. But when I was coding, I
found the "mode" default value is "strict"; it is always "strict" even if
"migratable" is yes, which means we configure two different memory
policies at the same time. So I still need a new option for "mode" to make it not conflict with "migratable", and if we have the new
option ("default") for "mode", it seems we can drop "migratable".

Besides, we can make "mode" a "one size fits all" solution: just reject a different "mode" value configured in a memnode element when "mode"
is "default" in the memory element.
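
As a sketch of that rule (the 'default' value and the rejection behaviour are the proposal here, not existing libvirt syntax):

```xml
<numatune>
  <!-- per-domain "default" mode -->
  <memory mode="default" nodeset="1-2"/>
  <!-- accepted: the memnode repeats "default" -->
  <memnode cellid="0" mode="default" nodeset="1"/>
  <!-- rejected: a conflicting mode while the memory mode is "default" -->
  <memnode cellid="1" mode="preferred" nodeset="2"/>
</numatune>
```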

I summarized it in a new email:
https://www.redhat.com/archives/libvir-list/2020-November/msg00084.html

Sorry I didn't make it easy to understand.

Regards,
Luyao
So I need an option to indicate "I don't specify any mode."

On Oct 16, 2020, at 20:34, Zhong, Luyao <luyao.zh...@intel.com> wrote:

Hi Martin, Peter and other experts,

We reached a consensus earlier that we need to introduce a new
"migratable" attribute. But during implementation, I found that
introducing a new 'default' option for the existing mode attribute is
still necessary.

I have an initial patch for 'migratable', and Peter has already given
some comments:
https://www.redhat.com/archives/libvir-list/2020-October/msg00396.html



The current issue is: if I set 'migratable', any 'mode' should be
ignored. Peter commented that I can't rely on the docs to tell users
some config is invalid; I need to reject the config in the code, and I
completely agree with that. But the 'mode' default value is
'strict', which will always conflict with 'migratable', so in the end
I still need to introduce a new option for 'mode' which can be a
legal config when 'migratable' is set.

If we have 'default' option, is 'migratable' still needed then?

FYI.
The 'mode' corresponds to the memory policy, and there is already a
notion of a default memory policy.
  quote:
    System Default Policy:  this policy is "hard coded" into the
kernel.
(https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt)

So it might be easier to understand if we introduce a 'default'
option directly.

Regards,
Luyao

On 8/26/2020 6:20 AM, Martin Kletzander wrote:
On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:


On 8/19/2020 11:24 PM, Martin Kletzander wrote:
On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:


-----Original Message-----
From: Martin Kletzander <mklet...@redhat.com>
Sent: Monday, August 17, 2020 4:58 PM
To: Zhong, Luyao <luyao.zh...@intel.com>
Cc: libvir-list@redhat.com; Zang, Rui <rui.z...@intel.com>;
Michal
Privoznik
<mpriv...@redhat.com>
Subject: Re: [libvirt][RFC PATCH] add a new 'default'
option for
attribute mode
in numatune

On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:


On 8/7/2020 4:24 PM, Martin Kletzander wrote:
On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao
wrote:


On 8/3/2020 7:00 PM, Martin Kletzander wrote:
On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong
wrote:
Hi Libvirt experts,

I would like to enhance the numatune snippet configuration. Given an
example snippet:

<domain>
  ...
  <numatune>
    <memory mode="strict" nodeset="1-4,^3"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
    <memnode cellid="2" mode="preferred" nodeset="2"/>
  </numatune>
  ...
</domain>

Currently, attribute mode is either 'interleave', 'strict', or
'preferred'; I propose to add a new 'default' option, for the
following reason.

Presume we are using cgroups v1. Libvirt sets cpuset.mems for all
vcpu threads according to 'nodeset' in the memory element, and
translates the memnode elements to qemu config options (--object
memory-backend-ram) for each numa cell, which invokes the mbind()
system call in the end.[1]

But what if we want to use the default memory policy and request each
guest numa cell pinned to different host memory nodes? We can't
use mbind via qemu config options, because (I quote) "For
MPOL_DEFAULT, the nodemask and maxnode arguments must specify
the empty set of nodes." [2]

So my solution is to introduce a new 'default' option for attribute
mode, e.g.

<domain>
  ...
  <numatune>
    <memory mode="default" nodeset="1-2"/>
    <memnode cellid="0" mode="default" nodeset="1"/>
    <memnode cellid="1" mode="default" nodeset="2"/>
  </numatune>
  ...
</domain>

If the mode is 'default', libvirt should avoid generating the qemu
command line option '--object memory-backend-ram', and instead invoke
cgroups to set cpuset.mems for each guest numa cell, combining it with
the numa topology config. Presume the numa topology is:

<cpu>
  ...
  <numa>
    <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
    <cell id='1' cpus='4-7' memory='512000' unit='KiB'/>
  </numa>
  ...
</cpu>

Then libvirt should set cpuset.mems to '1' for vcpus 0-3,
and '2'
for vcpus 4-7.


Is this reasonable and feasible? Welcome any comments.


There are couple of problems here.  The memory is not
(always)
allocated by the vCPU threads.  I also remember it to
not be
allocated by the process, but in KVM in a way that was not
affected
by the cgroup settings.

Thanks for your reply. Maybe I don't get what you mean,
could you
give me more context? But what I proposed will have no
effect on
other memory allocation.


Check how cgroups work.  We can set the memory nodes that a
process
will allocate from.  However to set the node for the
process
(thread) QEMU needs to be started with the vCPU threads
already
spawned (albeit stopped).  And for that QEMU already
allocates some
memory.  Moreover if extra memory was allocated after we
set
the
cpuset.mems it is not guaranteed that it will be
allocated by
the
vCPU in that NUMA cell, it might be done in the emulator
instead or
the KVM module in the kernel in which case it might not be
accounted
for the process actually causing the allocation (as we've
already
seen with Linux).  In all these cases cgroups will not do
what you
want them to do.  The last case might be fixed, the first
ones are
by default not going to work.

That might be
fixed now,
however.

But basically what we are up against is all the reasons why we
started using QEMU's command line arguments for all that.

I'm not proposing to use QEMU's command line arguments; on the contrary,
I want to use the cgroups settings to support a new
config/requirement. I'm giving a solution for the case where we require
the default memory policy together with memory numa pinning.


And I'm suggesting you look at the commit log to see why we
*had* to
add these command line arguments, even though I think I
managed to
describe most of them above already (except for one that
_might_
already be fixed in the kernel).  I understand the git log
is huge
and the code around NUMA memory allocation was changing a
lot, so I
hope my explanation will be enough.

Thank you for the detailed explanation, I think I get it now. We can't
guarantee the memory allocation matches the requirement, since there is
a window of time before cpuset.mems is set.


That's one of the things, although this one could be
avoided (by
setting a global
cgroup before exec()).

Thanks,
Luyao
Sorry, but I think it will more likely break rather than
fix stuff.
Maybe this
could be dealt with by a switch in `qemu.conf` with a huge
warning
above it.

I'm not trying to fix something; I'm proposing how to support a new
requirement, just like I stated above.


I guess we should take a couple of steps back, I don't get
what you
are trying to achieve.  Maybe if you describe your use case
it will
be easier to reach a conclusion.

Yeah, I do have a use case I didn't mention before. It's a feature in
the kernel but not merged yet; we call it memory tiering.
(https://lwn.net/Articles/802544/)

If memory tiering is enabled on the host, DRAM is the top tier memory
and PMEM (persistent memory) is the second tier memory; PMEM shows up
as a numa node without cpus. In short, pages can be migrated between
DRAM and PMEM based on DRAM pressure and how cold/hot they are.

We could configure multiple memory migration paths. For example, with
node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM, we can make 0+2
a group and 1+3 a group. In each group, pages are allowed to migrate
down (demotion) and up (promotion).

If **we want our VMs utilizing memory tiering together with a NUMA topology**,
we need to handle the mapping of guest memory to host memory; that means we
need to bind each guest numa node to a memory node group (DRAM node + PMEM
node) on the host. For example, guest node 0 -> host nodes 0+2.

However, only the cgroups setting can make memory tiering work; if we
use the mbind() system call, demoted pages will never go back to DRAM.
That's why I propose to add a 'default' option and bypass mbind in QEMU.
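
For illustration, with the 0+2 and 1+3 grouping from above, the binding might be written like this (a sketch assuming the proposed 'default' value that bypasses mbind):

```xml
<numatune>
  <memory mode="default" nodeset="0-3"/>
  <!-- guest node 0 -> host DRAM node 0 + PMEM node 2 -->
  <memnode cellid="0" mode="default" nodeset="0,2"/>
  <!-- guest node 1 -> host DRAM node 1 + PMEM node 3 -->
  <memnode cellid="1" mode="default" nodeset="1,3"/>
</numatune>
```

Libvirt would then only set cpuset.mems to "0,2" and "1,3" for the respective guest nodes, leaving the kernel free to demote and promote pages within each group.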

I hope I've made myself understandable. I'd appreciate it if you could
give some suggestions.


This comes around every couple of months/years and bites us
in the
back no
matter what way we go (every time there is someone who
wants it
the
other
way).
That's why I think there could be a way for the user to
specify
whether they will
likely move the memory or not and based on that we would
specify `host-
nodes` and `policy` to qemu or not.  I think I even suggested
this
before (or
probably delegated it to someone else for a suggestion so that
there
is more
discussion), but nobody really replied.

So what we need, I think, is a way for someone to set a
per-domain
information
whether we should bind the memory to nodes in a changeable
fashion or
not.
I'd like to have it in as well.  The way we need to do that
is,
probably, per-
domain, because adding yet another switch for each place in
the
XML
where we
can select a NUMA memory binding would be a suicide.  There
should
also be
no need for this to be enabled per memory-(module, node),
so it
should work
fine.


Thanks for letting us know your vision about this.
 From what I understood, the "changeable fashion" means that
the
guest
numa
cell binding can be changed out of band after initial binding,
either
by system
admin or the operating system (memory tiering in our case), or
whatever the
third party is.  Is that perception correct?

Yes.  If the user wants to have the possibility of changing the
binding,
then we
use *only* cgroups.  Otherwise we use the qemu parameters that
will make
qemu
call mbind() (as that has other pros mentioned above).  The
other
option
would
be extra communication between QEMU and libvirt during start to
let us
know when
to set what cgroups etc., but I don't think that's worth it.

It seems to me mbind() or set_mempolicy() system calls do not
offer that
flexibility of changing afterwards. So in case of QEMU/KVM, I
can only
think
of cgroups.
So to be specific, if we had this additional
"memory_binding_changeable"
option specified, we will try to do the guest numa
constraining via
cgroups
whenever possible. There will probably also be conflicts in
options or
things
that cgroups can not do. For such cases we'd fail the domain.

Basically we'll do what we're doing now and skip the qemu
`host-nodes` and
`policy` parameters with the new option.  And of course we can
fail with
a nice
error message if someone wants to move the memory without the
option
selected
and so on.

Thanks for your comments.

I'd like to get the interface definition in the domain xml clearer
first; then I can go further into the implementation.

As you mentioned, a per-domain option will be better than per-node. I
went through the libvirt domain format to look for a proper position to
place this option. I'm thinking we could still utilize the numatune
element for the configuration:

<numatune>
   <memory mode="strict" nodeset="1-4,^3"/>
   <memnode cellid="0" mode="strict" nodeset="1"/>
   <memnode cellid="2" mode="preferred" nodeset="2"/>
</numatune>

Coincidentally, the optional memory element specifies how to allocate
memory for the domain process on a NUMA host. So can we utilize this
element and introduce a new mode like "changeable" or whatever? Do
you have a better name?

Yeah, I was thinking something along the lines of:
<numatune>
    <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
    <memnode cellid="2" mode="preferred" nodeset="2"/>
</numatune>
If the memory mode is set to 'changeable', we could ignore the mode
setting for each memnode, and then we only configure via cgroups. I
have not dived into the code yet, but I expect it could work.

Yes, the example above gives the impression of the attribute being
available
per-node.  But that could be handled in the documentation.
Specifying it per-node seems very weird, why would you want the
memory to be
hard-locked, but for some guest nodes only?
Thanks,
Luyao


If you agree with the direction, I think we can dig deeper to
see what
will
come out.

Regards,
Zang, Rui


Ideally we'd discuss it with others, but I think I am only one
of a
few people
who dealt with issues in this regard.  Maybe Michal (Cc'd)
also
dealt
with some
things related to the binding, so maybe he can chime in.

regards,
Luyao

Have a nice day,
Martin

Regards,
Luyao


[1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
[2] https://man7.org/linux/man-pages/man2/mbind.2.html

--
2.25.1








