Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching

2018-10-12 Thread Jan Beulich
>>> On 11.10.18 at 18:33,  wrote:

> 
>> On Oct 11, 2018, at 5:15 PM, Jan Beulich  wrote:
>> 
> On 11.10.18 at 17:54,  wrote:
>> 
 On Oct 2, 2018, at 1:47 PM, Jan Beulich  wrote:
 
>>> On 02.10.18 at 12:51,  wrote:
 
> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false.  If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context
 
 As previously pointed out (without any suggestion coming back from
 you), I chose the name "cache" for the lack of a better term. However,
 I certainly disagree with naming it PSC or some such, as its level zero
 is intentionally there to be eventually used for non-paging-structure
 data.
>>> 
>>> I can think of lots of descriptive names which could yield unique 
>>> three-letter acronyms:
>>> 
>>> Logical Read Sequence
>>> Logical Read Series
>>> Logical Read Record
>>> Read Consistency Structure
>>> Consistent Read Structure
>>> Consistent Read Record
>>> Emulation Read Record
>>> […]
>> 
>> Well, I'm not sure LRS, LRR, RCS, CRS, CRR, or ERR would be
>> easily recognizable as what they stand for. To be honest I'd
>> prefer a non-acronym. Did you see my consideration towards
>> "latch”?
> 
> Of course not; that’s why you put the long form name in a comment near the 
> declaration. :-)

Of course I would, but I don't think this would help. You don't
want to always go back to the declaration (in a header) when
you look at a function (in a .c file) using the type. Such names
should be at least halfway self-explanatory.

Jan



Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching

2018-10-11 Thread George Dunlap


> On Oct 2, 2018, at 11:51 AM, Andrew Cooper  wrote:
> 
> On 02/10/18 11:36, Jan Beulich wrote:
> On 25.09.18 at 16:14,  wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>> 
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>> 
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>> 
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>> 
>>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>>> can't currently find enough time to carry out the requested further
>>> rework.
>> Andrew, George?
> 
> You've not fixed anything from my concerns with v1.
> 
> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false.  If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context (not that it would make the behaviour any closer to how hardware
> actually works).
> 
> I'm also not overly happy with the conditional nature of the caching, or
> that it isn't a transparent read-through cache.  This leads to a large
> amount of boilerplate code for every user.

What I’m hearing from you are three basic objections:

1. Although it’s closer to real hardware in some ways, it’s still pretty far 
away

2. The name “cache” is confusing

3. Having it non-transparent adds a lot of boilerplate code to the places that 
do need it.

#2 is easily dealt with.  The other two are reasons to look for better options, 
but not reasons to reject Jan’s series if other improvements are a lot of extra 
work (or it’s not clear they’re better).

Since this is a bug fix, unless you can show that Jan’s series introduces worse 
bugs, I think Jan’s request that you either 1) fix it yourself by 4.12, or 2) 
acquiesce to this series (or something close to it) being accepted is 
reasonable.

If you want to say, “I won’t Ack it but I won’t object if someone else does”, 
then I’ll get to it when I get a chance (hopefully before November).

 -George

Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching

2018-10-11 Thread George Dunlap


> On Oct 11, 2018, at 5:15 PM, Jan Beulich  wrote:
> 
 On 11.10.18 at 17:54,  wrote:
> 
>>> On Oct 2, 2018, at 1:47 PM, Jan Beulich  wrote:
>>> 
>> On 02.10.18 at 12:51,  wrote:
>>> 
 This doesn't behave like real hardware, and definitely doesn't behave as
 named - "struct hvmemul_cache" is simply false.  If it were named
 hvmemul_psc (or some other variation on Paging Structure Cache) then it
 wouldn't be so bad, as the individual levels do make more sense in that
 context
>>> 
>>> As previously pointed out (without any suggestion coming back from
>>> you), I chose the name "cache" for the lack of a better term. However,
>>> I certainly disagree with naming it PSC or some such, as its level zero
>>> is intentionally there to be eventually used for non-paging-structure
>>> data.
>> 
>> I can think of lots of descriptive names which could yield unique 
>> three-letter acronyms:
>> 
>> Logical Read Sequence
>> Logical Read Series
>> Logical Read Record
>> Read Consistency Structure
>> Consistent Read Structure
>> Consistent Read Record
>> Emulation Read Record
>> […]
> 
> Well, I'm not sure LRS, LRR, RCS, CRS, CRR, or ERR would be
> easily recognizable as what they stand for. To be honest I'd
> prefer a non-acronym. Did you see my consideration towards
> "latch”?

Of course not; that’s why you put the long form name in a comment near the 
declaration. :-)

I don’t think I’ve personally used “latch” with that meaning very frequently 
(at least not in the last 10 years), so to me it sounds a bit obscure.  I would 
probably go with something else myself but I don’t object to it.

 -George

Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching

2018-10-11 Thread Jan Beulich
>>> On 11.10.18 at 17:54,  wrote:

>> On Oct 2, 2018, at 1:47 PM, Jan Beulich  wrote:
>> 
> On 02.10.18 at 12:51,  wrote:
>> 
>>> This doesn't behave like real hardware, and definitely doesn't behave as
>>> named - "struct hvmemul_cache" is simply false.  If it were named
>>> hvmemul_psc (or some other variation on Paging Structure Cache) then it
>>> wouldn't be so bad, as the individual levels do make more sense in that
>>> context
>> 
>> As previously pointed out (without any suggestion coming back from
>> you), I chose the name "cache" for the lack of a better term. However,
>> I certainly disagree with naming it PSC or some such, as its level zero
>> is intentionally there to be eventually used for non-paging-structure
>> data.
> 
> I can think of lots of descriptive names which could yield unique 
> three-letter acronyms:
> 
> Logical Read Sequence
> Logical Read Series
> Logical Read Record
> Read Consistency Structure
> Consistent Read Structure
> Consistent Read Record
> Emulation Read Record
> […]

Well, I'm not sure LRS, LRR, RCS, CRS, CRR, or ERR would be
easily recognizable as what they stand for. To be honest I'd
prefer a non-acronym. Did you see my consideration towards
"latch"?

Jan



Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching

2018-10-11 Thread George Dunlap

> On Oct 2, 2018, at 1:47 PM, Jan Beulich  wrote:
> 
 On 02.10.18 at 12:51,  wrote:
> 
>> This doesn't behave like real hardware, and definitely doesn't behave as
>> named - "struct hvmemul_cache" is simply false.  If it were named
>> hvmemul_psc (or some other variation on Paging Structure Cache) then it
>> wouldn't be so bad, as the individual levels do make more sense in that
>> context
> 
> As previously pointed out (without any suggestion coming back from
> you), I chose the name "cache" for the lack of a better term. However,
> I certainly disagree with naming it PSC or some such, as its level zero
> is intentionally there to be eventually used for non-paging-structure
> data.

I can think of lots of descriptive names which could yield unique three-letter 
acronyms:

Logical Read Sequence
Logical Read Series
Logical Read Record
Read Consistency Structure
Consistent Read Structure
Consistent Read Record
Emulation Read Record
[…]

 -George

Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching

2018-10-11 Thread Jan Beulich
>>> On 02.10.18 at 12:51,  wrote:
> On 02/10/18 11:36, Jan Beulich wrote:
> On 25.09.18 at 16:14,  wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>>
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>>
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>>
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>>
>>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>>> can't currently find enough time to carry out the requested further
>>> rework.
>> Andrew, George?
> 
> You've not fixed anything from my concerns with v1.
> 
> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false.  If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context (not that it would make the behaviour any closer to how hardware
> actually works).
> 
> I'm also not overly happy with the conditional nature of the caching, or
> that it isn't a transparent read-through cache.  This leads to a large
> amount of boilerplate code for every user.

So after the call yesterday I've been thinking about this some more,
coming to the conclusion that it is actually what you ask for that
would end up being architecturally incorrect.

First of all, there's absolutely no way to implement a general purpose
cache that mimics hardware behavior. This is simply because we
cannot observe remote (v)CPUs' writes to the cached areas.

What we instead need is a store where we can retain the result of
every independent memory read. Let me illustrate this using the
example of the gather family of instructions: Let's assume such an
instruction has its [xyz]mm index register all zero. This will produce
up to 16 reads from exactly the same memory location. Each of
these reads is an independent one, and hence each of them is
liable to observe different values (due to the coherent nature of
the processor caches and their protocol), if another CPU updates
the location frequently enough. As a result, to correctly replay
(parts of) such an instruction, we'd have to store up to 16
different values all for the same physical memory slot. Obviously
to distinguish them we'd need to somehow tag them.
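
To make this concrete, here is a purely illustrative sketch of such a
tagged, exact-match-only store - not code from the series, and all
names are invented. The tag is whatever distinguishes the independent
reads; for the gather case above it could simply be the element index:

#include <stdbool.h>
#include <stdint.h>

/* One recorded read; the tag makes two independent reads of the same
 * address distinct entries, so a replay can reproduce both values. */
struct read_rec {
    bool valid;
    unsigned int tag;
    uint64_t gpa;
    uint64_t val;
};

#define NR_RECS 16
static struct read_rec recs[NR_RECS];

/* Exact match only: both tag and address must be identical. */
static bool rec_lookup(unsigned int tag, uint64_t gpa, uint64_t *val)
{
    for ( unsigned int i = 0; i < NR_RECS; ++i )
        if ( recs[i].valid && recs[i].tag == tag && recs[i].gpa == gpa )
        {
            *val = recs[i].val;
            return true;
        }

    return false;
}

/* Record what the first pass actually read. */
static void rec_insert(unsigned int tag, uint64_t gpa, uint64_t val)
{
    for ( unsigned int i = 0; i < NR_RECS; ++i )
        if ( !recs[i].valid )
        {
            recs[i] = (struct read_rec){ true, tag, gpa, val };
            return;
        }
}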

This is why the current series does not try to use the "cache" for
anything other than page table elements (but leaves it possible for
this to be added later, without renaming anything). We'd have to
have the insn emulator layer (or, at the latest, the top level
hvmemul_*() hooks) produce such tags, requiring an extension
to the emulator memory access hooks.

Now the same consideration applies to page table reads: Each
level's read is an independent one, and therefore may observe
a value different from a higher level's read even if the same
physical slot is referenced (via recursive page tables). Here,
however, we're in the "luxury" position that we have both
"natural" tagging (the page table level), and we don't need to
care about subsequent writes (unlike the general purpose
cache, the paging structure caches don't "snoop" later writes,
but require active invalidation). It is my understanding that
the different page table levels have, in hardware, entirely
independent paging structure caches, i.e. a level 3 read would
not be satisfied by a level 4 PSC entry; in fact I think the spec
does not exclude either option, but suggests such an
independent behavior as the current choice of implementation.
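
In that model the walk level is the natural tag. Continuing the
illustrative sketch from above (again with invented names;
read_guest_phys() merely stands for whatever does the raw read), a
page table read during emulation / replay would then look like:

uint64_t read_guest_phys(uint64_t gpa); /* invented, not a real hook */

/* A level 3 read never matches a level 4 entry, even if recursive
 * page tables make both reads reference the same physical slot. */
static uint64_t pt_read(unsigned int level, uint64_t gpa)
{
    uint64_t pte;

    if ( rec_lookup(level, gpa, &pte) )
        return pte;              /* replay: reproduce the first pass */

    pte = read_guest_phys(gpa);
    rec_insert(level, gpa, pte);

    return pte;
}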

Similarly there's the write behavior: We specifically would
require insn operand writes to _not_ update the "cache", as
during replay we still need to see the original value; the
aforementioned tagging would prevent this. The exception
here is the setting of A/D bits during page table walks: On
replay we _do_ want to observe them set if the first run
through had to set them. (The same would imo apply to
setting descriptor 

Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching

2018-10-02 Thread Jan Beulich
>>> On 02.10.18 at 12:51,  wrote:
> On 02/10/18 11:36, Jan Beulich wrote:
> On 25.09.18 at 16:14,  wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>>
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>>
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>>
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>>
>>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>>> can't currently find enough time to carry out the requested further
>>> rework.
>> Andrew, George?
> 
> You've not fixed anything from my concerns with v1.

I've responded to your concerns verbally, and you went silent, as is
(I regret having to say so) the case quite often. This simply blocks
any progress. Hence, after enough time went by, I simply took the
liberty to interpret the silence as "the verbal response took care of
my concerns".

> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false.  If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context

As previously pointed out (without any suggestion coming back from
you), I chose the name "cache" for the lack of a better term. However,
I certainly disagree with naming it PSC or some such, as its level zero
is intentionally there to be eventually used for non-paging-structure
data.

> (not that it would make the behaviour any closer to how hardware
> actually works).

I can certainly appreciate this concern of yours, but the whole issue
the series is aiming to address is something that we can't make
behave like hardware does: Hardware doesn't have the concept of
a device model that it needs to wait for responses from, while trying
to make use of the wait time (i.e. scheduling in another CPU).

Once again (I've said so more than once before) - the use of what
I call cache here is a correctness thing, not a performance
improvement (this, if it so happens, is just a nice side effect).
Nothing like that exists on hardware; I'm merely trying to come
close to paging structure caches. As to them - is it anywhere
spelled out that their data must not come from memory, when
doing a cache fill? We simply have no general purpose cache
that the data could come from.

> I'm also not overly happy with the conditional nature of the caching, or
> that it isn't a transparent read-through cache.  This leads to a large
> amount of boilerplate code for every user.

That's all fine, and I can understand it. Yet I hope you don't mean
to demand that, for this correctness fix to become acceptable, I
implement a general purpose cache sitting transparently underneath
all read/write operations? All the more so as it wouldn't even
fulfill the purpose: what the series introduces specifically also
needs to be used for page tables living in MMIO (i.e. regardless of
the cachability of the memory accesses involved; I do realize that
this doesn't currently function for other reasons, but it really
should).

As to the amount of boilerplate code: Besides expressing your
dislike, do you have a concrete suggestion as to how you would
envision this to be avoided in a way covering _all_ cases, i.e. in
particular _all_ callers of guest_walk_tables() et al (i.e. all the
functions the first two patches fiddle with)? If I had seen an
obvious and natural way to achieve this, you may rest assured
that I would have tried to avoid introducing the new function
parameters, for which arguments need to be pushed through
the various layers now.
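
Just to spell out the shape this takes (hypothetical prototypes for
illustration only, not the ones the series actually adds): the store
is threaded through explicitly, and a NULL pointer simply means "no
recording", so callers outside emulation paths are unaffected:

#include <stdint.h>

struct vcpu;
struct read_rec;   /* opaque store of recorded reads (illustrative) */

/* Each layer on a potential emulation path gains one extra parameter. */
int gla_to_gfn(struct vcpu *v, uint64_t gla, uint64_t *gfn,
               struct read_rec *recs /* may be NULL */);
int walk_tables(struct vcpu *v, uint64_t gla, uint64_t *gfn,
                struct read_rec *recs /* may be NULL */);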

Furthermore I couldn't convince myself that doing this in a
parameter-less way would be a good idea (and in the end
provably correct): Of course we could make caches hang off
of struct vcpu, but then we would need to find a model for how
to invalidate them often enough, without invalidating (parts
of) them in ways breaking the 

Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching

2018-10-02 Thread Andrew Cooper
On 02/10/18 11:36, Jan Beulich wrote:
 On 25.09.18 at 16:14,  wrote:
>> Emulation requiring device model assistance uses a form of instruction
>> re-execution, assuming that the second (and any further) pass takes
>> exactly the same path. This is a valid assumption as far as use of CPU
>> registers goes (as those can't change without any other instruction
>> executing in between), but is wrong for memory accesses. In particular
>> it has been observed that Windows might page out buffers underneath
>> an instruction currently under emulation (hitting between two passes).
>> If the first pass translated a linear address successfully, any subsequent
>> pass needs to do so too, yielding the exact same translation.
>>
>> Introduce a cache (used just by guest page table accesses for now, i.e.
>> a form of "paging structure cache") to make sure above described
>> assumption holds. This is a very simplistic implementation for now: Only
>> exact matches are satisfied (no overlaps or partial reads or anything).
>>
>> There's also some seemingly unrelated cleanup here which was found
>> desirable on the way.
>>
>> 1: x86/mm: add optional cache to GLA->GFN translation
>> 2: x86/mm: use optional cache in guest_walk_tables()
>> 3: x86/HVM: implement memory read caching
>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>
>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>> can't currently find enough time to carry out the requested further
>> rework.
> Andrew, George?

You've not fixed anything from my concerns with v1.

This doesn't behave like real hardware, and definitely doesn't behave as
named - "struct hvmemul_cache" is simply false.  If it were named
hvmemul_psc (or some other variation on Paging Structure Cache) then it
wouldn't be so bad, as the individual levels do make more sense in that
context (not that it would make the behaviour any closer to how hardware
actually works).

I'm also not overly happy with the conditional nature of the caching, or
that it isn't a transparent read-through cache.  This leads to a large
amount of boilerplate code for every user.

~Andrew


[Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching

2018-10-02 Thread Jan Beulich
>>> On 25.09.18 at 16:14,  wrote:
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far as use of CPU
> registers goes (as those can't change without any other instruction
> executing in between), but is wrong for memory accesses. In particular
> it has been observed that Windows might page out buffers underneath
> an instruction currently under emulation (hitting between two passes).
> If the first pass translated a linear address successfully, any subsequent
> pass needs to do so too, yielding the exact same translation.
> 
> Introduce a cache (used just by guest page table accesses for now, i.e.
> a form of "paging structure cache") to make sure above described
> assumption holds. This is a very simplistic implementation for now: Only
> exact matches are satisfied (no overlaps or partial reads or anything).
> 
> There's also some seemingly unrelated cleanup here which was found
> desirable on the way.
> 
> 1: x86/mm: add optional cache to GLA->GFN translation
> 2: x86/mm: use optional cache in guest_walk_tables()
> 3: x86/HVM: implement memory read caching
> 4: x86/HVM: prefill cache with PDPTEs when possible
> 
> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
> can't currently find enough time to carry out the requested further
> rework.

Andrew, George?

Jan


