Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
>>> On 11.10.18 at 18:33, wrote:
> >> On Oct 11, 2018, at 5:15 PM, Jan Beulich wrote:
>> > On 11.10.18 at 17:54, wrote:
>> On Oct 2, 2018, at 1:47 PM, Jan Beulich wrote:
>>> On 02.10.18 at 12:51, wrote:
> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false. If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context

As previously pointed out (without any suggestion coming back from you), I
chose the name "cache" for the lack of a better term. However, I certainly
disagree with naming it PSC or some such, as its level zero is intentionally
there to be eventually used for non-paging-structure data.

>>>
>>> I can think of lots of descriptive names which could yield unique
>>> three-letter acronyms:
>>>
>>> Logical Read Sequence
>>> Logical Read Series
>>> Logical Read Record
>>> Read Consistency Structure
>>> Consistent Read Structure
>>> Consistent Read Record
>>> Emulation Read Record
>>> […]
>>
>> Well, I'm not sure LRS, LRR, RCS, CRS, CRR, or ERR would be
>> easily recognizable as what they stand for. To be honest I'd
>> prefer a non-acronym. Did you see my consideration towards
>> "latch"?
>
> Of course not; that's why you put the long form name in a comment near the
> declaration. :-)

Of course I would, but I don't think this would help. You don't want to
always go back to the declaration (in a header) when you look at a function
(in a .c file) using the type. Such names should be at least half way
self-explanatory.

Jan
Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
> On Oct 2, 2018, at 11:51 AM, Andrew Cooper wrote:
>
> On 02/10/18 11:36, Jan Beulich wrote:
> On 25.09.18 at 16:14, wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>>
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>>
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>>
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>>
>>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>>> can't currently find enough time to carry out the requested further
>>> rework.
>> Andrew, George?
>
> You've not fixed anything from my concerns with v1.
>
> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false. If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context (not that it would make the behaviour any closer to how hardware
> actually works).
>
> I'm also not overly happy with the conditional nature of the caching, or
> that it isn't a transparent read-through cache. This leads to a large
> amount of boilerplate code for every user.

What I’m hearing from you are three basic objections:

1. Although it’s closer to real hardware in some ways, it’s still pretty far away
2. The name “cache” is confusing
3. Having it non-transparent adds a lot of boilerplate code to the places that do need it.

#2 is easily dealt with. The other two are reasons to look for better
options, but not reasons to reject Jan’s series if other improvements are a
lot of extra work (or it’s not clear they’re better).

Since this is a bug fix, unless you can show that Jan’s series introduces
worse bugs, I think Jan’s request that you either 1) fix it yourself by
4.12, or 2) acquiesce to this series (or something close to it) being
accepted is reasonable.

If you want to say, “I won’t Ack it but I won’t object if someone else
does”, then I’ll get to it when I get a chance (hopefully before November).

 -George
Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
> On Oct 11, 2018, at 5:15 PM, Jan Beulich wrote:
> On 11.10.18 at 17:54, wrote:
>
>>> On Oct 2, 2018, at 1:47 PM, Jan Beulich wrote:
>>>
>> On 02.10.18 at 12:51, wrote:
>>> This doesn't behave like real hardware, and definitely doesn't behave as
>>> named - "struct hvmemul_cache" is simply false. If it were named
>>> hvmemul_psc (or some other variation on Paging Structure Cache) then it
>>> wouldn't be so bad, as the individual levels do make more sense in that
>>> context
>>>
>>> As previously pointed out (without any suggestion coming back from
>>> you), I chose the name "cache" for the lack of a better term. However,
>>> I certainly disagree with naming it PSC or some such, as its level zero
>>> is intentionally there to be eventually used for non-paging-structure
>>> data.
>>
>> I can think of lots of descriptive names which could yield unique
>> three-letter acronyms:
>>
>> Logical Read Sequence
>> Logical Read Series
>> Logical Read Record
>> Read Consistency Structure
>> Consistent Read Structure
>> Consistent Read Record
>> Emulation Read Record
>> […]
>
> Well, I'm not sure LRS, LRR, RCS, CRS, CRR, or ERR would be
> easily recognizable as what they stand for. To be honest I'd
> prefer a non-acronym. Did you see my consideration towards
> "latch"?

Of course not; that’s why you put the long form name in a comment near the
declaration. :-)

I don’t think I’ve personally used “latch” with that meaning very frequently
(at least not in the last 10 years), so to me it sounds a bit obscure. I
would probably go with something else myself but I don’t object to it.

 -George
Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
>>> On 11.10.18 at 17:54, wrote:
>> On Oct 2, 2018, at 1:47 PM, Jan Beulich wrote:
>>
> On 02.10.18 at 12:51, wrote:
>>
>>> This doesn't behave like real hardware, and definitely doesn't behave as
>>> named - "struct hvmemul_cache" is simply false. If it were named
>>> hvmemul_psc (or some other variation on Paging Structure Cache) then it
>>> wouldn't be so bad, as the individual levels do make more sense in that
>>> context
>>
>> As previously pointed out (without any suggestion coming back from
>> you), I chose the name "cache" for the lack of a better term. However,
>> I certainly disagree with naming it PSC or some such, as its level zero
>> is intentionally there to be eventually used for non-paging-structure
>> data.
>
> I can think of lots of descriptive names which could yield unique
> three-letter acronyms:
>
> Logical Read Sequence
> Logical Read Series
> Logical Read Record
> Read Consistency Structure
> Consistent Read Structure
> Consistent Read Record
> Emulation Read Record
> […]

Well, I'm not sure LRS, LRR, RCS, CRS, CRR, or ERR would be
easily recognizable as what they stand for. To be honest I'd
prefer a non-acronym. Did you see my consideration towards
"latch"?

Jan
Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
> On Oct 2, 2018, at 1:47 PM, Jan Beulich wrote:
> On 02.10.18 at 12:51, wrote:
>
>> This doesn't behave like real hardware, and definitely doesn't behave as
>> named - "struct hvmemul_cache" is simply false. If it were named
>> hvmemul_psc (or some other variation on Paging Structure Cache) then it
>> wouldn't be so bad, as the individual levels do make more sense in that
>> context
>
> As previously pointed out (without any suggestion coming back from
> you), I chose the name "cache" for the lack of a better term. However,
> I certainly disagree with naming it PSC or some such, as its level zero
> is intentionally there to be eventually used for non-paging-structure
> data.

I can think of lots of descriptive names which could yield unique
three-letter acronyms:

Logical Read Sequence
Logical Read Series
Logical Read Record
Read Consistency Structure
Consistent Read Structure
Consistent Read Record
Emulation Read Record
[…]

 -George
Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
>>> On 02.10.18 at 12:51, wrote:
> On 02/10/18 11:36, Jan Beulich wrote:
> On 25.09.18 at 16:14, wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>>
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>>
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>>
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>>
>>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>>> can't currently find enough time to carry out the requested further
>>> rework.
>> Andrew, George?
>
> You've not fixed anything from my concerns with v1.
>
> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false. If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context (not that it would make the behaviour any closer to how hardware
> actually works).
>
> I'm also not overly happy with the conditional nature of the caching, or
> that it isn't a transparent read-through cache. This leads to a large
> amount of boilerplate code for every user.

So after the call yesterday I've been thinking about this some more, coming
to the conclusion that it is actually what you ask for that would end up
being architecturally incorrect.

First of all, there's absolutely no way to implement a general purpose cache
that mimics hardware behavior. This is simply because we cannot observe
remote (v)CPU-s' writes to the cached areas. What we instead need is a store
where we can retain the result of every independent memory read.

Let me illustrate this with the example of the gather family of
instructions: Let's assume such an instruction has its [xyz]mm index
register all zero. This will produce up to 16 reads from exactly the same
memory location. Each of these reads is an independent one, and hence each
of them is liable to observe different values (due to the coherent nature of
the processor caches and their protocol), if another CPU updates the
location frequently enough. As a result, to correctly replay (parts of) such
an instruction, we'd have to store up to 16 different values all for the
same physical memory slot. Obviously, to distinguish them, we'd need to
somehow tag them. This is why the current series does not try to use the
"cache" for anything other than page table elements (but leaves it possible
for this to be added later, without renaming anything).

We'd have to have the insn emulator layer (or, at the latest, the top level
hvmemul_*() hooks) produce such tags, requiring an extension to the emulator
memory access hooks.

Now the same consideration applies to page table reads: Each level's read is
an independent one, and therefore may observe a value different from a
higher level's read even if the same physical slot is referenced (via
recursive page tables). Here, however, we're in the "luxury" position that
we have both "natural" tagging (the page table level), and we don't need to
care about subsequent writes (other than the general purpose cache, the
paging structure caches don't "snoop" later writes, but require active
invalidation). It is my understanding that the different page table levels
have, in hardware, entirely independent paging structure caches, i.e. a
level 3 read would not be satisfied by a level 4 PSC entry; in fact I think
the spec does not exclude either option, but suggests such independent
behavior as the current choice of implementation.

Similarly there's the write behavior: We specifically would require insn
operand writes to _not_ update the "cache", as during replay we still need
to see the original value; the aforementioned tagging would prevent this.
The exception here is the setting of A/D bits during page table walks: On
replay we _do_ want to observe them set if the first run through had to set
them. (The same would imo apply to setting descriptor
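
To make the level-tagging idea above concrete, a minimal sketch of such an
exact-match read store might look like the following. All names, sizes, and
the flat per-level layout are assumptions made purely for illustration; this
is not the series' actual hvmemul_cache code.

#include <stdbool.h>
#include <stdint.h>

#define REC_LEVELS   5   /* level 0 reserved for non-paging-structure data */
#define REC_ENTRIES  8   /* arbitrary per-level capacity for this sketch */

struct read_rec_entry {
    uint64_t gpa;        /* guest-physical address that was read */
    uint64_t data;       /* value observed on the first pass */
};

struct read_record {
    struct read_rec_entry ent[REC_LEVELS][REC_ENTRIES];
    unsigned int nr[REC_LEVELS];
};

/* Exact-match lookup, tagged by walk level; no overlap/partial handling. */
bool record_lookup(const struct read_record *r, unsigned int level,
                   uint64_t gpa, uint64_t *data)
{
    for (unsigned int i = 0; i < r->nr[level]; i++)
        if (r->ent[level][i].gpa == gpa) {
            *data = r->ent[level][i].data;
            return true;
        }
    return false;
}

/* Record a read, or update it (e.g. when A/D bits get set during the walk). */
void record_insert(struct read_record *r, unsigned int level,
                   uint64_t gpa, uint64_t data)
{
    for (unsigned int i = 0; i < r->nr[level]; i++)
        if (r->ent[level][i].gpa == gpa) {
            r->ent[level][i].data = data;
            return;
        }
    if (r->nr[level] < REC_ENTRIES) {
        r->ent[level][r->nr[level]].gpa = gpa;
        r->ent[level][r->nr[level]].data = data;
        r->nr[level]++;
    }
}

The first pass of a walk would insert what it reads, and any replay pass
would be satisfied by lookup, so both passes observe the same values; slot 0
stays free for non-paging-structure data, as mentioned earlier in the thread.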
Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
>>> On 02.10.18 at 12:51, wrote:
> On 02/10/18 11:36, Jan Beulich wrote:
> On 25.09.18 at 16:14, wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>>
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>>
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>>
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>>
>>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>>> can't currently find enough time to carry out the requested further
>>> rework.
>> Andrew, George?
>
> You've not fixed anything from my concerns with v1.

I've responded to your concerns verbally, and you went silent, as is (I
regret having to say so) the case quite often. This simply blocks any
progress. Hence, after enough time went by, I simply took the liberty to
interpret the silence as "the verbal response took care of my concerns".

> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false. If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context

As previously pointed out (without any suggestion coming back from you), I
chose the name "cache" for the lack of a better term. However, I certainly
disagree with naming it PSC or some such, as its level zero is intentionally
there to be eventually used for non-paging-structure data.

> (not that it would make the behaviour any closer to how hardware
> actually works).

I can certainly appreciate this concern of yours, but the whole issue the
series is aiming to address is something that we can't make behave like
hardware does: Hardware doesn't have the concept of a device model that it
needs to wait for responses from, while trying to make use of the wait time
(i.e. scheduling in another CPU). Once again (I've said so more than once
before) - the use of what I call a cache here is a correctness thing, not a
performance improvement (this, if it so happens, is just a nice side
effect). Nothing like that exists in hardware; I'm merely trying to come
close to paging structure caches. As to them - is it anywhere spelled out
that their data must not come from memory when doing a cache fill? We simply
have no general purpose cache that the data could come from.

> I'm also not overly happy with the conditional nature of the caching, or
> that it isn't a transparent read-through cache. This leads to a large
> amount of boilerplate code for every user.

That's all fine, and I can understand it. Yet I hope you don't mean to ask
that, for this correctness fix to become acceptable, I implement a general
purpose cache sitting transparently underneath all read/write operations?
All the more so since it wouldn't even fulfill the purpose, as what the
series introduces specifically also needs to be used for page tables living
in MMIO (i.e. regardless of the cachability of the memory accesses involved;
I do realize that this doesn't currently function for other reasons, but it
really should).

As to the amount of boilerplate code: Besides expressing your dislike, do
you have a concrete suggestion as to how you would envision this to be
avoided in a way covering _all_ cases, i.e. in particular _all_ callers of
guest_walk_tables() et al (i.e. all the functions the first two patches
fiddle with)? If I had seen an obvious and natural way to achieve this, you
may rest assured that I would have tried to avoid introducing the new
function parameters, for which arguments need to be pushed through the
various layers now. Furthermore I couldn't convince myself that doing this
in a parameter-less way would be a good idea (and in the end provably
correct): Of course we could make caches hang off of struct vcpu, but then
we would need to find a model for how to invalidate them often enough,
without invalidating (parts of) them in ways breaking the
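
For the boilerplate question, the shape being debated looks roughly like
this. The names and signatures below are hypothetical stand-ins, not the
actual Xen prototypes: callers that need replay consistency thread a record
through the translation layers, while everyone else passes NULL and is
unaffected.

#include <stddef.h>
#include <stdint.h>

struct vcpu;                    /* opaque for this sketch */
struct read_record;             /* per-vCPU read store, as sketched earlier */

typedef uint64_t gfn_t;

/* Hypothetical GLA->GFN entry point gaining an optional record argument. */
int gla_to_gfn(struct vcpu *v, uint64_t gla, uint32_t pfec, gfn_t *gfn,
               struct read_record *rec);

/* An emulation path that must see identical translations on replay. */
int emul_linear_to_gfn(struct vcpu *v, uint64_t gla, gfn_t *gfn,
                       struct read_record *rec)
{
    return gla_to_gfn(v, gla, 0 /* pfec */, gfn, rec);
}

/* A caller with no replay concern simply passes NULL. */
int plain_linear_to_gfn(struct vcpu *v, uint64_t gla, gfn_t *gfn)
{
    return gla_to_gfn(v, gla, 0 /* pfec */, gfn, NULL);
}

A record hanging off struct vcpu would remove the extra parameter, but, as
argued above, would then need an invalidation model that is provably correct.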
Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
On 02/10/18 11:36, Jan Beulich wrote:
On 25.09.18 at 16:14, wrote:
>> Emulation requiring device model assistance uses a form of instruction
>> re-execution, assuming that the second (and any further) pass takes
>> exactly the same path. This is a valid assumption as far as use of CPU
>> registers goes (as those can't change without any other instruction
>> executing in between), but is wrong for memory accesses. In particular
>> it has been observed that Windows might page out buffers underneath
>> an instruction currently under emulation (hitting between two passes).
>> If the first pass translated a linear address successfully, any subsequent
>> pass needs to do so too, yielding the exact same translation.
>>
>> Introduce a cache (used just by guest page table accesses for now, i.e.
>> a form of "paging structure cache") to make sure above described
>> assumption holds. This is a very simplistic implementation for now: Only
>> exact matches are satisfied (no overlaps or partial reads or anything).
>>
>> There's also some seemingly unrelated cleanup here which was found
>> desirable on the way.
>>
>> 1: x86/mm: add optional cache to GLA->GFN translation
>> 2: x86/mm: use optional cache in guest_walk_tables()
>> 3: x86/HVM: implement memory read caching
>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>
>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>> can't currently find enough time to carry out the requested further
>> rework.
> Andrew, George?

You've not fixed anything from my concerns with v1.

This doesn't behave like real hardware, and definitely doesn't behave as
named - "struct hvmemul_cache" is simply false. If it were named
hvmemul_psc (or some other variation on Paging Structure Cache) then it
wouldn't be so bad, as the individual levels do make more sense in that
context (not that it would make the behaviour any closer to how hardware
actually works).

I'm also not overly happy with the conditional nature of the caching, or
that it isn't a transparent read-through cache. This leads to a large
amount of boilerplate code for every user.

~Andrew
[Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
>>> On 25.09.18 at 16:14, wrote:
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far as use of CPU
> registers goes (as those can't change without any other instruction
> executing in between), but is wrong for memory accesses. In particular
> it has been observed that Windows might page out buffers underneath
> an instruction currently under emulation (hitting between two passes).
> If the first pass translated a linear address successfully, any subsequent
> pass needs to do so too, yielding the exact same translation.
>
> Introduce a cache (used just by guest page table accesses for now, i.e.
> a form of "paging structure cache") to make sure above described
> assumption holds. This is a very simplistic implementation for now: Only
> exact matches are satisfied (no overlaps or partial reads or anything).
>
> There's also some seemingly unrelated cleanup here which was found
> desirable on the way.
>
> 1: x86/mm: add optional cache to GLA->GFN translation
> 2: x86/mm: use optional cache in guest_walk_tables()
> 3: x86/HVM: implement memory read caching
> 4: x86/HVM: prefill cache with PDPTEs when possible
>
> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
> can't currently find enough time to carry out the requested further
> rework.

Andrew, George?

Jan
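
As an illustration of how an exact-match store of this kind keeps a replayed
walk consistent with the first pass, a sketch along the following lines is
possible. The helper names (record_lookup, record_insert, guest_phys_read8)
are hypothetical stand-ins; the actual series routes this through
guest_walk_tables() rather than a helper like the one shown here.

#include <stdbool.h>
#include <stdint.h>

struct vcpu;                 /* opaque here */
struct read_record;          /* the exact-match store described above */

/* Hypothetical helpers standing in for the real cache and memory accessors. */
bool record_lookup(const struct read_record *r, unsigned int level,
                   uint64_t gpa, uint64_t *data);
void record_insert(struct read_record *r, unsigned int level,
                   uint64_t gpa, uint64_t data);
uint64_t guest_phys_read8(struct vcpu *v, uint64_t gpa);

/*
 * Read one page table entry during a walk: the first pass fills the record,
 * a re-execution pass is satisfied from it, so both passes observe the very
 * same PTE even if the guest has paged out or remapped the area in between.
 */
uint64_t read_pte_consistent(struct vcpu *v, unsigned int level,
                             uint64_t gpa, struct read_record *rec)
{
    uint64_t pte;

    if (rec && record_lookup(rec, level, gpa, &pte))
        return pte;                        /* replay: reuse first-pass value */

    pte = guest_phys_read8(v, gpa);        /* independent read from memory */

    if (rec)
        record_insert(rec, level, gpa, pte);
    return pte;
}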