Hi Florian,

Thank you for the feedback. Comments inline below (extensive comments in the case of the memory model question -- it is complex).
regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

On 20/07/18 23:15, Florian Weimer wrote:
> There are a bunch of typos in the JEP (“susbet”, “Rntime”).
> Formatting of the headlines looks wrong to me, too.

Thanks. Yes, the spelling and, more importantly, the format need a lot of work. The JIRA format does not exactly look compelling and the format in the JEP listing is even worse. I am currently working to remedy this.

> On the implementation side, I find this a bit concerning:
>
> + // TODO - remove the following defines
> + // for testing we need to define the extra MAP_XXX flags used for
> + // persistent memory. they will eventually get defined by the
> + // system and included via sys/mman.h
>
> Do you really want to change implementation behavior based on which
> glibc headers (or kernel headers) were used to build the JDK? It's
> likely more appropriate to define the constants yourself if they are
> not available. There is some architecture variance for the MAP_*
> constants, but in case of MAP_SHARED_VALIDATE, even HPPA has the same
> definition.

No, I guess I don't really want to change implementation behaviour based on which glibc headers (or kernel headers) were used to build the JDK. So, I think I need to modify this so that the constants are always defined by FileChannelImpl.c. I have followed the Intel libpmem code in defining the values for these constants.

Can you provide details of the 'architecture variance' you refer to above? I guess it doesn't matter too much while the code is being compiled only for Linux on x86_64/AArch64.
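For reference, the sort of fallback I have in mind would look roughly like the sketch below. The flag values are the ones used by the generic Linux uapi headers (and by libpmem); given your comment about architecture variance they would need re-checking for any other port. The guard structure and the helper name are mine, purely for illustration, and not code that is in the current patch:

    #include <sys/mman.h>

    /*
     * Fallback definitions for toolchains whose sys/mman.h does not yet
     * export the persistent memory mapping flags. Values as per the
     * generic Linux uapi headers -- worth re-checking per architecture.
     */
    #ifndef MAP_SHARED_VALIDATE
    #define MAP_SHARED_VALIDATE 0x03
    #endif
    #ifndef MAP_SYNC
    #define MAP_SYNC 0x080000
    #endif

    /* illustrative helper: map a DAX file requesting synchronous faults */
    static void *map_pmem(int fd, size_t len)
    {
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        return (addr == MAP_FAILED) ? NULL : addr;
    }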
> How does the MappedByteBuffer::force(long, long) method interact with
> the memory model?

That's a very good question, obviously with different answers for AArch64 and x86_64.

I'll note, before citing the relevant documentation, that in the current design I have explicitly allowed for independent pre- and post-sync operations to implement whatever memory sync is required with respect to non-writeback memory operations preceding or following the writebacks. This is to make it easy to revise the current barrier generation choices if needed (and this flexibility comes at no cost once the relevant sync methods are intrinsified by the JIT).

I have currently implemented both pre- and post-sync as an mfence on x86_64, whatever the operation used to flush memory. That is playing safe and the choice merits reviewing. It is a stronger memory sync than that employed by the libpmem code for x86_64, which also relies on mfence (rather than sfence) but i) omits the mfence for the post-sync and ii) when using CLFLUSH to implement the writeback also omits the pre-sync mfence. It is not clear to me from the docs that a post-sync can safely be omitted for CLFLUSHOPT or CLWB, but I do read them as confirming that pre- and post-sync can be safely omitted when using CLFLUSH. Also, the docs suggest only an sfence is needed rather than an mfence -- which sounds right. However, I have followed libpmem and employed mfence for now (more below).

On AArch64 I have currently implemented both pre- and post-sync as a StoreStore barrier -- a dmb(ISHST). This is weaker than the libpmem pre-sync for AArch64, which emits a dmb(ISH), but also stronger as regards the post-sync, in that libpmem does not emit a post-sync fence. On my reading of the ARM Architecture Reference Manual (ARM ARM) I believe a dmb(ISHST) pre and post is what is needed and all that is needed (again, see below for more).

It's worth stating up front that I added the post-sync for both architectures to ensure that the thread doing the writeback cannot update any other NVM before the writeback has completed. This may perhaps be overkill. However, I envisaged the following possibility for error to arise. Imagine that a writeback of 'data' state, e.g. a log record, is followed by a write to an NVM memory location that marks the log record as 'live' (it might, say, be linked into the tail of a list of valid log records). The danger here is that the memory system might, independently of any writeback scheduled by the application, flush the cache line for the 'live' write before the prior 'data' writeback was complete. A crash at this point might leave the log record live but, potentially, incomplete/inconsistent.

Ok, now here is what the relevant documentation says, with my gloss on what it means for the current draft implementation.

x86_64:

The latest Intel documentation, available at

  https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf

describes all 3 instructions available for implementing the cache line writeback: CLFLUSH, CLFLUSHOPT and CLWB.

Vol. 2A 3-139 CLFLUSH

"The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It should be noted that processors are free to speculatively fetch and cache data from system memory regions assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). PREFETCHh instructions can be used to provide the processor with hints for this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLFLUSH instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLFLUSH instruction that references the cache line).

"Executions of the CLFLUSH instruction are ordered with respect to each other and with respect to writes, locked read-modify-write instructions, fence instructions, and executions of CLFLUSHOPT to the same cache line. They are not ordered with respect to executions of CLFLUSHOPT to different cache lines."

n.b. the prefetch clauses above (also repeated in the next two citations) are not important, since all a prefetch might do is bring the same flushed data back into cache without affecting the validity of completion of the writeback. Ordering of loads can also be ignored for similar reasons.

So, I believe the above text indicates that there is no need for a pre-sync or post-sync memory barrier when using CLFLUSH. All writes executed before the CLFLUSH will be completed before the flush is performed, all CLFLUSH operations complete in order of execution before each subsequent CLFLUSH, and the last CLFLUSH completes before any subsequently executed writes can complete and become visible.
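In other words, if the writeback were implemented with CLFLUSH alone then, on that reading, something like the following would already be correctly ordered with no explicit fences. This is just an illustrative sketch of that reading, not the code in the current patch (which plays safe and brackets the flush loop with mfence); the fixed 64-byte line size is an assumption, the real code derives it from the processor:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed line size, for illustration only */

    /* Write back [addr, addr + len) using CLFLUSH only. Per the SDM text
     * cited above, CLFLUSH is ordered against prior writes and against
     * other CLFLUSH executions, so no pre- or post-fence is required. */
    static void force_clflush(const void *addr, size_t len)
    {
        uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        for (; p < end; p += CACHE_LINE)
            _mm_clflush((const void *)p);
    }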
Vol. 2A 3-141 CLFLUSHOPT

[note: the references to CLFLUSH below appear exactly as they occur in the manual -- I believe the two occurrences in the first paragraph are meant to read CLFLUSHOPT]

"The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It should be noted that processors are free to speculatively fetch and cache data from system memory regions assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). PREFETCHh instructions can be used to provide the processor with hints for this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLFLUSH [sic] instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLFLUSH [sic] instruction that references the cache line).

"Executions of the CLFLUSHOPT instruction are ordered with respect to fence instructions and to locked read-modify-write instructions; they are also ordered with respect to the following accesses to the cache line being invalidated: writes, executions of CLFLUSH, and executions of CLFLUSHOPT. They are not ordered with respect to writes, executions of CLFLUSH, or executions of CLFLUSHOPT that access other cache lines; to enforce ordering with such an operation, software can insert an SFENCE instruction between CLFLUSHOPT and that operation."

I believe this indicates that an SFENCE (or a stronger MFENCE) is needed to ensure all writes executed before the first CLFLUSHOPT are visible. Successive CLFLUSHOPT instructions can then complete out of order and may not have completed before any subsequently executed write becomes visible. I think this risks the situation I described above, where a controlling update may be flushed before a flush of the data it controls. Hence I think we require a post-sync MFENCE (or SFENCE).

Vol. 2A 3-146 CLWB

"The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). PREFETCHh instructions can be used to provide the processor with hints for this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLWB instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLWB instruction that references the cache line).

"CLWB instruction is ordered only by store-fencing operations. For example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed instructions to ensure that previous stores are included in the write-back. CLWB instruction need not be ordered by another CLWB or CLFLUSHOPT instruction. CLWB is implicitly ordered with older stores executed by the logical processor to the same address."

I think this is the same story as for CLFLUSHOPT, i.e. an SFENCE is needed before any CLWB instructions are executed to ensure that pending writes are visible, but then successive CLWB instructions can complete out of order and may not have completed before any subsequently executed writes become visible. So, I also think this may need an MFENCE (or SFENCE) after the CLWB instructions.
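So, for CLWB (and likewise CLFLUSHOPT) the ordering I am arguing for is roughly the following. Again, this is just an illustrative sketch of my reading rather than the code generated by the patch (which currently uses mfence in both places); _mm_clwb needs a CLWB-capable compiler and CPU (e.g. gcc with -mclwb) and the fixed line size is an assumption:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed line size, for illustration only */

    /* Write back [addr, addr + len) using CLWB. The leading sfence orders
     * all prior stores before the writebacks; the trailing sfence stops any
     * subsequent store becoming visible before the writebacks complete. */
    static void force_clwb(const void *addr, size_t len)
    {
        uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        _mm_sfence();                          /* pre-sync */
        for (; p < end; p += CACHE_LINE)
            _mm_clwb((void *)p);
        _mm_sfence();                          /* post-sync */
    }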
AArch64:

The latest ARM ARM I have is version C, available at

  https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile

I am following libpmem in using the instruction dc(CVAC) to write back the data cache lines (clean virtual address to point of coherency). I think it would be preferable if dc(CVAP) were used (clean virtual address to point of persistence). However:

1) support for dc(CVAP) does not seem to be a mandatory part of the architecture;

2) user-space access to the CVAP instruction is not guaranteed and is not currently configured by default on Linux;

3) CVAC is expected normally to guarantee writeback to physical memory.

Ordering of cache management instructions is specified in section D3.4.8 under the heading "Ordering and completion of data and instruction cache instructions" on page D3-2069 as follows:

"All data cache instructions, other than DC ZVA, that specify an address:

- Execute in program order relative to loads or stores that access an address in Normal memory with either Inner Write Through or Inner Write Back attributes within the same cache line of minimum size, as indicated by CTR_EL0.DMinLine.

- Can execute in any order relative to loads or stores that access any address with the Device memory attribute, or with Normal memory with Inner Non-cacheable attribute unless a DMB or DSB is executed between the instructions.

- Execute in program order relative to other data cache maintenance instructions, other than DC ZVA, that specify an address within the same cache line of minimum size, as indicated by CTR_EL0.DMinLine.

- Can execute in any order relative to loads or stores that access an address in a different cache line of minimum size, as indicated by CTR_EL0.DMinLine, unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to other data cache maintenance instructions, other than DC ZVA, that specify an address in a different cache line of minimum size, as indicated by CTR_EL0.DMinLine, unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to data cache maintenance instructions that do not specify an address unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to instruction cache maintenance instructions unless a DSB is executed between the instructions."

. . .

"All data cache maintenance instructions that do not specify an address:

- Can execute in any order relative to data cache maintenance instructions that do not specify an address unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to data cache maintenance instructions that specify an address, other than Data Cache Zero, unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to loads or stores unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to instruction cache maintenance instructions unless a DSB is executed between the instructions.

All instruction cache instructions can execute in any order relative to other instruction cache instructions, data cache instructions, loads, and stores unless a DSB is executed between the instructions.

A cache maintenance instruction can complete at any time after it is executed, but is only guaranteed to be complete, and its effects visible to other observers, following a DSB instruction executed by the PE that executed the cache maintenance instruction.
See also the requirements for cache maintenance instructions in Completion and endpoint ordering on page B2-97.

In all cases, where the text in this section refers to a DMB or a DSB, this means a DMB or DSB whose required access type is both loads and stores."

The link to B2-97 provides some further clarification of what it means for the above operations to be 'complete' (I have omitted the clauses regarding completion of translation table walks and TLB invalidates).

"For all memory, the completion rules are defined as:

- A read R1 to a Location is complete for a shareability domain when all of the following are true:
  - Any write to the same Location by an Observer within the shareability domain will be Coherence-after R1.
  - Any translation table walks associated with R1 are complete for that shareability domain.

- A write W1 to a Location is complete for a shareability domain when all of the following are true:
  - Any write to the same Location by an Observer within the shareability domain will be Coherence-after W1.
  - Any read to the same Location by an Observer within the shareability domain will either Reads-from W1 or Reads-from a write that is Coherence-after W1.
  - Any translation table walks associated with the write are complete for that shareability domain.

. . .

- A cache maintenance instruction is complete for a shareability domain when the memory effects of the instruction are complete for that shareability domain, and any translation table walks that arise from the instruction are complete for that shareability domain.

. . .

The completion of any cache or TLB maintenance instruction includes its completion on all PEs that are affected by both the instruction and the DSB operation that is required to guarantee visibility of the maintenance instruction."

So, my reading of this is:

i) a DMB is needed to ensure that writes executed before the dc(CVAC) are completed before the writeback starts

ii) a DMB(ISHST) is all that is needed to achieve this pre-ordering

iii) a DMB is needed to ensure that pending writebacks are completed before any subsequently executed writes can become visible

iv) a DMB(ISHST) is all that is needed to achieve this post-ordering
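Putting that reading together, the AArch64 writeback would look roughly like the sketch below. This is hand-written C with inline assembly purely to illustrate the dmb(ISHST) / dc(CVAC) / dmb(ISHST) sequence I am describing; the patch itself generates the equivalent code in the JIT, and the fixed 64-byte line size is an assumption (the real code should derive it from CTR_EL0.DMinLine):

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed; should come from CTR_EL0.DMinLine */

    /* Write back [addr, addr + len) to the point of coherency with DC CVAC,
     * bracketed by StoreStore barriers: the first dmb orders prior stores
     * before the cache maintenance, the second keeps later stores from
     * becoming visible before the writebacks complete. */
    static void force_cvac(const void *addr, size_t len)
    {
        uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        __asm__ volatile("dmb ishst" ::: "memory");              /* pre-sync  */
        for (; p < end; p += CACHE_LINE)
            __asm__ volatile("dc cvac, %0" : : "r"(p) : "memory");
        __asm__ volatile("dmb ishst" ::: "memory");              /* post-sync */
    }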