Hi Florian,

Thank you for the feedback. Comments inline below (extensive comments in
the case of the memory model question - it is complex).

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

On 20/07/18 23:15, Florian Weimer wrote:
> There are a bunch of typos in the JEP (“susbet”, “Rntime”).
> Formatting of the headlines looks wrong to me, too.

Thanks. Yes, the spelling and, more importantly, the formatting need a
lot of work. The JIRA format does not exactly look compelling and the
format in the JEP listing is even worse. I am currently working to
remedy this.

> On the implementation side, I find this a bit concerning:
> 
> +    // TODO - remove the following defines
> +    // for testing we need to define the extra MAP_XXX flags used for
> +    // persistent memory. they will eventually get defined by the
> +    // system and included via sys/mman.h
> 
> Do you really want to change implementation behavior based on which
> glibc headers (or kernel headers) were used to build the JDK?  It's
> likely more appropriate to define the constants yourself if they are
> not available.  There is some architecture variance for the MAP_*
> constants, but in case of MAP_SHARED_VALIDATE, even HPPA has the same
> definition.

No, I guess I don't really want to change implementation behaviour based
on which glibc headers (or kernel headers) were used to build the JDK.
So, I think I need to modify this so the constants are always defined by
FileChannelImpl.c.

I have followed the Intel libpmem code in defining the values for these
constants. Can you provide details of the 'architecture variance' you
refer to above? I guess it doesn't matter too much while the code is
being compiled only for Linux on x86_64/AArch64.
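
For concreteness, something along these lines (the guards and placement
are illustrative only; the values are the ones libpmem and the current
Linux asm-generic uapi headers use):

    #include <sys/mman.h>

    /* fall back to the asm-generic values when the build headers are
       too old to provide them */
    #ifndef MAP_SHARED_VALIDATE
    #define MAP_SHARED_VALIDATE 0x03
    #endif

    #ifndef MAP_SYNC
    #define MAP_SYNC 0x080000
    #endif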

> How does the MappedByteBuffer::force(long, long) method interact with
> the memory model?

That's a very good question, obviously with different answers for
AArch64 and x86_64.

I'll note before citing the relevant documentation that in the current
design I have explicitly allowed for independent pre- and post-sync
operations to implement whatever memory sync is required with respect to
non-writeback memory operations preceding or following the writebacks.
This is to make it easy to revise the current barrier generation choices
if needed (and this flexibility comes at no cost once the relevant sync
methods are intrinsified by the JIT).
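
For what it's worth, here is a rough C-level sketch of that shape. The
names, the macro plumbing and the assumed 64-byte line size are mine,
purely for illustration; in the patch the pre and post sync operations
are separate methods that the JIT can intrinsify:

    #include <stddef.h>
    #include <stdint.h>

    /* placeholder steps so the sketch stands alone; x86_64 and AArch64
       candidates are sketched further below */
    #define PRE_WRITEBACK_SYNC()    ((void)0)
    #define WRITEBACK_LINE(line)    ((void)(line))
    #define POST_WRITEBACK_SYNC()   ((void)0)

    #define CACHE_LINE_SIZE 64      /* assumed for illustration */

    /* force over a mapped range = pre-sync, per-line writeback,
       post-sync; each step can be revised independently */
    void force_range(const char *addr, size_t len)
    {
        const char *end  = addr + len;
        const char *line = (const char *)
            ((uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1));

        PRE_WRITEBACK_SYNC();              /* order earlier stores      */
        for (; line < end; line += CACHE_LINE_SIZE)
            WRITEBACK_LINE(line);          /* flush one cache line      */
        POST_WRITEBACK_SYNC();             /* order against later stores */
    }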

I have currently implemented both the pre- and post-sync as an mfence on
x86_64, whatever operation is used to flush memory. That is playing safe
and the choice merits reviewing. It is a stronger memory sync than that
employed by the libpmem code for x86_64, which also relies on mfence
(rather than sfence) but i) omits the mfence for the post-sync and ii)
when using CLFLUSH to implement the writeback also omits the pre-sync
mfence.

It is not clear to me from the docs that a post-sync can safely be
omitted for CLFLUSHOPT or CLWB, but I do read them as confirming that
pre- and post-sync can safely be omitted when using CLFLUSH. Also, the
docs suggest only an sfence is needed rather than an mfence -- which
sounds right. However, I have followed libpmem and employed mfence for
now (more below).
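
For concreteness, here is a sketch of x86_64 candidates for those three
steps as GCC inline asm (illustrative only: the names are mine and this
is not lifted from the patch; it assumes an assembler that knows clwb):

    /* x86_64 only. The draft plays safe with mfence on both sides. */
    static inline void pre_writeback_sync(void)
    {
        /* sfence may suffice; CLFLUSH would need no fence at all */
        __asm__ volatile("mfence" ::: "memory");
    }

    static inline void writeback_line(void *line)
    {
        /* clflushopt or clflush could equally be used here */
        __asm__ volatile("clwb %0" : "+m"(*(volatile char *)line));
    }

    static inline void post_writeback_sync(void)
    {
        /* libpmem omits this; see the 'live' write hazard below */
        __asm__ volatile("mfence" ::: "memory");
    }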

On AArch64 I have currently implemented both pre- and post-sync as a
StoreStore barrier -- a dmb(ISHST). This is weaker than the libpmem
pre-sync for AArch64, which emits a dmb(ISH), but also stronger as
regards the post-sync, in that libpmem does not emit a post-sync fence.
On my reading of the ARM Architecture Reference Manual (ARM ARM) I
believe dmb(ISHST) pre and post is what is needed and all that is needed
(again, see below for more).
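
The corresponding AArch64 sketch (again illustrative only, not lifted
from the patch):

    /* AArch64 only. The draft uses dmb ishst on both sides of the
       dc cvac writebacks; dc cvap would be preferable where available
       (see below). */
    static inline void pre_writeback_sync(void)
    {
        /* libpmem uses dmb ish here */
        __asm__ volatile("dmb ishst" ::: "memory");
    }

    static inline void writeback_line(void *line)
    {
        /* clean by virtual address to point of coherency */
        __asm__ volatile("dc cvac, %0" : : "r"(line) : "memory");
    }

    static inline void post_writeback_sync(void)
    {
        /* libpmem emits no post-sync fence */
        __asm__ volatile("dmb ishst" ::: "memory");
    }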

It's worth stating up front that I added the post-sync for both
architectures to ensure that the thread doing the writeback cannot
update any other NVM location before the writeback has completed. This
may perhaps be overkill. However, I envisaged the following possibility
for an error to arise.

Imagine that a writeback of 'data' state, e.g. a log record, is followed
by a write to an NVM location that marks the log record as 'live' (it
might, say, be linked into the tail of a list of valid log records). The
danger here is that the memory system might, independently of any
writeback scheduled by the application, flush the cache line for the
'live' write before the prior 'data' writeback is complete. A crash at
this point might leave the log record live but, potentially,
incomplete/inconsistent.
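
To make that concrete, here is a schematic of the scenario in C. All the
names (struct log_record, append_record, tail) are invented for
illustration; force_range is the routine from the sketch above:

    #include <stddef.h>
    #include <string.h>

    /* invented record layout living in the mapped NVM region */
    struct log_record {
        struct log_record *next;
        size_t             len;
        char               payload[];
    };

    void force_range(const char *addr, size_t len);   /* sketch above */

    /* schematic append: write the record, force it, then mark it live */
    static void append_record(struct log_record *tail,
                              struct log_record *rec,
                              const void *data, size_t len)
    {
        memcpy(rec->payload, data, len);                   /* 1. write data  */
        rec->len = len;
        force_range((const char *)rec, sizeof *rec + len); /* 2. writeback   */
        tail->next = rec;                                  /* 3. mark 'live' */
        /* without a post-sync fence at the end of step 2, nothing stops
           the cache line holding tail->next reaching NVM (e.g. via an
           eviction) before the step-2 writebacks have completed; a crash
           in that window leaves a live but possibly incomplete record */
    }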

Ok, now here is what the relevant documentation says with my gloss on
what it means for the current draft implementation.

x86_64:

The latest Intel documentation available at
https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
describes all 3 available instructions for implementing the cache line
writeback: CLFLUSH, CLFLUSHOPT and CLWB.

Vol. 2A 3-139 CLFLUSH

"The memory attribute of the page containing the affected line has no
effect on the behavior of this instruction. It should be noted that
processors are free to speculatively fetch and cache data from system
memory regions assigned a memory-type allowing for speculative reads
(such as, the WB, WC, and WT memory types). PREFETCHh instructions can
be used to provide the processor with hints for this speculative
behavior. Because this speculative fetching can occur at any time and is
not tied to instruction execution, the CLFLUSH instruction is not
ordered with respect to PREFETCHh instructions or any of the speculative
fetching mechanisms (that is, data can be speculatively loaded into a
cache line just before, during, or after the execution of a CLFLUSH
instruction that references the cache line).

"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write instructions,
fence instructions, and executions of CLFLUSHOPT to the same cache line.
They are not ordered with respect to executions of CLFLUSHOPT to
different cache lines."

n.b. the prefetch clauses above (also repeated in the next two
citations) are not important, since all a prefetch might do is bring the
same flushed data back into cache without affecting completion of the
writeback. Ordering of loads can also be ignored for similar reasons.

So, I believe the above text indicates that there is no need for a
pre-sync or post-sync memory barrier when using CLFLUSH: all writes
executed before the CLFLUSH will have completed before the flush is
performed; successive CLFLUSH operations complete in execution order;
and the last CLFLUSH completes before any subsequently executed writes
can complete and become visible.

Vol. 2A 3-141 CLFLUSHOPT

[note the references to CLFLUSH below /appear/ exactly as they occur in
the manual -- I believe the two occurrences in the first paragraph are
meant to read CLFLUSHOPT]

"The memory attribute of the page containing the affected line has no
effect on the behavior of this instruction. It should be noted that
processors are free to speculatively fetch and cache data from system
memory regions assigned a memory-type allowing for speculative reads
(such as, the WB, WC, and WT memory types). PREFETCHh instructions can
be used to provide the processor with hints for this speculative
behavior. Because this speculative fetching can occur at any time and is
not tied to instruction execution, the CLFLUSH [sic] instruction is not
ordered with respect to PREFETCHh instructions or any of the speculative
fetching mechanisms (that is, data can be speculatively loaded into a
cache line just before, during, or after the execution of a CLFLUSH
[sic] instruction that references the cache line).

"Executions of the CLFLUSHOPT instruction are ordered with respect to
fence instructions and to locked read- modify-write instructions; they
are also ordered with respect to the following accesses to the cache
line being invalidated: writes, executions of CLFLUSH, and executions of
CLFLUSHOPT. They are not ordered with respect to writes, executions of
CLFLUSH, or executions of CLFLUSHOPT that access other cache lines; to
enforce ordering with such an operation, software can insert an SFENCE
instruction between CFLUSHOPT and that operation."

I believe this indicates that an SFENCE (or a stronger MFENCE) is needed
to ensure all writes executed before the first CLFLUSHOPT are visible.
Successive CLFLUSHOPT instructions can then complete out of order and
may not have completed before any subsequently executed write becomes
visible. I think this risks the situation I described above where a
controlling update may be flushed before a flush of the data it
controls. Hence I think we require a post-sync MFENCE (or SFENCE).

Vol. 2A 3-146 CLWB

"The memory attribute of the page containing the affected line has no
effect on the behavior of this instruction. It should be noted that
processors are free to speculatively fetch and cache data from system
memory regions that are assigned a memory-type allowing for speculative
reads (such as, the WB, WC, and WT memory types). PREFETCHh instructions
can be used to provide the processor with hints for this speculative
behavior. Because this speculative fetching can occur at any time and is
not tied to instruction execution, the CLWB instruction is not ordered
with respect to PREFETCHh instructions or any of the speculative
fetching mechanisms (that is, data can be speculatively loaded into a
cache line just before, during, or after the execution of a CLWB
instruction that references the cache line).

"CLWB instruction is ordered only by store-fencing operations. For
example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed
instructions to ensure that previous stores are included in the
write-back. CLWB instruction need not be ordered by another CLWB or
CLFLUSHOPT instruction. CLWB is implicitly ordered with older stores
executed by the logical processor to the same address."

I think this is the same story as for CLFLUSHOPT, i.e. an SFENCE is
needed before any CLWB instructions are executed to ensure that pending
writes are visible; successive CLWB instructions can then complete out
of order and may not have completed before any subsequently executed
write becomes visible. So, I also think this may need an MFENCE (or
SFENCE) after the CLWB instructions.


AArch64:

The latest ARM ARM I have is issue C, available at

https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile

I am following libpmem in using the instruction dc(CVAC) to write back
the data cache lines (clean by virtual address to point of coherency). I
think it would be preferable if dc(CVAP) were used (clean by virtual
address to point of persistence). However:

  1) support for dc(CVAP) does not seem to be a mandatory part of the
architecture
  2) user-space access to the CVAP instruction is not guaranteed and is
not currently configured by default on Linux
  3) CVAC is normally expected to guarantee writeback to physical memory

Ordering of cache management instructions is specified in section
D3.4.8, under the heading "Ordering and completion of data and
instruction cache instructions" on page D3-2069, as follows:

"All data cache instructions, other than DC ZVA, that specify an address:

- Execute in program order relative to loads or stores that access an
address in Normal memory with either Inner Write Through or Inner Write
Back attributes within the same cache line of minimum size, as indicated
by CTR_EL0.DMinLine.

- Can execute in any order relative to loads or stores that access any
address with the Device memory attribute, or with Normal memory with
Inner Non-cacheable attribute unless a DMB or DSB is executed between
the instructions.

- Execute in program order relative to other data cache maintenance
instructions, other than DC ZVA, that specify an address within the same
cache line of minimum size, as indicated by CTR_EL0.DMinLine.

- Can execute in any order relative to loads or stores that access an
address in a different cache line of minimum size, as indicated by
CTR_EL0.DMinLine, unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to other data cache maintenance
instructions, other than DC ZVA, that specify an address in a different
cache line of minimum size, as indicated by CTR_EL0.DMinLine, unless a
DMB or DSB is executed between the instructions.

- Can execute in any order relative to data cache maintenance
instructions that do not specify an address unless a DMB or DSB is
executed between the instructions.

- Can execute in any order relative to instruction cache maintenance
instructions unless a DSB is executed between the instructions."

  . . .


"All data cache maintenance instructions that do not specify an address:

- Can execute in any order relative to data cache maintenance
instructions that do not specify an address unless a DMB or DSB is
executed between the instructions.

- Can execute in any order relative to data cache maintenance
instructions that specify an address, other than Data Cache Zero, unless
a DMB or DSB is executed between the instructions.

- Can execute in any order relative to loads or stores unless a DMB or
DSB is executed between the instructions.

- Can execute in any order relative to instruction cache maintenance
instructions unless a DSB is executed between the instructions.

All instruction cache instructions can execute in any order relative to
other instruction cache instructions, data cache instructions, loads,
and stores unless a DSB is executed between the instructions.

A cache maintenance instruction can complete at any time after it is
executed, but is only guaranteed to be complete, and its effects visible
to other observers, following a DSB instruction executed by the PE that
executed the cache maintenance instruction. See also the requirements
for cache maintenance instructions in Completion and endpoint ordering
on page B2-97.

In all cases, where the text in this section refers to a DMB or a DSB,
this means a DMB or DSB whose required access type is both loads and
stores."

The link to B2-97 provides some further clarification of what it means
for the above operations to be 'complete' (I have omitted the clauses
regarding completion of translation table walks and TLB invalidates).

"For all memory, the completion rules are defined as:
- A read R 1 to a Location is complete for a shareability domain when
all of the following are true:
  - Any write to the same Location by an Observer within the
shareability domain will be Coherence-after R 1 .
  - Any translation table walks associated with R 1 are complete for
that shareability domain.

- A write W 1 to a Location is complete for a shareability domain when
all of the following are true:
  - Any write to the same Location by an Observer within the
shareability domain will be Coherence-after W 1 .
  - Any read to the same Location by an Observer within the shareability
domain will either Reads-from W 1 or Reads-from a write that is
Coherence-after W 1 .
  - Any translation table walks associated with the write are complete
for that shareability domain.

  . . .

- A cache maintenance instruction is complete for a shareability domain
when the memory effects of the instruction are complete for that
shareability domain, and any translation table walks that arise from the
instruction are complete for that shareability domain.

  . . .

The completion of any cache or TLB maintenance instruction includes its
completion on all PEs that are affected by both the instruction and the
DSB operation that is required to guarantee visibility of the
maintenance instruction."

So, my reading of this is:

i) a DMB is needed to ensure that writes executed before the dc(CVAC)
are completed before the writeback starts

ii) a DMB(ISHST) is all that is needed to achieve this pre-ordering

iii) a DMB is needed to ensure that pending writebacks are completed
before any subsequently executed writes can become visible.

iv) a DMB(ISHST) is all that is needed to achieve this post-ordering
