On 4/29/20 2:41 AM, Maurizio Cimadamore wrote:
On 28/04/2020 21:27, Peter Levart wrote:
Hi,
The problem with current implementation of MemoryScope is that if a
child scope is frequently acquired and closed (which increments and
then decrements the parent scope counter atomically using CAS), and
that is performed from multiple concurrent threads, contention might
become prohibitive. And I think that is precisely what happens when a
parallel pipeline is such that it might short-circuit the stream:
final boolean forEachWithCancel(Spliterator<P_OUT> spliterator,
Sink<P_OUT> sink) {
boolean cancelled;
do { } while (!(cancelled = sink.cancellationRequested()) &&
spliterator.tryAdvance(sink));
return cancelled;
}
1st spliterators are created by trySplit (all of them inherit the
same MemoryScope) and then FJPool threads are busy concurrently
executing above method which calls tryAdvance for each element of the
particular spliterator which does the following:
public boolean tryAdvance(Consumer<? super MemorySegment>
action) {
Objects.requireNonNull(action);
if (currentIndex < elemCount) {
AbstractMemorySegmentImpl acquired = segment.acquire();
try {
action.accept(acquired.asSliceNoCheck(currentIndex * elementSize,
elementSize));
} finally {
acquired.closeNoCheck();
currentIndex++;
if (currentIndex == elemCount) {
segment = null;
}
}
return true;
} else {
return false;
}
}
... acquire/close at each call. If the Stream is played to the end
(i.e. it can't short-circuit), then forEachRemaining is used which
performs just one acquire/close for the whole remaining spliterator.
So for short-circuiting streams it might be important to have a
MemoryScope that is scalable. Here's one such attempt using a pair of
scalable counters (just one pair per root memory scope):
The current implementation has performances that are on par with the
previous acquire-based implementation, and also on par with what can
be achieved with Unsafe. We do have a micro benchmark in the patch
(see ParallelSum (**)) which tests this, and I get _identical_ numbers
even if I _comment_ the body of acquire/release - so that no
contention can happen; so, I'm a bit skeptical overall that contention
on acquire/release is the main factor at play here - but perhaps we
need more targeted benchmarks.
Right, summing is typically not a short-circuiting operation, so I bet
forEachRemaining is used in the leaf spliterators. FindAny or findFirst
might be the ones to test. I'll prepare a test and see what
experimenting with alternative MemoryScope can do...
(**) - your email caused me to look deeper at the ParallelSum
benchmark which, as currently written seems to favor Unsafe over the
MemorySegment API - but in reality, as I discovered, that is down to
an issue in the implementation of the unsafe spliterator, which
doesn't sum all the elements; I will fix the benchmark in an upcoming
iteration
So, while I'm open to suggestion as to how to reduce contention on the
acquire counter, I think we need more evidence that this is indeed an
issue (or the _main_ issue, when it comes to parallel computation).
That said, your implementation looks interesting - some questions
inline and also below:
answers inline...
import java.util.concurrent.atomic.LongAdder;
/**
* @author Peter Levart
*/
public abstract class MemoryScope {
public static MemoryScope create(Object ref, Runnable
cleanupAction) {
return new Root(ref, cleanupAction);
}
MemoryScope() {}
public abstract MemoryScope acquire();
public abstract void close();
private static class Root extends MemoryScope {
private final LongAdder enters = new LongAdder();
private final LongAdder exits = new LongAdder();
private volatile boolean closed;
private final Object ref;
private final Runnable cleanupAction;
Root(Object ref, Runnable cleanupAction) {
this.ref = ref;
this.cleanupAction = cleanupAction;
}
@Override
public MemoryScope acquire() {
// increment enters 1st
enters.increment();
// check closed flag 2nd
if (closed) {
exits.increment();
throw new IllegalStateException("This scope is
already closed");
}
return new MemoryScope() {
@Override
public MemoryScope acquire() {
return Root.this.acquire();
}
@Override
public void close() {
exits.increment();
Here -- don't you mean Root.this.exits? Otherwise Root.exists is gonna
remain != from Root.enters?
'exits' field is declared in Root only, so 'exits' in anonymous inner
class actually refers to Root.this.exits...
}
};
}
private final Object lock = new Object();
@Override
public void close() {
synchronized (lock) {
Why the lock? If we are here we're already in the owner thread - e.g.
it's not like multiple threads can call this at the same time. Or are
you trying to make the code more robust in the case a segment is
created w/o a confinement thread (e.g. via the unsafe API) ?
That lock is unnecessary. A leftover.
// modify closed flag 1st
closed = true;
// check for no more active acquired children 2nd
// IMPORTANT: 1st sum exits, then sum enters !!!
if (exits.sum() != enters.sum()) {
throw new IllegalStateException("Cannot close
this scope as it has active acquired children");
}
}
if (cleanupAction != null) {
cleanupAction.run();
}
}
}
}
This MemoryScope is just 2-level. The root is the one that is to be
created when the memory segment is allocated. A child is always a
child of the root and has no own children. So a call to
child.acquire() gets forwarded to the Root. The Root.acquire() 1st
increments 'enters' scalable counter then checks the 'closed' flag.
The child.close() just increments the 'exits' scalable counter. The
Root.close() 1st modifies the 'closed' flag then checks to see that
the sum of 'exits' equals the sum of 'enters' - the important thing
here is that 'exits' are summed 1st and then 'enters'. These
orderings guarantee that either a child scope is successfully
acquired or the root scope is successfully closed but never both.
I guess what you mean here is that, by construction, exits <= enters.
So, we first read exists, then we read enters - and there can be only
two cases:
* exits < enters, in which case it means some other thread has
acquired but not closed (possibly even *after* the call to exits.sum())
* exits == enters, in which case there's no pending acquire and we're
golden
While this all seems very clever there are some things that don't 100%
convince me - for instance, I note that `closed` will stay set even if
we later throw an ISE during close(). I suppose we *could* reset
closed = false in the throwing code path, but then there's a
possibility of having generated spurious ISE in MemoryScope::acquire
in the time span where `closed = true`.
Right, as you saw in a private Email, I did exactly that in a revised
version (posted below). The spurious ISE may happen but only when close
is called prematurely relative to all child scope close(s) that were or
are still active. So we could say the other way: if close was not called
prematurely, the ISE on acquire would not be spurious - so ordering of
close relative to later acquire was wrong anyway and the exception is an
"indication" of that wrong ordering
Unless we want to use close() to "probe" the scope whether it is still
active or not, I don't think this should present a problem.
In other words, one of the big realization of the current
synchronization mechanism behind acquire() is that if we fold the
"closed" state with the "count" state, we then have to worry only
about one access, which makes it easier to reason about the
implementation. Here it seems that races between updates to
exits/enters/closed would be possible, and I'm not sure we can fully
protect against those w/o adding more locks?
The only problem I see with (the revised) version is this spurious ISE
that might be thrown on acquire executed concurrently with premature
close. I see no other races. For any interesting readers, I'm posting
this revised version here so it doesn't look like I'm talking about some
secret:
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.LongAdder;
/**
* @author Peter Levart
*/
public abstract class MemoryScope {
public static MemoryScope create(Object ref, Runnable cleanupAction) {
return new Root(ref, cleanupAction);
}
boolean closed;
private static final VarHandle CLOSED;
static {
try {
CLOSED = MethodHandles.lookup().findVarHandle(Root.class,
"closed", boolean.class);
} catch (Throwable ex) {
throw new ExceptionInInitializerError(ex);
}
}
MemoryScope() {}
public abstract MemoryScope acquire();
public abstract void close();
public final boolean isAliveThreadSafe() {
return !((boolean) CLOSED.getVolatile(this));
}
public final boolean isAliveConfined() {
return !closed;
}
private static final class Root extends MemoryScope {
private final LongAdder acquires = new LongAdder();
private final LongAdder releases = new LongAdder();
private final Object ref;
private final Runnable cleanupAction;
private Root(Object ref, Runnable cleanupAction) {
this.ref = ref;
this.cleanupAction = cleanupAction;
}
@Override
public MemoryScope acquire() {
// increment acquires 1st
acquires.increment();
// check closed flag 2nd
if ((boolean) CLOSED.getVolatile(this)) {
releases.increment();
throw new IllegalStateException("This scope is already
closed");
}
return new Child();
}
@Override
public void close() {
if (closed) {
throw new IllegalStateException("This scope is already
closed");
}
// modify closed flag 1st
CLOSED.setVolatile(this, true);
// check for no more active acquired children 2nd
// IMPORTANT: 1st sum releases, then sum acquires !!!
if (releases.sum() != acquires.sum()) {
CLOSED.setVolatile(this, false); // undo before failing
throw new IllegalStateException("Cannot close this
scope as it has active acquired children");
}
if (cleanupAction != null) {
cleanupAction.run();
}
}
private final class Child extends MemoryScope {
private Child() { }
@Override
public MemoryScope acquire() {
return Root.this.acquire();
}
@Override
public void close() {
if (closed) {
throw new IllegalStateException("This scope is
already closed");
}
closed = true;
// following acts as a volatile write after plain write
above so
// plain write gets flushed too (which is important for
isAliveThreadSafe())
Root.this.releases.increment();
}
}
}
}
Maurizio
Regards, Peter
WDYT?
Regards, Peter
On 4/28/20 6:12 PM, Peter Levart wrote:
Hi Maurizio,
I'm checking out the thread-confinement in the parallel stream case.
I see the Spliterator.trySplit() is calling
AbstractMemorySegmentImpl's:
102 private AbstractMemorySegmentImpl asSliceNoCheck(long
offset, long newSize) {
103 return dup(offset, newSize, mask, owner, scope);
104 }
...so here the "owner" of the slice is still the same as that of
parent segment...
But then later in tryAdvance or forEachRemaining, the segment is
acquired/closed for each element of the stream (in case of
tryAdvance) or for the whole chunk to the end of spliterator (in
case of forEachRemaining). So some pipelines will be more optimal
than others...
So I'm thinking. Would it be possible to "lazily" acquire scope just
once in tryAdvance and then re-use the scope until the end?
Unfortunately Spliterator does not have a close() method to be
called when the pipeline is done with it. Perhaps it could be added
to the API? This is not the 1st time I wished Spliterator had a
close method. I had a similar problem when trying to create a
Spliterator with a database backend. When using JDBC API a separate
transaction (Connection) is typically required for each thread of
execution since several frameworks bind it to the ThreadLocal.
WDYT?
Regards, Peter
On 4/23/20 10:33 PM, Maurizio Cimadamore wrote:
Hi,
time has come for another round of foreign memory access API
incubation (see JEP 383 [3]). This iteration aims at polishing some
of the rough edges of the API, and adds some of the functionalities
that developers have been asking for during this first round of
incubation. The revised API tightens the thread-confinement
constraints (by removing the MemorySegment::acquire method) and
instead provides more targeted support for parallel computation via
a segment spliterator. The API also adds a way to create a custom
native segment; this is, essentially, an unsafe API point, very
similar in spirit to the JNI NewDirectByteBuffer functionality [1].
By using this bit of API, power-users will be able to add support,
via MemorySegment, to *their own memory sources* (e.g. think of a
custom allocator written in C/C++). For now, this API point is
called off as "restricted" and a special read-only JDK property
will have to be set on the command line for calls to this method to
succeed. We are aware there's no precedent for something like this
in the Java SE API - but if Project Panama is to remain true about
its ultimate goal of replacing bits of JNI code with (low level)
Java code, stuff like this has to be *possible*. We anticipate
that, at some point, this property will become a true launcher
flag, and that the foreign restricted machinery will be integrated
more neatly into the module system.
A list of the API, implementation and test changes is provided
below. If you have any questions, or need more detailed
explanations, I (and the rest of the Panama team) will be happy to
point at existing discussions, and/or to provide the feedback
required.
Thanks
Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary.html
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes
===========
* MemorySegment
- drop support for acquire() method - in its place now you can
obtain a spliterator from a segment, which supports divide-and-conquer
- revamped support for views - e.g. isReadOnly - now segments
have access modes
- added API to do serial confinement hand-off
(MemorySegment::withOwnerThread)
- added unsafe factory to construct a native segment out of an
existing address; this API is "restricted" and only available if
the program is executed using the -Dforeign.unsafe=permit flag.
- the MemorySegment::mapFromPath now returns a MappedMemorySegment
* MappedMemorySegment
- small sub-interface which provides extra capabilities for
mapped segments (load(), unload() and force())
* MemoryAddress
- added distinction between *checked* and *unchecked* addresses;
*unchecked* addresses do not have a segment, so they cannot be
dereferenced
- added NULL memory address (it's an unchecked address)
- added factory to construct MemoryAddress from long value
(result is also an unchecked address)
- added API point to get raw address value (where possible - e.g.
if this is not an address pointing to a heap segment)
* MemoryLayout
- Added support for layout "attributes" - e.g. store metadata
inside MemoryLayouts
- Added MemoryLayout::isPadding predicate
- Added helper function to SequenceLayout to rehape/flatten
sequence layouts (a la NDArray [4])
* MemoryHandles
- add support for general VarHandle combinators (similar to MH
combinators)
- add a combinator to turn a long-VH into a MemoryAddress VH (the
resulting MemoryAddress is also *unchecked* and cannot be
dereferenced)
Implementation changes
======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a
set of method handles (one for each access mode supported by the
var handle) that are lazily linked. This gives us a relatively
simple idea upon which to build support for custom var handle
adapters: we could create a VarHandle by passing an existing var
handle and also specify the set of adaptations that should be
applied to the method handle for a given access mode in the
original var handle. The result is a new VarHandle which might
support a different carrier type and more, or less coordinate
types. Adding this support was relatively easy - and it only
required one low-level surgery of the lambda forms generated for
adapted var handle (this is required so that the "right" var handle
receiver can be used for dispatching the access mode call).
All the new adapters in the MemoryHandles API (which are really
defined inside VarHandles) are really just a bunch of MH adapters
that are stitched together into a brand new VH. The only caveat is
that, we could have a checked exception mismatch: the VarHandle API
methods are specified not to throw any checked exception, whereas
method handles can throw any throwable. This means that,
potentially, calling get() on an adapted VarHandle could result in
a checked exception being thrown; to solve this gnarly issue, we
decided to scan all the filter functions passed to the VH
combinators and look for direct method handles which throw checked
exceptions. If such MHs are found (these can be deeply nested,
since the MHs can be adapted on their own), adaptation of the
target VH fails fast.
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First,
we have added support for retrieval of "mapped" properties
associated with a ByteBuffer (e.g. the file descriptor, etc.). This
is crucial if we want to be able to turn an existing byte buffer
into the "right kind" of memory segment.
Conversely, we also have to allow creation of mapped byte buffers
given existing parameters - which is needed when going from
(mapped) segment to a buffer. These two pieces together allow us to
go from segment to buffer and back w/o losing any information about
the underlying memory mapping (which was an issue in the previous
implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the
memory mapped supporting functionalities have been moved into a
common helper class so that MappedMemorySegmentImpl can reuse that
(e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In
this round we aimed at splitting the various implementation classes
so that we have a class for heap segments (HeapMemorySegmentImpl),
one for native segments (NativeMemorySegmentImpl) and one for
memory mapped segments (MappedMemorySegmentImpl, which extends from
NativeMemorySegmentImpl). Not much to see here - although one
important point is that, by doing this, we have been able to speed
up performances quite a bit, since now e.g. native/mapped segments
are _guaranteed_ to have a null "base". We have also done few
tricks to make sure that the "base" accessor for heap segment is
sharply typed and also NPE checked, which allows C2 to speculate
more and hoist. With these changes _all_ segment types have
comparable performances and hoisting guarantees (unlike in the old
implementation).
* Add workarounds in MemoryAddressProxy, AbstractMemorySegmentImpl
to special case "small segments" so that VM can apply bound check
elimination
This is another important piece which allows to get very good
performances out of indexes memory access var handles; as you might
know, the JIT compiler has troubles in optimizing loops where the
loop variable is a long [2]. To make up for that, in this round we
add an optimization which allows the API to detect whether a
segment is *small* or *large*. For small segments, the API realizes
that there's no need to perform long computation (e.g. to perform
bound checks, or offset additions), so it falls back to integer
logic, which in turns allows bound check elimination.
* renaming of the various var handle classes to conform to "memory
access var handle" terminology
This is mostly stylistic, nothing to see here.
Tests changes
=============
In addition to the tests for the new API changes, we've also added
some stress tests for var handle combinators - e.g. there's a flag
that can be enabled which turns on some "dummy" var handle
adaptations on all var handles created by the runtime. We've used
this flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wired
the new segment spliterator with the existing spliterator test
harness.
We have also added several micro benchmarks for the memory segment
API (and made some changes to the build script so that native
libraries would be handled correctly).
[1] -
https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newdirectbytebuffer
[2] - https://bugs.openjdk.java.net/browse/JDK-8223051
[3] - https://openjdk.java.net/jeps/383
[4] -
https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#numpy.reshape